Linux kernel AIO test

Introduction

The point of this test was to find out whether Linux kernel AIO (libaio1) can solve the problem that we had with synchronous I/O, namely that each write took 8.3 ms, i.e. roughly one full disk rotation to bring the target sector back under the head.
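For reference, the synchronous baseline that exhibits this behaviour can be reproduced with a loop like the one sketched below. This program is not part of the test; the program name, the block count of 1000 and the use of posix_memalign are illustrative.

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

int main (int argc, char *argv [])
{
    if (argc != 2) {
        printf ("usage: syncio <device name>\n");
        return 1;
    }

    //  Open the raw device, bypassing the page cache
    int fd = open (argv [1], O_DIRECT | O_RDWR);
    assert (fd != -1);

    //  512-byte block aligned to a 512-byte boundary, as required by O_DIRECT
    void *buffer;
    int rc = posix_memalign (&buffer, 512, 512);
    assert (rc == 0);
    memset (buffer, 'D', 512);

    //  Write the blocks one by one and print the latency of each write
    for (int block = 0; block != 1000; block ++) {
        timeval start, end;
        gettimeofday (&start, NULL);
        ssize_t nbytes = pwrite (fd, buffer, 512, (long long) block * 512);
        assert (nbytes == 512);
        gettimeofday (&end, NULL);
        printf ("%lld us\n", (long long) (end.tv_sec - start.tv_sec) * 1000000 +
            (end.tv_usec - start.tv_usec));
    }

    close (fd);
    free (buffer);
    return 0;
}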

Scenario

Write cache on the disk is turned off. NCQ is turned on. The disk spins at 7200 rpm, so one full rotation takes approximately 8.3 ms.

Data are written to a raw device (using O_DIRECT) in a linear manner. Each data chunk is 512 bytes long. The AIO queue is sized so that it can hold all the write requests. All the write requests are posted as fast as possible and the time of each post is recorded. A separate thread waits for the write-completion notifications and records the time each one is received. Check the code below for the details.

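//  Build with something like the following (the file and binary names are
//  illustrative): g++ -o aiotest aiotest.cpp -laio -lpthread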
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <signal.h>
#include <pthread.h>
#include <time.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <libaio.h>
#include <fstream>

#define BLOCK_COUNT 1000

uint64_t now_usecs ()
{
    timeval tv;
    int rc = gettimeofday (&tv, NULL);
    assert (rc == 0);
    return ((uint64_t) tv.tv_sec) * 1000000 + tv.tv_usec;
}

io_context_t aioctx;

uint64_t enqueue_times [BLOCK_COUNT];
uint64_t dequeue_times [BLOCK_COUNT];

void* dequeue_thread(void*) 
{
    int rc;

    //  Iterate through all the notifications
    for (int block = 0; block != BLOCK_COUNT; block ++) {

        //  Get one notification
        io_event event;
        rc = io_getevents (aioctx, 1, 1, &event, NULL);
        assert (rc == 1);

        //  Store a timestamp
        dequeue_times [block] = now_usecs ();

        //  Check that the write was performed without any problems
        iocb *cb = (struct iocb*) event.obj;
        assert (event.res2 == 0);
        assert (event.res == cb->u.c.nbytes);
    }

    return NULL;
}

int main (int argc, char *argv [])
{
    if (argc != 2) {
        printf ("usage: aio <device name>\n");
        return 1;
    }

    int rc;

    //  Initialise AIO queue
    memset(&aioctx, 0, sizeof(aioctx));
    rc = io_queue_init (BLOCK_COUNT, &aioctx);
    assert (rc == 0); 

    //  Open raw device for writing
    int fd = open (argv [1], O_DIRECT | O_RDWR);
    assert (fd != -1);

    //  Prepare the data to write: a 512-byte block aligned to a 512-byte
    //  boundary, as required by O_DIRECT
    void *buffer;
    rc = posix_memalign (&buffer, 512, 512);
    assert (rc == 0);
    memset (buffer, 'D', 512);

    //  Run the notification dequeueing thread
    pthread_t worker;
    rc = pthread_create (&worker, NULL, dequeue_thread, NULL);
    assert (rc == 0);

    //  Now write BLOCK_COUNT blocks of 512 bytes as fast as possible

    //  Allocate and clear write context blocks
    iocb cbs [BLOCK_COUNT];
    memset (cbs, 0, sizeof (iocb) * BLOCK_COUNT);

    for (int block = 0; block != BLOCK_COUNT; block ++) {

        //  Store a timestamp
        enqueue_times [block] = now_usecs ();

        //  Enqueue the write request for the next 512-byte block
        iocb *cb = cbs + block;
        io_prep_pwrite (cb, fd, buffer, 512, (long long) block * 512);
        rc = io_submit (aioctx, 1, &cb);
        assert (rc == 1);
    }

    //  Wait for notification dequeueing thread to finish
    rc = pthread_join (worker, NULL);
    assert (rc == 0);

    //  Close the device
    rc = close (fd);
    assert (rc != -1);

    //  Store the results
    std::fstream enqueue ("enqueue.txt", std::fstream::out);
    for (int block = 0; block != BLOCK_COUNT; block ++)
        enqueue << enqueue_times [block] << std::endl;
    std::fstream dequeue ("dequeue.txt", std::fstream::out);
    for (int block = 0; block != BLOCK_COUNT; block ++)
        dequeue << dequeue_times [block] << std::endl;

    return 0;
}

Results

We've run the test with all the available I/O schedulers. For each scheduler the first graph shows when each write request was enqueued (red points) and when its completion notification was received (black points). The second graph shows the latency of each individual write, i.e. notification time minus enqueue time.
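The scheduler is selected per block device before each run by writing its name into sysfs. A minimal sketch of how this can be done from code follows; the helper name and the device name "sda" in the comment are illustrative, root privileges are required, and echoing the name into the file from a shell works equally well.

#include <fstream>
#include <string>

//  Select an I/O scheduler for a block device by writing its name into
//  /sys/block/<device>/queue/scheduler, e.g. set_scheduler ("sda", "deadline").
bool set_scheduler (const std::string &device, const std::string &scheduler)
{
    std::string path = "/sys/block/" + device + "/queue/scheduler";
    std::ofstream out (path.c_str ());
    if (!out)
        return false;
    out << scheduler << std::endl;
    return out.good ();
}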

Noop scheduler

noop.png
noop_lat.png

CFQ scheduler

cfq.png
cfq_lat.png

Anticipatory scheduler

anticipatory.png
anticipatory_lat.png

Deadline scheduler

deadline.png
deadline_lat.png

Conclusion

Linux kernel AIO helps with the 8.3 ms issue. It does so by batching roughly 30 write requests into a single unit, so that each individual write doesn't require waiting for the head to reach the right position.
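The test program submits the requests one at a time and relies on the kernel to merge them. For comparison, the control blocks could also be batched explicitly on the submission side by passing several of them to a single io_submit call. The helper below is a sketch of that idea and is not used in the test above.

#include <assert.h>
#include <libaio.h>

//  Submit 'count' control blocks that were already filled in by
//  io_prep_pwrite using a single io_submit call instead of one call per block.
void submit_batch (io_context_t ctx, iocb *cbs, int count)
{
    //  io_submit expects an array of pointers to the control blocks
    iocb **ptrs = new iocb* [count];
    for (int i = 0; i != count; i ++)
        ptrs [i] = cbs + i;
    int rc = io_submit (ctx, count, ptrs);
    assert (rc == count);
    delete [] ptrs;
}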

The impact of the I/O scheduling algorithm on this scenario is more or less zero. The only difference we've seen is that the deadline scheduler performs a bit better for the first few write requests.

There are two problems we've encountered:

  1. The enqueueing of writes (io_submit) seems to block every now and then, even though the AIO queue was large enough to hold all the requests in the test (see the timing sketch after this list).
  2. Although the 8.3 ms peak for each write was removed, the peaks for the first write in a batch are much higher than 8.3 ms (roughly 30-60 ms).
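To pin down the first problem, the io_submit call in the submission loop can be timed directly. The wrapper below is a sketch; the function name is hypothetical and the 1000 us threshold in the comment is arbitrary.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>
#include <libaio.h>

//  Submit a single control block and report whether the io_submit call
//  itself blocked for longer than 'threshold_us' microseconds,
//  e.g. timed_submit (aioctx, cb, 1000).
void timed_submit (io_context_t ctx, iocb *cb, uint64_t threshold_us)
{
    timeval start, end;
    gettimeofday (&start, NULL);
    int rc = io_submit (ctx, 1, &cb);
    gettimeofday (&end, NULL);
    assert (rc == 1);

    uint64_t elapsed = (uint64_t) (end.tv_sec - start.tv_sec) * 1000000 +
        (end.tv_usec - start.tv_usec);
    if (elapsed > threshold_us)
        printf ("io_submit blocked for %llu us\n",
            (unsigned long long) elapsed);
}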