Introduction
The point of this test was to find out whether Linux kernel AIO (libaio) can solve the problem we had with synchronous I/O, namely that each write took 8.3 ms, i.e. the time for a full disk rotation to bring the target sector back under the head. At 7200 rpm a single revolution takes 60,000 ms / 7200 ≈ 8.3 ms, which matches the per-write latency we were seeing.
Scenario
The write cache on the disk is turned off. NCQ is turned on. The disk spins at 7200 rpm.
Data are written to a raw device (opened with O_DIRECT) in a linear manner. Each data chunk is 512 bytes long. The AIO queue is sized so that it can hold all the write requests at once. All the write requests are submitted as fast as possible, and the time of each submission is recorded. A separate thread waits for the write-completion notifications and records the time each one is received. Check the code below for the details.
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <signal.h>
#include <pthread.h>
#include <time.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <libaio.h>
#include <fstream>
#define BLOCK_COUNT 1000
uint64_t now_usecs ()
{
timeval tv;
int rc = gettimeofday (&tv, NULL);
assert (rc == 0);
return ((uint64_t) tv.tv_sec) * 1000000 + tv.tv_usec;
}
io_context_t aioctx;
uint64_t enqueue_times [BLOCK_COUNT];
uint64_t dequeue_times [BLOCK_COUNT];
void* dequeue_thread(void*)
{
int rc;
// Iterate through all the notifications
for (int block = 0; block != BLOCK_COUNT; block ++) {
// Get one notification
io_event event;
rc = io_getevents (aioctx, 1, 1, &event, NULL);
// Store a timestamp
dequeue_times [block] = now_usecs ();
// Check that exactly one event arrived and that the write succeeded
assert (rc == 1);
iocb *cb = (struct iocb*) event.obj;
assert (event.res2 == 0);
assert (event.res == cb->u.c.nbytes);
}
return NULL;
}
int main (int argc, char *argv [])
{
if (argc != 2) {
printf ("usage: aio <device name>\n");
return 1;
}
int rc;
// Initialise AIO queue
memset(&aioctx, 0, sizeof(aioctx));
rc = io_queue_init (BLOCK_COUNT, &aioctx);
assert (rc == 0);
// Open raw device for writing
int fd = open (argv [1], O_DIRECT | O_RDWR);
assert (fd != -1);
// Prepare the data to write; O_DIRECT requires a block-aligned buffer
char *buffer = NULL;
rc = posix_memalign ((void**) &buffer, 512, 512);
assert (rc == 0 && buffer);
memset (buffer, 'D', 512);
// Run the notification dequeueing thread
pthread_t worker;
rc = pthread_create (&worker, NULL, dequeue_thread, NULL);
assert (rc == 0);
// Now write BLOCK_COUNT blocks of 512 bytes as fast as possible
// Allocate and clear write context blocks
iocb cbs [BLOCK_COUNT];
memset (cbs, 0, sizeof (iocb) * BLOCK_COUNT);
for (int block = 0; block != BLOCK_COUNT; block ++) {
// Store a timestamp
enqueue_times [block] = now_usecs ();
// Enqueue the write request; the block number doubles as the linear offset
iocb *cb = cbs + block;
io_prep_pwrite (cb, fd, buffer, 512, (long long) block * 512);
rc = io_submit (aioctx, 1, &cb);
assert (rc == 1);
}
// Wait for notification dequeueing thread to finish
rc = pthread_join (worker, NULL);
assert (rc == 0);
// Close the device
rc = close (fd);
assert (rc != -1);
// Store the results
std::fstream enqueue ("enqueue.txt", std::fstream::out);
for (int block = 0; block != BLOCK_COUNT; block ++)
enqueue << enqueue_times [block] << std::endl;
std::fstream dequeue ("dequeue.txt", std::fstream::out);
for (int block = 0; block != BLOCK_COUNT; block ++)
dequeue << dequeue_times [block] << std::endl;
return 0;
}
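The test can be built and run roughly as follows; the file name aio.cpp and the device path are placeholders, the program needs root privileges to open the raw device, and it overwrites data at the beginning of that device:

g++ -O2 -o aio aio.cpp -laio -lpthread
sudo ./aio /dev/sdb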
Results
We've run the test with all the available I/O schedulers. For each scheduler the first graph shows when a write request was enqueued (red point) and when the notification was received (black point). The second graph shows the latency of each individual write, i.e. notification time minus enqueue time.
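The latency values for the second graph can be recovered from the two files the test produces. A minimal post-processing sketch, assuming enqueue.txt and dequeue.txt from the run above are in the current directory:

#include <cstdint>
#include <fstream>
#include <iostream>

int main ()
{
    // Each file holds one microsecond timestamp per line, in block order
    std::ifstream enq ("enqueue.txt"), deq ("dequeue.txt");
    uint64_t e, d;
    // Print the latency of each write: completion time minus submission time
    while (enq >> e && deq >> d)
        std::cout << d - e << std::endl;
    return 0;
}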
Noop scheduler
CFQ scheduler
Anticipatory scheduler
Deadline scheduler
Conclusion
Linux kernel AIO helps with the 8.3 ms issue. It does so by batching ~30 write requests into a single unit, so that each individual write doesn't require a separate disk rotation to get the head to the right position.
The impact of the I/O scheduling algorithm on this scenario is more or less zero. The only difference we've seen is that the deadline scheduler performs a bit better for the first few write requests.
There are two problems we've encountered:
- The enqueueing of writes seems to block every now and then, even though the AIO queue was large enough to hold all the requests in the test (see the sketch after this list for one way to spot these stalls in enqueue.txt).
- Although the 8.3 ms peak for each write was removed, the latency peaks for the first write in a batch are much higher than 8.3 ms (~30-60 ms).
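One way to see the submission stalls mentioned in the first point is to look at the gaps between consecutive timestamps in enqueue.txt. A minimal sketch; the 1 ms threshold is an arbitrary choice, not a value measured in this test:

#include <cstdint>
#include <fstream>
#include <iostream>

int main ()
{
    std::ifstream enq ("enqueue.txt");
    uint64_t prev, t;
    int block = 0;
    if (!(enq >> prev))
        return 1;
    // Report any submission that happened long after the previous one,
    // which suggests that io_submit blocked in between
    while (enq >> t) {
        block ++;
        if (t - prev > 1000)
            std::cout << "block " << block << ": gap " << (t - prev) << " us" << std::endl;
        prev = t;
    }
    return 0;
}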