ØMQ (version 0.2) tests

Introduction

We performed ØMQ/0.2 tests using Intel's Low Latency Lab. The goal of the tests was to establish the performance figures, compare them with ØMQ/0.1 and find the bottlenecks of the system.

Summary

  1. Latency for small messages ~25 us (using Intel's I/OAT).
  2. Throughput for a single message stream ~2.6 megamessages/second.
  3. Cumulative throughput for multiple message streams on an 8-way box ~4 megamessages/second.
  4. We are able to exhaust 10GbE with messages 512 bytes long.
  5. AMQP throughput for a single message stream ~700,000 messages/second.
  6. Performance of ØMQ/0.2 matches that of the much simpler ØMQ/0.1.

Environment

Box 1
Stoakley/Harpertown, 8 cores (2 sockets, 4 cores each), 3.2 GHz; 1600 MHz bus; 6 MB L2 cache; Intel's 10GbE "Oplin" NIC with I/OAT support (PCI-E 4x)
Box 2
Caneland/Tigerton, 16 cores (4 sockets, 4 cores each), 2.93 GHz; 1066 MHz bus; 4 MB L2 cache; Intel's 10GbE "Oplin" NIC with I/OAT support (PCI-E 4x)
Connection
The boxes were connected directly, without a switch in the middle.

Latency

Latency was tested by sending a single message back and forth many times and computing the average transport time. Interrupt throttling in the NIC was switched off.
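
To make the methodology concrete, here is a minimal sketch of such a ping-pong measurement over a plain TCP socket (an illustration only, not the actual ØMQ test harness; the peer address, port and message size are made-up example values, error handling is omitted, and the peer is assumed to simply echo each message back):

    //  Minimal sketch of the ping-pong latency measurement over a plain TCP
    //  socket. The peer at the other end is assumed to echo each message back.
    //  Address, port and sizes are example values; error handling is omitted.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main ()
    {
        const size_t msg_size = 64;         //  message size in bytes
        const int roundtrips = 100000;      //  round trips to average over

        //  Connect to the echoing peer. Nagle's algorithm is switched off
        //  so that small messages are not delayed by the TCP stack.
        int s = socket (AF_INET, SOCK_STREAM, 0);
        int flag = 1;
        setsockopt (s, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_port = htons (5555);
        inet_pton (AF_INET, "192.168.0.2", &addr.sin_addr);
        connect (s, (sockaddr*) &addr, sizeof addr);

        std::vector<char> buf (msg_size, 0);
        auto start = std::chrono::steady_clock::now ();
        for (int i = 0; i != roundtrips; i++) {
            //  Send the message and wait for the echo to come back.
            send (s, buf.data (), msg_size, 0);
            size_t got = 0;
            while (got < msg_size)
                got += recv (s, buf.data () + got, msg_size - got, 0);
        }
        auto end = std::chrono::steady_clock::now ();

        //  One-way latency is half of the average round-trip time.
        double us = std::chrono::duration<double, std::micro> (
            end - start).count ();
        printf ("average one-way latency: %.2f us\n", us / roundtrips / 2);

        close (s);
        return 0;
    }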

First, we measured ØMQ latencies for different message sizes. The sizes used were powers of two (1, 2, 4, …, 65536 bytes). The following graph shows the results. Note that the X axis uses a logarithmic scale, so that results for small message sizes are well spaced out rather than crowded together on the left-hand side of the graph:

tests02-1.png

We found that latencies for messages below 4 kB are quite reasonable (under 50 microseconds) and that even messages up to 64 kB stay well below the 1 millisecond threshold (0.3 milliseconds).

Next, we were interested in whether ØMQ scales linearly with increasing message size. We therefore plotted the same results as above using a linear scale instead of a logarithmic one. The red curve on the graph below is straight rather than curved, meaning that ØMQ scales linearly with growing message size, i.e. latency is proportional to the number of bytes in the message.

The black curve represents latencies for raw TCP transport. TCP is linear as well with respect to message size. Note that its slope is less steep than ØMQ's, meaning that ØMQ's overhead over TCP grows linearly with message size. It would be possible for both curves to have the same slope (i.e. for ØMQ to add only a constant overhead on top of the underlying TCP) using zero-copy techniques; however, zero-copy would mean no message batching and consequently much lower throughput. In the future we may consider using zero-copy for large messages, where batching is irrelevant anyway.

tests02-2.png
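
Put differently, both curves are well described by a simple linear model (a sketch of the relationship rather than a fitted formula), where s is the message size in bytes, t_fixed the constant per-message cost and t_byte the per-byte cost:

    \text{latency}(s) \approx t_{\text{fixed}} + t_{\text{byte}} \cdot s

ØMQ's additional copying shows up as a larger t_byte than raw TCP's, which is why its curve climbs faster; zero-copy would make the per-byte costs equal, leaving only a constant difference between the two curves.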

Throughput

To test ØMQ/0.2 throughput, we established a continuous flow of messages between the two boxes. On one box, an application thread was creating and sending messages as fast as possible while a second thread (poll_thread) did the actual sending. On the other box, one thread (poll_thread) was reading messages and passing them to the application thread. Throughput was measured in the receiving application thread, so that any bottleneck along the message path has an impact on the metric.

The interrupt throttle rate on the NIC was set to at most 8,000 interrupts per second. ØMQ buffers were set to 8 kB.
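
To illustrate how the figure is obtained, here is a minimal sketch of the receiving side described above, implemented over a plain TCP stream (again an illustration only, not the actual ØMQ test harness; the port, message size and message count are made-up example values and error handling is omitted). The sender is assumed to push the agreed number of fixed-size messages as fast as it can:

    //  Minimal sketch of the receiving side of the throughput test over a
    //  plain TCP stream. The sender is expected to push 'message_count'
    //  fixed-size messages as fast as it can. Port and sizes are example
    //  values; error handling is omitted.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    static void recv_full (int s, char *buf, size_t size)
    {
        //  TCP is a byte stream, so a single recv may return a partial message.
        size_t got = 0;
        while (got < size)
            got += recv (s, buf + got, size - got, 0);
    }

    int main ()
    {
        const size_t msg_size = 64;
        const long message_count = 1000000;

        //  Accept a single connection from the sending box.
        int listener = socket (AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons (5556);
        bind (listener, (sockaddr*) &addr, sizeof addr);
        listen (listener, 1);
        int s = accept (listener, nullptr, nullptr);

        std::vector<char> buf (msg_size);

        //  Start the clock after the first message so that connection setup
        //  is not part of the measurement.
        recv_full (s, buf.data (), msg_size);
        auto start = std::chrono::steady_clock::now ();
        for (long i = 1; i != message_count; i++)
            recv_full (s, buf.data (), msg_size);
        auto end = std::chrono::steady_clock::now ();

        //  message_count - 1 messages were received while the clock was running.
        double secs = std::chrono::duration<double> (end - start).count ();
        printf ("throughput: %.3f Mmsg/s (%.1f Mb/s of message content)\n",
            (message_count - 1) / secs / 1e6,
            (message_count - 1) * msg_size * 8 / secs / 1e6);

        close (s);
        close (listener);
        return 0;
    }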

Throughputs for different message sizes (in megamessages per second) are shown in the graph below. The message sizes used were powers of two, starting with messages 1 byte long and ending with messages 64 kB long:

tests02-3.png

The strange thing about the graph is that throughputs for messages 1 and 2 bytes long are lower than expected. The values may simply be distorted by external influences, or they may be caused by internal ØMQ issues. In any case, we don't consider this significant, as large flows of 1- and 2-byte messages are seen very rarely, if ever.

The following graph shows the results of the same tests converted into megabits of the message content transported per second:

tests02-4.png

As can be seen, ØMQ (in single app thread & single worker thread configuration) is not able to use more than 3 Gb/s for messages up to 64 kB.

Because of methodological problems associated with the messages-per-second metric, we prefer to use a metric that we call density. Density is the average time interval between two subsequent messages. Have a look at the results of the tests expressed in terms of density:

tests02-5.png

Ignoring the results for messages 1 and 2 bytes long, we see that small messages arrive at the receiving application at a rate of approximately one message every 300 nanoseconds.
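
The conversion between the two metrics is a simple reciprocal; as a worked example using the figure above:

    \text{density} = \frac{1}{\text{throughput}}, \qquad \frac{1}{300\ \text{ns}} \approx 3.3 \times 10^{6}\ \text{messages/second}

so one message every 300 nanoseconds corresponds to roughly 3.3 megamessages per second at the receiving side.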

Scaling on multiple cores

We did scaling tests on multicore boxes for ØMQ/0.1 before. We found that the scaling is not linear, i.e. two fully independent and parallel streams of messages don't yield twice the throughput of a single stream. Therefore we repeated the test for ØMQ/0.2:

tests02-6.png

The scaling has not improved: it is still not linear for small messages. The scaling for large messages can be seen better on the megabits/second graph:

tests02-7.png

Even the scaling for large messages is not linear. For 64 kB messages, say, we are able to use 3 Gb/s with a single stream and 4.5 Gb/s with two streams. (If scaling were linear, we would expect 6 Gb/s for two streams.) The results are consistent with the tests done for version 0.1. It is our intent to investigate the issue and find the bottleneck.

As for total bandwidth usage, recall that we used 10Gb Ethernet plugged into 4x PCI-E slots and thus expected a maximal throughput of approximately 7 Gb/s. We found that the maximal throughput is actually 5.6 Gb/s. Additional tests showed that the bottleneck is the network itself. Possibly, the bandwidth overhead of the OSI stack is higher than the 10% we assumed when estimating the 7 Gb/s figure.

In any case, the test showed that we are able to exhaust the 10Gb/4x network with messages of 512 bytes or larger.

As you would expect with non-linear scaling, densities computed for individual streams show that the more streams are used, the worse the density of each individual stream. For example, for messages of 32 bytes, density is one message every 400 nanoseconds when a single stream is used. When two streams are used, density drops to one message every 600 nanoseconds.

tests02-8.png

Impact of buffer size on throughput

The ØMQ buffer size specifies the maximum size of a message batch. A message batch is a chunk of data that is passed to the OS (socket) using a single API invocation. The boundaries of the batch don't have to correspond to the boundaries of the messages: a large message can be sent in several batches, and many small messages may be contained in a single batch.

Buffer sizes (there are actually two of them, one for reading and one for writing) are a very important performance tuning parameter. In principle, a large buffer means higher throughput and higher latencies, whereas a small one means lower throughput and better latencies. Note, however, that a large buffer doesn't necessarily make latency worse. Because ØMQ uses consumer-driven batching, batching is terminated as soon as the consumer asks for new data, so the batch may be significantly smaller than the buffer size. It can be said that a large buffer size raises the worst-case latency, although you may never experience the worst case in the real world.
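
To illustrate the idea, here is a highly simplified sketch of sender-side batching (our own illustration, not ØMQ's actual implementation): queued small messages are packed into a single buffer, capped at the buffer size, and the whole batch is handed to the OS with one send() call. If the queue drains before the buffer is full, the smaller batch is sent straight away, which is the consumer-driven behaviour described above:

    //  Simplified sketch of sender-side batching: queued small messages are
    //  copied into one buffer (capped at 'buffer_size' bytes) and handed to
    //  the OS with a single send() call. For simplicity only whole messages
    //  are packed here; in the real design a message may also span batches.
    #include <sys/socket.h>
    #include <deque>
    #include <string>
    #include <vector>

    static void send_batch (int fd, std::deque<std::string> &pending,
        size_t buffer_size)
    {
        std::vector<char> batch;
        batch.reserve (buffer_size);

        //  Take as many whole messages as fit into the buffer. If the queue
        //  empties first, the batch is simply smaller than the buffer.
        while (!pending.empty () &&
               batch.size () + pending.front ().size () <= buffer_size) {
            const std::string &msg = pending.front ();
            batch.insert (batch.end (), msg.begin (), msg.end ());
            pending.pop_front ();
        }

        //  One system call for the whole batch instead of one per message.
        if (!batch.empty ())
            send (fd, batch.data (), batch.size (), 0);
    }

    int main ()
    {
        //  socketpair stands in for a real TCP connection between the boxes.
        int fds [2];
        socketpair (AF_UNIX, SOCK_STREAM, 0, fds);

        std::deque<std::string> pending = {"msg-1", "msg-2", "msg-3"};
        send_batch (fds [0], pending, 8192);    //  8 kB buffer, as in the tests
        return 0;
    }

With 6-byte messages and an 8 kB buffer, a full batch can in principle carry on the order of a thousand messages per system call, which is why small-message throughput benefits so dramatically from batching.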

The following graph shows how throughput is affected by different buffer sizes. The test was performed for messages 6 bytes long. Buffer sizes used were powers of two starting with 8 bytes and ending with 8 kB:

tests02-9.png

We can conclude that very small buffer sizes have a severe impact on throughput. By growing the buffer you can achieve much higher throughputs; however, there is a limit to the improvement. In this case we see that increasing the buffer beyond 2 kB does not improve throughput. Obviously, to find the best batching size for your application, you should run this kind of test in your specific environment.

AMQP

We haven't fully tested the performance of the AMQP engine so far; however, simple tests done on commodity hardware show a throughput of approximately 700,000 messages (6 bytes long) per second.

Conclusion

Although a whole new level of complexity was added in ØMQ/0.2 compared to ØMQ/0.1, the performance figures stayed almost the same. We consider this a validation of the ØMQ/0.2 design.