ØMQ Nanosecond-precision Test

Introduction

The original ØMQ (version 0.1) tests were done without measuring latency/density on a per-message basis, because the gettimeofday function is quite slow and measuring time for each message would seriously distort the results.

In this paper we focus on measuring the performance using the RDTSC instruction, available on many x86- and x86-64-based systems today.

First of all, we've measured the overhead of time measurement using the RDTSC instruction. The following table compares the overheads of gettimeofday and RDTSC on three different processors, each with a different CPU frequency:

CPU                          CPU frequency   gettimeofday   RDTSC
Intel(R) Core(TM)2 Duo CPU   1 GHz           1437 ns        60 ns
AMD Athlon(TM) 64 X2 3800+   2 GHz           1932 ns        7 ns
Intel(R) Pentium(R) 4        3 GHz           103 ns         33 ns
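
To give an idea of how the figures in the table can be obtained, here is a minimal sketch of such an overhead measurement. It assumes GCC or Clang on an x86/x86-64 system (__rdtsc comes from <x86intrin.h>) and a manually supplied CPU frequency (cpu_ghz); neither the structure nor the constant is part of the original test suite.

    //  Rough sketch of measuring the per-call overheads of gettimeofday
    //  and RDTSC.  cpu_ghz is an assumed CPU frequency used to convert
    //  processor ticks to nanoseconds.
    #include <cstdio>
    #include <sys/time.h>
    #include <x86intrin.h>

    int main ()
    {
        const int iterations = 1000000;
        const double cpu_ghz = 3.0;          //  assumed CPU frequency (GHz)

        //  Overhead of gettimeofday: time a tight loop of calls with RDTSC.
        unsigned long long start = __rdtsc ();
        timeval tv;
        for (int i = 0; i != iterations; i++)
            gettimeofday (&tv, NULL);
        unsigned long long end = __rdtsc ();
        printf ("gettimeofday: %.0f ns per call\n",
            (end - start) / (double) iterations / cpu_ghz);

        //  Overhead of RDTSC itself: time a tight loop of RDTSC reads.
        start = __rdtsc ();
        volatile unsigned long long dummy;
        for (int i = 0; i != iterations; i++)
            dummy = __rdtsc ();
        end = __rdtsc ();
        (void) dummy;
        printf ("rdtsc: %.0f ns per call\n",
            (end - start) / (double) iterations / cpu_ghz);

        return 0;
    }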

We've found that the overhead imposed by the RDTSC instruction is small enough not to distort the results to the point of making them irrelevant. Even at a throughput of 3 million messages a second, the time spent processing an individual message is 333 nanoseconds. On the Pentium 4, adding 33 nanoseconds of overhead for measuring send time and 33 nanoseconds for measuring receive time distorts the results by 20% (2 × 33 ns out of 333 ns). With decreasing message throughput the distortion becomes less significant (10% at 1.5 million messages a second). On the Athlon platform the distortion is almost negligible: 4.2% at 3 million messages a second and 2.1% at 1.5 million messages a second.

As for measurement precision, RDTSC measures time in processor ticks, meaning that the precision is 1 nanosecond on a 1 GHz processor, 1/2 nanosecond on a 2 GHz processor, 1/3 nanosecond on a 3 GHz processor, etc.

Remark: There are several drawbacks to RDTSC. Firstly, not all x86-based processors support the instruction. Secondly, on multi-core systems, timestamps may vary between individual cores, so it doesn't make sense to compare or do arithmetic with timestamps provided by two different cores. Thirdly, some processors may change their frequency on the fly as a power-saving measure, breaking the linear relationship between physical time and processor tick count.
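
The second drawback can be worked around by pinning the measuring thread to a single core, so that all timestamps come from the same TSC. A Linux-specific sketch (not part of the original test code; core 0 is chosen arbitrarily) might look like this:

    //  Pin the calling thread to a single core so that all RDTSC readings
    //  come from the same TSC.  Linux-specific; g++ on glibc defines
    //  _GNU_SOURCE by default, which exposes the CPU_* macros and
    //  sched_setaffinity.
    #include <sched.h>
    #include <cstdio>

    bool pin_to_core (int core)
    {
        cpu_set_t set;
        CPU_ZERO (&set);
        CPU_SET (core, &set);
        //  Pid 0 means the calling thread/process.
        return sched_setaffinity (0, sizeof set, &set) == 0;
    }

    int main ()
    {
        if (!pin_to_core (0))
            perror ("sched_setaffinity");
        //  ... run the RDTSC-based measurement here ...
        return 0;
    }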

Latency test

We've done the same latency test as in the original test suite. However, this time we've measured the latency of each individual message. The following graph shows the typical latency jitter:

rdtsc4.png

It can be seen that the ØMQ kernel, aside from adding almost no extra latency to the underlying transport (TCP), yields very deterministic latencies for individual messages. The average jitter with respect to the median latency is 0.225 microseconds and the standard deviation of the latency is 0.676 microseconds.
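
The jitter and standard deviation figures can be computed from the per-message latency samples along the following lines. This is just a sketch; the latencies vector is a stand-in for whatever storage the test harness actually uses:

    //  Given per-message latencies in microseconds, compute the average
    //  jitter with respect to the median and the standard deviation.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    void latency_stats (std::vector <double> latencies,
        double &jitter, double &stddev)
    {
        //  Median latency (the vector is passed by value so the caller's
        //  ordering is preserved).
        std::sort (latencies.begin (), latencies.end ());
        double median = latencies [latencies.size () / 2];

        //  Mean latency, needed for the standard deviation.
        double sum = 0;
        for (size_t i = 0; i != latencies.size (); i++)
            sum += latencies [i];
        double mean = sum / latencies.size ();

        //  Average absolute deviation from the median ('jitter') and
        //  standard deviation around the mean.
        double dev_sum = 0, var_sum = 0;
        for (size_t i = 0; i != latencies.size (); i++) {
            dev_sum += std::fabs (latencies [i] - median);
            var_sum += (latencies [i] - mean) * (latencies [i] - mean);
        }
        jitter = dev_sum / latencies.size ();
        stddev = std::sqrt (var_sum / latencies.size ());
    }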

Throughput test

We've done the same throughput test as in the original test suite. However, this time we've measured the density (time needed to process a message) for each individual message.
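
Conceptually, the per-message densities can be recorded by reading RDTSC after each send and storing the differences in raw ticks, converting them to nanoseconds only once the test is over so as not to add extra overhead. The sketch below makes that explicit; send_message is a placeholder standing in for the actual ØMQ send call, not its real API:

    //  Sketch of per-message density measurement on the sender side.
    //  Densities are kept in raw processor ticks during the test.
    #include <vector>
    #include <x86intrin.h>

    void send_message ();   //  placeholder for the actual ØMQ send call

    std::vector <unsigned long long> measure_send_density (int message_count)
    {
        std::vector <unsigned long long> densities;
        densities.reserve (message_count);

        unsigned long long last = __rdtsc ();
        for (int i = 0; i != message_count; i++) {
            send_message ();
            unsigned long long now = __rdtsc ();
            densities.push_back (now - last);    //  ticks spent on this message
            last = now;
        }
        return densities;
    }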

The following graph shows the typical density on the sender side:

rdtsc3.png

In most cases, sending a message takes approximately 140 nanoseconds. A few irregularly distributed peaks appear in the graph; we assume these are caused by the OS performing other work. The average density over the whole test of one million messages is 208 nanoseconds, which equals a throughput of 4.8 million messages a second.

On the receiver side, given the batched nature of the ØMQ transport, we would expect to see rapid message retrieval within a single batch, evenly interleaved with slower batch retrievals. This is indeed what we see in the graph:

rdtsc2.png

Getting a message from an already retrieved batch takes approximately 300 nanoseconds. Retrieving a new batch takes a few microseconds (3-6).

By counting the messages whose density is above 1 microsecond we should get a fairly accurate number of batches retrieved, and thus an estimate of the average messages-per-batch metric. (Recall that in ØMQ, batch size is not set explicitly; rather, it is controlled by the readiness of the consumer to get more data.) We've found that there were 10,768 densities above 1 microsecond in the 1 million message test, which equals approximately 100 messages per batch.
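
Given the recorded densities, the estimate boils down to counting samples above the threshold. In the sketch below, the 1 microsecond threshold is converted to ticks using an assumed CPU frequency (cpu_ghz), which has to be supplied for the machine under test:

    //  Estimate the average number of messages per batch by counting
    //  densities above 1 microsecond.  Densities are in processor ticks.
    #include <vector>

    double messages_per_batch (
        const std::vector <unsigned long long> &densities, double cpu_ghz)
    {
        //  1 microsecond expressed in processor ticks.
        unsigned long long threshold = (unsigned long long) (1000 * cpu_ghz);

        int batches = 0;
        for (size_t i = 0; i != densities.size (); i++)
            if (densities [i] > threshold)
                batches++;

        //  E.g. 10,768 densities above 1 us in a 1,000,000 message test
        //  yields roughly 100 messages per batch.
        return batches ? (double) densities.size () / batches : 0;
    }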

The mean density on the consumer side is 367 nanoseconds, which translates to a throughput of 2.7 million messages a second.

Latency and throughput combined

The combination of throughput and latency, something like 'at most 1 million messages a second with latency below 100 us', would be the most interesting metric to have. However, we believe that such a metric is almost impossible to obtain for message-queueing systems (as opposed to systems with strict flow control). The argument follows.

First, let's have a look at the latency graph for our 'throughput test':

rdtsc1.png

The ever-growing latency looks scary; however, it is a simple consequence of the sender being faster than the consumer. In that case messages cannot be consumed at a rate that keeps up with message production and therefore they have to be queued. Consequently, the latency of a message transfer is composed of both the actual time needed to transfer the message and the time spent in the queue. As the rate of production is greater than the rate of consumption, the queue grows all the time and each subsequent message has to spend more and more time in the queue waiting for all preceding messages to be consumed.

In our test the first message was transferred rapidly; however, the latency grew for every subsequent message until it ultimately reached 0.16 seconds for the millionth message.
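
This figure is roughly what the throughput results above predict. Neglecting the fixed per-message transfer time, the last of N = 1,000,000 messages waits approximately for the difference between the time the consumer needs to drain the whole stream and the time the producer needs to generate it (a back-of-the-envelope check using the 2.7 and 4.8 million messages a second rates measured earlier):

\[
t_{\mathrm{last}} \approx \frac{N}{R_{\mathrm{consumer}}} - \frac{N}{R_{\mathrm{producer}}}
 = \frac{10^6}{2.7 \cdot 10^6\,\mathrm{msg/s}} - \frac{10^6}{4.8 \cdot 10^6\,\mathrm{msg/s}}
 \approx 0.37\,\mathrm{s} - 0.21\,\mathrm{s} \approx 0.16\,\mathrm{s}
\]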

It looks tempting to deliberately slow down the message production so that the producer is slower than the consumer. In that case there would be no need to queue the messages and the latency figures would be accurate. While this is completely true, we should be aware that the throughput metric would then be utterly useless. By slowing down the producer, it becomes the bottleneck of the system and the overall throughput becomes equal to the message production rate, i.e. we wouldn't be measuring the throughput of ØMQ but rather the message production rate.

Finally, we could possibly tune the producing application so that its message production rate matches the maximal message consumption rate of the consuming application. However, even in this case there is a problem: we expect the performance figures in such a scenario to differ significantly between runs. The idea is that when production and consumption rates are precisely synchronised, even a small jitter on either side can cause the messaging to switch between queueing and non-queueing modes, yielding unpredictable, non-deterministic overall performance figures.

The solution to the problem, in our opinion, is to perform the latency and throughput tests separately. When measuring maximal throughput we should not care whether messages are queued or not, and therefore we shouldn't care about latency. When measuring latency, we should do so at a fixed message rate lower than the maximal consumption rate. Consider the latency test above. It should actually be interpreted like this: 'ØMQ is able to pass messages with 50 microsecond latency at a fixed message rate of 10,000 messages per second.' However, keep in mind that this statement in no way implies that the latency would be worse at 100,000 messages a second, a million messages a second or 10 million messages a second, or better at 1,000 messages a second, 100 messages a second or 1 message a second.
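
A fixed-rate latency test of this kind can be arranged by pacing the sender. The following sketch busy-waits on RDTSC between sends; as before, send_message is a placeholder for the actual send call and cpu_hz is an assumed CPU frequency:

    //  Send 'count' messages at a fixed rate (messages per second),
    //  busy-waiting on RDTSC to pace the sender.
    #include <x86intrin.h>

    void send_message ();   //  placeholder for the actual ØMQ send call

    void paced_send (int count, double rate, double cpu_hz)
    {
        //  Number of processor ticks between two consecutive sends.
        unsigned long long interval = (unsigned long long) (cpu_hz / rate);

        unsigned long long next = __rdtsc ();
        for (int i = 0; i != count; i++) {
            //  Busy-wait until the scheduled send time for this message.
            while (__rdtsc () < next);
            send_message ();
            next += interval;
        }
    }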

Conclusion

In a controlled environment the RDTSC instruction can be used to measure time rapidly. This allows us to measure latency/density for individual messages instead of computing averages over the whole test.

We've used this approach to get performance figures for the ØMQ lightweight messaging kernel (version 0.1) and we've obtained the following results:

  • In the low-volume case the latency is almost the same as the latency of the underlying transport (TCP): 50 microseconds.
  • The average jitter of the latency is minimal: 0.225 microseconds.
  • The throughput on the sender side is 4.8 million messages a second.
  • The density on the sender side is mostly about 0.140 microseconds; however, with the occasional peaks included, the mean density is 0.208 microseconds.
  • The throughput on the receiver side is 2.7 million messages a second.
  • The density on the receiver side is mostly about 0.3 microseconds. Approximately every 100 messages a new batch is retrieved, causing the density to grow to 3-6 microseconds. The mean density is 0.367 microseconds.