Scaling on Multi-core Systems

Introduction

In preparation for version 0.2 of ØMQ, which is intended to support seamless scaling on multi-core boxes, we've run a couple of scaling tests with ØMQ version 0.1 in Intel's Low Latency Lab in London.

Environment

The test was run on two multi-core boxes (one with 8 cores, the other with 16 cores) connected by a direct 10 gigabit Ethernet link.

The test

The basic test consists of sending 1 million messages as fast as possible, repeated for different message sizes (1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 bytes). Several instances of the test are run in parallel (1, 2, 3 and 4 instances), each passing 1 million messages. The whole test was repeated three times and the results were averaged to give the figures presented below.
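
As an illustration, a minimal sketch of the sending loop of such a throughput test might look as follows. This is not the actual ØMQ 0.1 test code; send_message is a hypothetical placeholder standing in for the real send call, only the sending side is shown, and the timing logic is simplified.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    //  Stub standing in for the actual ØMQ 0.1 send call (hypothetical name);
    //  in the real test this would hand the message to the messaging layer.
    static void send_message (const void *data, std::size_t size)
    {
        (void) data;
        (void) size;
    }

    int main ()
    {
        const int message_count = 1000000;
        const std::size_t sizes [] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};

        //  For each message size, push 1 million messages as fast as possible
        //  and derive the throughput from the elapsed time.
        for (std::size_t s = 0; s != sizeof (sizes) / sizeof (sizes [0]); s++) {
            std::vector<char> buffer (sizes [s], 0);

            std::chrono::steady_clock::time_point start =
                std::chrono::steady_clock::now ();
            for (int i = 0; i != message_count; i++)
                send_message (&buffer [0], buffer.size ());
            std::chrono::steady_clock::time_point end =
                std::chrono::steady_clock::now ();

            double secs = std::chrono::duration<double> (end - start).count ();
            printf ("%d-byte messages: %.0f messages/sec\n",
                (int) sizes [s], message_count / secs);
        }
        return 0;
    }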

The test is intended to show the best possible scaling for ØMQ. To that end the parallel message streams are completely independent – for each stream there is a separate publishing client thread, a publishing worker thread, a separate socket, a separate receiving worker thread and a receiving client thread.

Therefore, each stream should be able to make do with two threads (CPU cores) on each side of the connection (one for the client thread and one for the worker thread).
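
A rough sketch of how the independent streams could be launched from a single process is shown below; the thread and function names are illustrative, not the actual test harness:

    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    //  Hypothetical body of one independent message stream. In the real test
    //  this corresponds to the publishing client thread; the stream's own
    //  worker thread accounts for the second core used per stream.
    static void client_thread (int stream_id)
    {
        printf ("stream %d: passing its 1 million messages\n", stream_id);
    }

    int main ()
    {
        const int stream_count = 3;

        //  Launch the streams with no shared state between them, so that
        //  ideally they do not contend with each other above the OS level.
        std::vector<std::thread> streams;
        for (int i = 0; i != stream_count; i++)
            streams.push_back (std::thread (client_thread, i));

        for (std::size_t i = 0; i != streams.size (); i++)
            streams [i].join ();
        return 0;
    }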

As we had a 16-core box on the publisher side and an 8-core box on the receiver side, we ran the tests for 1, 2, 3 and 4 message streams (2, 4, 6 and 8 threads on each side of the connection). More streams would mean contention for processor cores on the 8-core box. Also, given that the OS itself needs some CPU resources, we expected the run with 4 independent message streams to be unrepresentative due to contention for CPU cores between ØMQ and the OS. It proved to be so. For this reason we do not present the results for 4 message streams below.

Results

Here is the graph showing throughput in messages per second for different message sizes:

multicore1.png

The same results converted into megabits per second:

multicore2.png
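
For reference, the conversion from the first graph to the second is presumably the straightforward one: throughput in megabits per second = messages per second × message size in bytes × 8 / 1,000,000, not counting any protocol overhead. For example, 6 million 8-byte messages a second corresponds to roughly 384 Mb/s of payload.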

It should be noted that we also ran the tests for messages larger than 32 bytes. The results were more or less consistent with those for small messages. However, showing results up to 1024 bytes squeezes the interesting values (up to 32 bytes) into the left side of the graph and makes them unreadable, so we've opted to omit the results for larger messages from the graphs altogether.

Analysis

The nice thing is that we've been able to reach throughputs of over 6 million messages a second on the multi-core boxes.

However, our expectation – given that the individual streams are completely independent at the ØMQ level and that 10Gb Ethernet has enough bandwidth not to create contention – was that throughput would scale linearly with the number of message streams. We found that this is not the case.

For 8-byte messages, for example, a single stream transfers ~3 million messages a second. For two streams the throughput is ~5 million messages per second, and for three streams it grows to ~6 million.

With linear scaling we would expect 3 million for a single stream, 6 million for two streams and 9 million for three streams. There is obviously some contention below the ØMQ level.
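
Put differently, two streams deliver roughly 5/6 ≈ 83% of the ideal throughput and three streams roughly 6/9 ≈ 67%, so per-stream efficiency drops as streams are added.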

Possible contention points:

  • networking hardware
  • operating system (sockets implementation)
  • access to memory

Conclusion

ØMQ can be scaled to multi-core systems, yielding nice results (over 6 million messages a second on an 8-core box). However, we should work on improving the results further, as the non-linear scaling suggests that there is still some contention going on, creating a bottleneck for the message flow.