Y-suite

WARNING: This text is deprecated and refers to an old version of ØMQ. It remains here for historical interest. DO NOT USE THIS TO LEARN ØMQ.

Introduction

Y-suite is a set of components designed for ultra-efficient passing of messages between threads within a process. Y-suite is somewhat similar to local sockets; however, it is much faster.

In version 0.1 of the ØMQ lightweight messaging kernel, the only y-suite component available is ypipe, a lock-free and wait-free implementation of a queue. In version 0.2, ypollset is added to allow a thread to exchange messages with several other threads at the same time (similar to the POSIX poll function). The component known as semaphore in version 0.1 is renamed to ysemaphore in version 0.2 to mark that it belongs to y-suite. In the same way, spipe is renamed to ysocketpair.

The source code is available on GitHub.

Design

The basic means of transferring messages between threads is ypipe. Messages are passed through the pipe in the standard write and read manner. Once the reader has no more messages to read from the pipe, it notifies the sender using passive synchronisation and goes to sleep. Passive synchronisation means that the other thread is not notified directly using some kind of asynchronous signal; rather, it will be notified once it tries to write the next message to the pipe. When this happens, the writer becomes aware that the reader is already asleep, or at least falling asleep at the moment. It knows that a new message is available, so it wakes the reader up using active synchronisation, i.e. by actively sending a wake-up event to the other thread. Active synchronisation is not provided by ypipe itself, but by other y-suite components, discussed below. Usage of ypipe is depicted in the following sequence diagram:

ysuite2.png

Note that ypipe (starting from version 0.2) allows writing messages one by one as well as in batches.
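To make the handshake concrete, here is a minimal, self-contained stand-in for a ypipe together with its wake-up mechanism. It is not the actual y-suite code: the real ypipe is lock-free and wait-free, whereas this toy uses a bounded ring buffer and a condition variable for the sleep/wake path, and all names (toy_pipe_t, send, recv, reader_asleep) are ours.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>

template <typename T, std::size_t N = 256>
class toy_pipe_t
{
public:
    toy_pipe_t () : head (0), tail (0), reader_asleep (false) {}

    //  Writer side: enqueue the item; if the reader has declared itself
    //  asleep in the meantime, wake it up (active synchronisation).
    //  Overflow checking (more than N unread messages) is omitted.
    void send (const T &item_)
    {
        std::size_t t = tail.load ();
        buffer [t % N] = item_;
        tail.store (t + 1);

        if (reader_asleep.exchange (false)) {
            std::lock_guard<std::mutex> lock (sync);
            cond.notify_one ();
        }
    }

    //  Reader side: if the pipe is empty, declare ourselves asleep
    //  (passive synchronisation) and wait for the writer's wake-up.
    T recv ()
    {
        std::size_t h = head.load ();
        while (h == tail.load ()) {
            std::unique_lock<std::mutex> lock (sync);
            reader_asleep.store (true);
            //  Re-check after declaring ourselves asleep, so that a
            //  message written in the meantime is not missed.
            if (h != tail.load ()) {
                reader_asleep.store (false);
                break;
            }
            cond.wait (lock, [&] { return h != tail.load (); });
            reader_asleep.store (false);
        }
        T item = buffer [h % N];
        head.store (h + 1);
        return item;
    }

private:
    T buffer [N];
    std::atomic<std::size_t> head, tail;
    std::atomic<bool> reader_asleep;
    std::mutex sync;
    std::condition_variable cond;
};
```

The essential point is visible in send(): the writer pays for active synchronisation only when the reader has actually declared itself asleep; while messages keep flowing, no wake-ups (and, in the real ypipe, no locks at all) are involved.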

As for active synchronisation, there are several components in y-suite that can be used to accomplish the task:

  • ysemaphore is the simplest of them. It can be used to send a single signal (there's no way to use several different signal types with ysemaphore) from one thread to another. Also, ysemaphore may be used to synchronise no more than two threads. On the other hand, it is the fastest active synchronisation mechanism, implemented using a light-weight system mutex rather than a heavy-weight system semaphore. Synchronisation takes at worst a few hundred nanoseconds.
  • ypollset is a bit slower than ysemaphore; however, it allows passing up to 31 different signals from several threads to a single thread.
  • ysocketpair allows sending up to 256 different signals from several threads to a single thread. Moreover, there's a file descriptor associated with the ysocketpair, so you can use it in a call to the POSIX poll function. This way you can wait for activity on a socket and a ypipe at the same time, which is not possible with either ysemaphore or ypollset.

Note that the signalling interface of ysemaphore, ypollset and ysocketpair is standardised, meaning that you can write generic algorithms that fire signals without actually knowing what kind of signalling mechanism is involved.
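As an illustration of such a standardised interface, the sketch below defines a minimal signaler interface and a ysocketpair-style implementation on top of a POSIX socketpair. The interface and class names are ours, not the actual y-suite declarations, but the mechanism (one byte written per signal, a file descriptor that can be handed to poll) is the one described above.

```cpp
#include <sys/socket.h>
#include <unistd.h>

//  Hypothetical common interface shared by all signalling mechanisms.
struct i_signaler
{
    virtual ~i_signaler () {}
    virtual void signal (int signal_) = 0;   //  fire signal 0..255
    virtual int wait () = 0;                 //  block until a signal arrives
};

//  ysocketpair-style signaler: each signal is a single byte written to a
//  socketpair; the read end can also be handed to poll () so the receiver
//  can wait for pipe activity and socket activity at the same time.
class socketpair_signaler : public i_signaler
{
public:
    socketpair_signaler ()
    {
        socketpair (AF_UNIX, SOCK_STREAM, 0, fds);
    }
    ~socketpair_signaler ()
    {
        close (fds [0]);
        close (fds [1]);
    }
    void signal (int signal_)
    {
        unsigned char byte = (unsigned char) signal_;
        (void) write (fds [1], &byte, 1);
    }
    int wait ()
    {
        unsigned char byte = 0;
        (void) read (fds [0], &byte, 1);     //  blocks until a signal is sent
        return byte;
    }
    int get_fd () const { return fds [0]; }  //  usable in a call to poll ()
private:
    int fds [2];
};
```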

The following sequence diagram shows the worst-case scenario with the ypipe/ypollset configuration.

The reader thread tries to read a message from the ypipe. It discovers that there are no messages in the pipe, so it uses the ypollset to wait for a signal. The ypollset checks the signals, finds out that no signal is available, and suspends the thread. Later on, the writer thread writes a message to the pipe and discovers that the pipe is 'dead', i.e. that the reader tried to read messages unsuccessfully and has to be woken up using active synchronisation, so it signals the ypollset. The ypollset finds out that the reader thread checked for signals unsuccessfully and went to sleep, so it wakes it up. The reader thread retrieves the signals and returns from the ypollset. It is then free to read the message from the ypipe.

ysuite1.png

Note that this is the worst case scenario. In most cases the whole process will be much simpler than this.
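The ypollset side of this exchange can be approximated as follows. The real ypollset is lock-free; this stand-in uses a mutex and a condition variable purely to show the semantics the walkthrough relies on: writer threads raise one of up to 31 signal bits, and a single reader either checks the set without blocking or suspends until at least one bit is raised. The class and method names are illustrative.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

class toy_pollset_t
{
public:
    toy_pollset_t () : pending (0) {}

    //  Called by any writer thread: raise signal number 0..30 and wake
    //  the reader if it is waiting.
    void signal (int signal_)
    {
        std::lock_guard<std::mutex> lock (sync);
        pending |= uint32_t (1) << signal_;
        cond.notify_one ();
    }

    //  Non-blocking check: return and clear the raised signals, if any.
    uint32_t check ()
    {
        std::lock_guard<std::mutex> lock (sync);
        uint32_t result = pending;
        pending = 0;
        return result;
    }

    //  Blocking wait: suspend the reader thread until at least one
    //  signal is raised, then return and clear the whole set.
    uint32_t poll ()
    {
        std::unique_lock<std::mutex> lock (sync);
        cond.wait (lock, [this] { return pending != 0; });
        uint32_t result = pending;
        pending = 0;
        return result;
    }

private:
    uint32_t pending;
    std::mutex sync;
    std::condition_variable cond;
};
```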

Performance

We've chosen local sockets as our baseline. A local socket on Linux is created using the socketpair function with the AF_UNIX (or AF_LOCAL) domain. We've tested each scenario using local sockets and afterwards the same scenario using y-suite.

The data passed as messages were simple pointers. As both sender and receiver reside in the same process, there's no need to physically marshal the data: a pointer valid for the sender will be valid for the receiver as well.
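For reference, this is roughly what the local-socket baseline looks like: a pointer is written into an AF_UNIX socketpair and read out on the other side. The helper names are ours and error handling is reduced to asserts.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

//  Send the pointer itself; both threads share the address space, so the
//  pointed-to data never has to be marshalled.
void send_ptr (int fd, void *ptr)
{
    ssize_t rc = send (fd, &ptr, sizeof ptr, 0);
    assert (rc == (ssize_t) sizeof ptr);
}

void *recv_ptr (int fd)
{
    void *ptr;
    ssize_t rc = recv (fd, &ptr, sizeof ptr, MSG_WAITALL);
    assert (rc == (ssize_t) sizeof ptr);
    return ptr;
}

int main ()
{
    int fds [2];
    int rc = socketpair (AF_UNIX, SOCK_STREAM, 0, fds);
    assert (rc == 0);

    int value = 42;
    send_ptr (fds [0], &value);                 //  "sender" side
    int *received = (int*) recv_ptr (fds [1]);  //  "receiver" side
    assert (received == &value);

    close (fds [0]);
    close (fds [1]);
    return 0;
}
```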

Configuration

All tests were run on an Athlon 64 X2 3800+ box, with Debian GNU/Linux 4.0 (kernel 2.6.22.6, CONFIG_PREEMPT_VOLUNTARY=y, CONFIG_PREEMPT_BKL=y, CONFIG_HZ=1000).

Performance with polling

'Polling' in these scenarios means that in the tests (although there are only two threads involved) the infrastructure - and the corresponding performance overhead - for the receiver thread to get messages from several sender threads is already in place. (Note that no special infrastructure is needed for a sender thread to send messages to several receiver threads - there just needs to be a separate pipe for each receiver thread.)

Our first scenario is the worst-case one. Specifically, the worst case tends to occur when a message is sent alone, with no preceding or subsequent messages. Firstly, this implies that the message cannot be batched (having no other messages to batch with) and therefore gets none of the performance benefit usually associated with batching. Secondly, if a message is sent alone, the receiver thread is asleep waiting for it, so the extra overhead of waking up the receiver thread has to be taken into consideration.

We've tested this scenario using a simple ping-pong test, bouncing a single message back and forth between two threads one million times. The test was run three times and the averages of the runs are presented as the results.
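A sketch of such a ping-pong measurement over the local-socket baseline is shown below. It is not the original benchmark code; whether a round trip is counted as one or two messages affects the per-message figure, so treat the accounting here as illustrative.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <thread>

static const int roundtrips = 1000000;

//  Peer thread: echo every pointer-sized message back to the sender.
static void echo (int fd)
{
    for (int i = 0; i != roundtrips; i++) {
        void *ptr;
        if (recv (fd, &ptr, sizeof ptr, MSG_WAITALL) != (ssize_t) sizeof ptr)
            return;
        if (send (fd, &ptr, sizeof ptr, 0) != (ssize_t) sizeof ptr)
            return;
    }
}

int main ()
{
    int fds [2];
    if (socketpair (AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return 1;

    std::thread peer (echo, fds [1]);

    int payload = 0;
    void *ptr = &payload;
    auto start = std::chrono::steady_clock::now ();
    for (int i = 0; i != roundtrips; i++) {
        if (send (fds [0], &ptr, sizeof ptr, 0) != (ssize_t) sizeof ptr)
            break;
        if (recv (fds [0], &ptr, sizeof ptr, MSG_WAITALL) != (ssize_t) sizeof ptr)
            break;
    }
    auto stop = std::chrono::steady_clock::now ();
    peer.join ();

    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds> (
        stop - start).count ();
    //  Two messages are transferred per round trip.
    std::printf ("%.0f ns/msg\n", (double) ns / (2.0 * roundtrips));

    close (fds [0]);
    close (fds [1]);
    return 0;
}
```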

transport         density        throughput
local sockets     6352 ns/msg    157,430 msgs/sec
y-suite           3555 ns/msg    281,294 msgs/sec

The second scenario is intended to test maximal throughput. Therefore, messages are pushed to the other thread as fast as possible. In this scenario, the implementation can take advantage of batching. Recall that ØMQ uses consumer-side batching, i.e. the sender sends messages one by one, while the receiver receives as many messages as are available at the moment in a single batch.
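Consumer-side batching can be illustrated with the following stand-in, where the receiver takes everything that has accumulated in one operation and therefore pays the synchronisation cost once per batch rather than once per message. A locked deque is used instead of the real lock-free ypipe purely for brevity; the class and method names are ours.

```cpp
#include <deque>
#include <mutex>

template <typename T>
class batching_queue_t
{
public:
    //  Producer: one message at a time.
    void send (const T &item_)
    {
        std::lock_guard<std::mutex> lock (sync);
        items.push_back (item_);
    }

    //  Consumer: everything that is currently available, in one go.
    std::deque<T> recv_batch ()
    {
        std::deque<T> batch;
        std::lock_guard<std::mutex> lock (sync);
        batch.swap (items);
        return batch;
    }

private:
    std::deque<T> items;
    std::mutex sync;
};
```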

transport         density        throughput
local sockets     1926 ns/msg    519,210 msgs/sec
y-suite           292 ns/msg     3,424,658 msgs/sec

Remark: Our local-socket test used MSG_WAITALL to wait for the message rather than a full-blown poll. We expect that the performance figures with a real poll would be even worse.

The last test we did was an experiment with sender-side batching. Although sender-side batching is generally harmful to latency (because the sender has to wait for several messages before sending them), there are special cases where it is useful. Specifically, in situations where the sender gets messages in batches (say, in the form of a network packet containing several messages), there's no point in pushing them to the pipe one by one.
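A corresponding sketch of sender-side batching: when the sender already holds a group of messages (for instance, parsed out of a single network packet), it pushes them into the pipe in one operation and would follow up with a single wake-up signal rather than one per message. As before, the locked deque and all names stand in for the real lock-free pipe.

```cpp
#include <deque>
#include <mutex>
#include <vector>

template <typename T>
class batched_sender_queue_t
{
public:
    //  Push a whole batch under a single lock acquisition; the caller
    //  would follow this with a single wake-up signal if needed.
    void send_batch (const std::vector<T> &batch_)
    {
        std::lock_guard<std::mutex> lock (sync);
        items.insert (items.end (), batch_.begin (), batch_.end ());
    }

    //  Receiver side: take one message; returns false if none is available.
    bool recv (T *item_)
    {
        std::lock_guard<std::mutex> lock (sync);
        if (items.empty ())
            return false;
        *item_ = items.front ();
        items.pop_front ();
        return true;
    }

private:
    std::deque<T> items;
    std::mutex sync;
};
```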

The test was run for batch sizes of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 messages. Each test was run three times and the averages are presented as the results.

The following graph shows density values for y-suite for different batch sizes:

density.png

The same values converted to throughputs:

throughput.png

It seems that using approximately 30 messages per batch yields the best results. Note that batching up to 8 messages is not worth doing because the overhead associated with batching makes it even slower than one-by-one message transfer.

Performance without polling

Often, business applications have no need for more than one client thread and one worker thread. In that case an optimisation for passing messages between exactly two threads, without doing any real polling, may yield better performance (actually, this is how version 0.1 of ØMQ works).
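A minimal sketch of such a two-thread wake-up object follows. The components section above states that the real ysemaphore is built on a plain light-weight mutex; because unlocking a mutex from a thread other than the one that locked it is not portable, this stand-in uses a condition variable to get the equivalent post/wait behaviour, and the class and method names are ours.

```cpp
#include <condition_variable>
#include <mutex>

class toy_semaphore_t
{
public:
    toy_semaphore_t () : posted (false) {}

    //  Called by the sender: wake the (single) waiting thread.
    void post ()
    {
        std::lock_guard<std::mutex> lock (sync);
        posted = true;
        cond.notify_one ();
    }

    //  Called by the receiver: block until the sender posts.
    void wait ()
    {
        std::unique_lock<std::mutex> lock (sync);
        cond.wait (lock, [this] { return posted; });
        posted = false;
    }

private:
    bool posted;
    std::mutex sync;
    std::condition_variable cond;
};
```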

To test this possibility, we've run the same tests with no polling involved - the sender thread wakes the receiver thread using a simple ysemaphore rather than a ypollset. The results of the tests follow (we've included the results of the previous tests for comparison):

Worst-case scenario:

transport                density        throughput
local sockets            6352 ns/msg    157,430 msgs/sec
y-suite (with poll)      3555 ns/msg    281,294 msgs/sec
y-suite (without poll)   3079 ns/msg    324,780 msgs/sec

Maximal throughput scenario:

transport                density        throughput
local sockets            1926 ns/msg    519,210 msgs/sec
y-suite (with poll)      292 ns/msg     3,424,658 msgs/sec
y-suite (without poll)   263 ns/msg     3,802,281 msgs/sec

The figures are a bit better; however, the difference is not big enough to make a specific optimisation for the exactly-two-threads scenario worth the trouble.

Conclusion

The y-suite implementation is almost seven times faster than local sockets in high-load scenarios. Even in the worst-case scenario it is almost twice as fast. Y-suite allows an optimisation for the scenario where only two threads are involved; however, the performance gain is not very high (about 10%).

As for batching messages on the sender side, it is inefficient up to 8 messages per batch and yields the best results for batches of 30-60 messages. Larger batches don't add enough throughput to compensate for the latency loss.
