ØMQ Blog

Building ØMQ and pyzmq on Red Hat - 24 May 2010 09:37 - by martin_sustrik - Comments: 2

Quite a few people have run into issues with building ØMQ on stable distributions of Linux, such as RHEL or CentOS. Given that build tools shipped with such distributions are stable (=old) they fail to build ØMQ out of the box. Blog by Ryan Duffield explains how to deal with the issue.

Read more....

To Trie or not to Trie - 18 May 2010 06:30 - by martin_sustrik - Comments: 0

Have you ever thought about what exactly happens when you subscribe for a topic using SUB socket? The messages have to be checked and those not matching the subscription(s) have to be dropped.

Currently, matching is done using an N-ary search tree, a structure known as trie.

Searching a trie is very fast. The search time is independent of overall number of topics or number of subscriptions. It's linearly dependent only on the length of the topic string in the message being matched. However, with large number of subscriptions the trie can consume more memory than alternative search structures such as hash table. Larger memory usage means that memory cache misses happen more often and slow the algorithm down.

Bhavin Turakhia did a research on trie optimisation and blogged about it.

Read more....

ØMQ/2.0.6 on OpenVMS - 12 May 2010 13:02 - by martin_sustrik - Comments: 3

BC&JA are pleased to announce that a binary version of 0MQ V2.0.6 has been released for Alpha and Integrity, OpenVMS 8.3 and higher…

Read more....

Zero-copy and Multi-part Messages - 08 May 2010 09:58 - by martin_sustrik - Comments: 17

In high performance networking copying data is considered harmful to performance and avoided as much as possible. The technique of avoiding all the copies is known as "zero-copy".

This article demonstrates the impact of single copy of the data on latency. It shows, for example, that for 256MB of data, single copy can increase latency by 0.1 second!

Obviously, data are copied from memory to network interface card and vice versa, they are copied on user space/kernel space boundary etc. This article in Linux Journal gives detailed explanation of what's going on under the hood of the operating system and what are the ways to get as close to the zero-copy as possible.

However, in this blog we are going to discuss only a single instance of copying the data, namely copying user data into ØMQ messages.

Consider the following example. We'll create a message million bytes long and copy the user data into it before sending:

zmq_msg_t msg;
zmq_msg_init_size (&msg, 1000000);
memcpy (zmq_msg_data (&msg), buffer, 1000000);
zmq_send (s, &msg, 0);

The memcpy part looks suspicious. We have the data in the buffer already so why not send the buffer itself instead of copying it to the message? Is ØMQ capable of such thing?

Actually, yes. It is and it has always been. All we have to do is to define deallocation function for the buffer and pass it to ØMQ along with the buffer:

void my_free (void *data, void *hint)
{
    //  We've allocated the buffer using malloc and
    //  at this point we deallocate it using free.
    free (data);
}

Once the deallocation function is defined we can create a "zero-copy" message and pass it the buffer and deallocation function:

zmq_msg_t msg;
void *hint = NULL;
zmq_msg_init_data (&msg, buffer, 1000000, my_free, hint);
zmq_send (s, &msg, 0);

Note that the buffer is now owned by the message. It will be deallocated once the message is sent. We must not deallocate the buffer ourselves!

Also note the hint parameter. It can be used if more complex allocation mechanism is used. Say we allocated the chunk using some "allocator" object and we have to deallocate it via the same object. In such case we can pass the pointer to allocator as a hint to zmq_msg_init_data and modify the deallocation function as follows:

void my_free (void *data, void *hint)
{
    ((allocator_t*) hint)->free (data);
}

We've got rid of the copying, right?

Well, not entirely. In some cases the above may work. In other cases it is insufficient.

Consider the case when we have two large matrices — each 100MB long — which we want to transfer. Unfortunately they are not contiguous in the memory. Each was allocated using separate malloc invocation and thus we cannot describe both using single data pointer.

Why not send them as two separate messages then? Consider say REQ socket. It load balances messages. In other words, if there are two REP sockets connected to it, sending two messages would result in first matrix being dispatched to one REP socket while the second to the other REP socket. This is not what we want. We want the two matrices to form an atomic unit of transfer. They should never be split apart.

It seems that in this case we need something equivalent to POSIX gather arrays. For those unfamiliar with Berkeley socket API, gather array is an array of data chunks that's sent to the networking stack using a single call.

But would that account for all possible scenarios?

There's still a scenario where it won't help. namely, when the two matrices don't exist at the same time. First one is created, sent and deallocated, then second one. In such case the gather array would be of no use. There's no single point in time when we own all the data and thus are able to fill in the gather array.

The new feature in ØMQ called "multi-part message" solves the problem. To put it simply, it allows you to concatenate multiple messages into a single message:

zmq_msg_t msg1;
zmq_msg_init_data (&msg1, matrix1, matrix1_size, my_free, NULL);
zmq_send (s, &msg, ZMQ_SNDMORE);
zmq_msg_t msg2;
zmq_msg_init_data (&msg2, matrix2, matrix2_size, my_free, NULL);
zmq_send (s, &msg, 0);

It looks almost exactly as if you were sending two separate messages except for passing ZMQ_SNDMORE flag to the first send. The flag says: "Hold on! There are more data going to be added to this message!"

The important point to note is that although all parts of the message are treated as a single atomic unit of transfer, the boundaries between message parts are strictly preserved. In other words, if you send a message consisting of two message parts, each 100 bytes long, on the other side you'll never receive a single message part 200 bytes long. Or two message parts, 50 and 150 bytes long. Or even four message parts, each 50 bytes long. You'll get exactly what you've sent — two message parts, each 100 bytes long in the same order as they were sent.

This fact allows for using multi-part messages for adding coarse-grained structure to your message. The example with two matrices illustrates the point. You send the two matrices as two message parts and thus avoid the copy. However, at the same time the matrices are cleanly separated, each residing in its own message part and you are guaranteed that the separation will be preserved even on the receiving side. Consequently you don't have to put matrix size into the message or invent any kind of "matrix delimiters".

Another interesting use of multi-part messages is to combine them with PUB/SUB sockets. Publish/subscribe messaging pattern allows for subscribing for particular subset of messages. Subscription is a chunk of data supplied by receiver, saying "please, send me all the messages beginning with these data":

zmq_setsockopt (s, ZMQ_SUBSCRIBE, "ABC", 3);

Obviously, sender has to place the appropriate data at the beginning of the message to make it delivered to the specific subscriber:

zmq_msg_t msg;
zmq_msg_init_size (&msg, 6);
memcpy (zmq_msg_data (&msg), "ABCxyz", 6);

The part of the message that is checked against the subscriptions is called topic. In our case the topic is "ABC".

When the topic is of variable length you need a delimiter to separate is from the rest of the message, so that subscription mechanism doesn't incidentally consider beginning of the data to be a continuation of the topic. Following example uses "pipe" symbol as delimiter:

zmq_msg_t msg;
zmq_msg_init_size (&msg, 7);
memcpy (zmq_msg_data (&msg), "ABC|xyz", 7);

While this works, it's a bit ugly. Even more importantly, if the topic happens to be binary data, there's no spare symbol we can use as the delimiter.

Elegant solution is to use a two-part message. Subscriptions are always evaluated only against the first message part, so we can place the topic into the first message part while the rest of the data into the second one (or even into several subsequent message parts):

zmq_msg_t topic;
zmq_msg_init_size (&topic, 3);
memcpy (zmq_msg_data (&topic, "ABC", 3);
zmq_send (s, &topic, ZMQ_SNDMORE);
zmq_msg_t value;
zmq_msg_init_size (&value, 3);
memcpy (zmq_msg_data (&value, "xyz", 3);
zmq_send (s, &value, 0);

One final remark. When receiving a message you may know that each message consists of two parts, say "topic" and "value". However, in other scenarios you may have no idea how many message parts there are in the message. In such case ØMQ allows you to ask the socket whether there are more message parts to be received or not. This is done using ZMQ_RCVMORE socket option:

zmq_recv (s, &msg, 0);
int64_t more;
size_t more_size = sizeof (more);
zmq_getsockopt (s, ZMQ_RCVMORE, &more, &more_size);
if (more) ...

If you want to become a ØMQ blogger, contact us.

All posts

07 Feb 2011 10:21: Installing ZeroMQ and Java bindings on Ubunu 10.10 comments.png 0
03 Feb 2011 09:11: Python Multiprocessing With ØMQ comments.png 1
01 Feb 2011 21:29: ØMQ And QuickFIX comments.png 3
23 Dec 2010 09:33: Message Oriented Network Programming comments.png 0
19 Dec 2010 13:03: ØMQ/2.1.0 available on OpenVMS comments.png 0
13 Dec 2010 12:46: Load Balancing In Java comments.png 1
10 Nov 2010 10:01: Building a GeoIP server with ZeroMQ comments.png 0
31 Oct 2010 11:44: ØMQ blog by Gerard Toonstra comments.png 0
13 Oct 2010 10:47: ØMQ - The Game comments.png 0
09 Sep 2010 07:45: ZeroMQ and Clojure, a brief introduction comments.png 0
03 Sep 2010 18:07: ZeroMQ: Modern & Fast Networking Stack comments.png 0
02 Sep 2010 13:30: September Meetups comments.png 0
01 Sep 2010 16:05: Multithreading Magic comments.png 0
01 Sep 2010 11:25: Not To Be Confused comments.png 2
27 Aug 2010 05:39: ØMQ for Clojure comments.png 0
17 Aug 2010 07:20: ZeroMQ: What You Need to Know Braindump comments.png 0
03 Aug 2010 10:46: RFC: ØMQ Contributions, Copyrights and Control comments.png 1
02 Aug 2010 19:54: New - ØMQ Labs! comments.png 0
26 Jul 2010 14:02: Video Introduction to ØMQ comments.png 3
13 Jul 2010 20:19: ØMQ available at CPAN comments.png 1
12 Jul 2010 12:22: Welcome, Objective-C :-) comments.png 0
10 Jul 2010 13:42: Few Blogposts on ØMQ comments.png 2
29 Jun 2010 08:19: Ruby Gem for ØMQ/2.0.7 available! comments.png 0
23 Jun 2010 09:27: ZeroMQ an introduction comments.png 1
18 Jun 2010 10:15: Mongrel2 Is "Self-Hosting" comments.png 0
08 Jun 2010 21:25: Internet Worldview in Messaging World comments.png 0
07 Jun 2010 16:13: Berlin Buzzwords 2010 comments.png 1
05 Jun 2010 10:24: Loggly Switches to ØMQ comments.png 4
04 Jun 2010 18:05: ØMQ/2.0.7 (beta) released comments.png 12
24 May 2010 09:37: Building ØMQ and pyzmq on Red Hat comments.png 2
18 May 2010 06:30: To Trie or not to Trie comments.png 0
12 May 2010 13:02: ØMQ/2.0.6 on OpenVMS comments.png 3
08 May 2010 09:58: Zero-copy and Multi-part Messages comments.png 17
07 Apr 2010 10:23: The Long and Winding Road Behind comments.png 2
06 Apr 2010 16:41: Historical Highlights comments.png 0

Posts by date