Requirements For Reliability

One of the main questions that AMQP users have when coming to 0MQ is how we do reliability.

So I'm collecting requirements for reliability on top of 0MQ. My idea is to build this as an application layer on top of 0MQ, using a different API and a separate protocol that sits on top of the 0MQ SPB framing.

Here is a sketch of what seems to be the simplest design for fire-and-forget reliability:

  • It covers only one-to-one delivery of messages (e.g. for trade execution or request-response).
  • The publisher delivers a message to a local dispatcher service (perhaps based on a 0MQ device).
  • The dispatcher queues the message on persistent storage and the publisher continues.
  • The dispatcher in a second thread delivers messages to their destination.
  • Recipients reply to messages with an acknowledgement.
  • Recipients store received messages and if they get a duplicate, they discard it but resend the acknowledgement.
  • The dispatcher marks acknowledged messages, and ignores duplicate acknowledgements.

In the most brutal implementation, both sides use a light database to hold messages, and store everything. There are more intelligent ways to hold persistent queues but that's an implementation detail.

The above design covers crashes in the publisher, consumer, and network afaics. Throughput would depend on message size, since messages can easily be batched and acknowledged in blocks.

Has anyone been implementing such reliability on top of 0MQ already?