Heartbeating and Keep-Alive

This page explores a pattern for client-server heartbeating using a ping-pong mechanism. The use cases and design are taken from the IETF's HyBi list. Please edit or comment as you like.

Use Case

A server manages a set of clients. It wants to know which clients are "online" and which are "offline" for network management reasons. An "online" client is one that has shown signs of life within a certain timeout. Clients similarly want to know whether the server is online, even if it is not otherwise sending data to them.

Basic Design Proposal

The basic pattern consists of a ping-pong dialog. One peer sends a PING command to the other, which replies with a PONG command. Neither command has any payload. Pings and pongs are not correlated. Since the roles of "client" and "server" are sometimes arbitrary, we specify that either peer can in fact send a ping and expect a pong in response. However, since the timeouts depend on network topologies known best to dynamic clients, it is usually the client which pings the server. We will assume this in the following design.

If we assume that the network can delay a ping by up to D seconds, and our timeout is T seconds (with T greater than D), the client sends a ping command every (T - D) seconds if it is not sending any other data. The client may in fact send pings together with other data; since ping commands are very small, this has no significant impact and usually simplifies client design.
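As a rough sketch of the timing rule, the class below tracks when the client last sent anything and reports when a ping is due. The PingScheduler name, its methods, and the injectable clock are all hypothetical, introduced only to make the (T - D) rule concrete.

```python
import time

class PingScheduler:
    """Decides when a quiet client should send a PING (illustrative helper)."""

    def __init__(self, timeout, max_delay, clock=time.monotonic):
        if max_delay >= timeout:
            raise ValueError("max_delay (D) must be smaller than timeout (T)")
        self.interval = timeout - max_delay  # T - D
        self.clock = clock
        self.last_sent = clock()

    def note_sent(self):
        """Call after sending any data to the peer, including a PING."""
        self.last_sent = self.clock()

    def ping_due(self):
        """True once the client has been quiet for at least T - D seconds."""
        return self.clock() - self.last_sent >= self.interval
```

With T = 10 and D = 2, for example, ping_due() becomes true 8 seconds after the last send. A client that prefers the simpler design mentioned above could ignore ordinary data and call note_sent() only after pings, which yields a fixed ping period.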

The client may continue to send pings even if it has not received pongs back from the server, since pongs can be delayed by the network and by other data being sent from the server. In both client and server, any incoming data should be treated as an indicator that the peer is "online".
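The "any incoming data counts" rule can be sketched as a small tracker that records when each peer was last heard from. The LivenessTracker class and its method names are assumptions for illustration; a real server would call on_data from its receive path for pings, pongs, and ordinary messages alike.

```python
import time

class LivenessTracker:
    """Tracks which peers have shown signs of life within the timeout (illustrative)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout  # T, in seconds
        self.clock = clock
        self.last_seen = {}     # peer id -> time of last incoming data

    def on_data(self, peer_id):
        """Call for every incoming message from the peer, whatever its type."""
        self.last_seen[peer_id] = self.clock()

    def is_online(self, peer_id):
        """A peer is online if it sent anything within the last T seconds."""
        seen = self.last_seen.get(peer_id)
        return seen is not None and self.clock() - seen <= self.timeout
```

A peer that has never sent anything is reported as offline, and a peer that goes quiet drops to offline once T seconds have passed without data.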

Alternatives

We've explored several alternatives in different drafts on rfc.zeromq.org. These are:

  • Unilateral "I'm alive" heartbeat commands, with no responses. These are extremely simple and require no state or counting in peers. Peers send heartbeats when they are otherwise quiet, and a peer that is quiet for longer than a certain timeout is marked "offline".
  • Ping-pong commands from client to server. These allow dynamic clients to define liveness. For some clients the timeout might be 1 second, for others it might be 30 seconds. This approach does not give servers a way to define liveness.