Measuring messaging performance

# Introduction

Performance measurement is done for two basic reasons.

Firstly, we want figures for marketing our messaging system. Secondly, we need a diagnostic tool to help us eliminate performance problems during product development. As for marketing figures, we want to be able to say, for example, that we can transport half a million 512-byte messages a second with latency of up to 100 microseconds.

However, it would be even better to supply more information about the behaviour of the system: Are there peaks in the latency? If so, how often do they occur? How high are they? The same goes for throughput: Are there peaks? How often? How high?

We would also like to show how the performance changes with respect to message size: What are the performance figures for 1-byte messages? What are the numbers for 1 MB messages?
We would like to demonstrate how CPU power affects the system: What's the performance with a single-core processor? What is it like with a dual-core one? How does processor frequency affect the performance? Does messaging use 100% of the processor time?

It would be nice to measure the impact of various network architectures as well: What's the difference between a 1Gb network and a 100Mb one? What about InfiniBand? How does the system behave when the loopback interface is used instead of a regular network connection?

As for diagnostics during development, a different kind of performance measurement is needed. Rather than computing aggregate figures, we are interested in the whole latency and throughput curves. We are not interested merely in whether there are peaks in latency; we are interested in where exactly the peaks occur. At the beginning of the test? When the load increases? Is the latency peaking just for the last few messages? Etc.

# General principles

1. The system we are measuring is discrete. Messages leave and arrive at a single point in time (at least that's how we measure it). Therefore we cannot assume that the latency and throughput curves are continuous – they are discontinuous in principle. As a consequence, we cannot measure latency or throughput for a specific point in time; it is undefined there. However, we can measure latency or throughput for a specific message or messages.
2. The clocks on different physical boxes are not synchronised. We cannot assume that the clocks on different machines are synchronised with the precision we need (microseconds or even nanoseconds). Therefore we have to treat time values measured on different boxes as being measured in different units, i.e. we cannot subtract a time measured on one box from a time measured on a different box to get a timespan – that kind of operation is undefined. However, timespans measured on different boxes are compatible, and we can perform any kind of operation on them (addition, subtraction, etc.).

# Terminology

| Symbol | Meaning |
| --- | --- |
| N | Number of messages in the test |
| On | Time when the n-th message was sent |
| In | Time when the n-th message was received |
| Ln | Latency of the n-th message (in seconds) |
| SOn | Sending overhead for the n-th message (in seconds) |
| ROn | Receiving overhead for the n-th message (in seconds) |
| STn | Throughput at the time the n-th message is being sent (in messages per second) |
| RTn | Throughput at the time the n-th message is being received (in messages per second) |
| SCn | CPU usage on the sender at the time the n-th message is being sent (in %) |
| RCn | CPU usage on the receiver at the time the n-th message is being received (in %) |
| BCn | CPU usage on the broker at the time the n-th message is being processed (in %) |

# Model

Because the clocks on different boxes are unsynchronised, we have to measure both On and In on the same box, i.e. both the sender and the receiver have to run on the same machine.

To be able to measure performance with actual network involved we therefore have to add a middle box ("broker") to the model.

The testing application should therefore work like this:

1. On and SCn are measured and the message is sent by the sender to the broker
2. BCn is measured and the message is sent by the broker to the receiver (in case the broker is non-trivial we can also measure the time spent in the broker – say, the time the message spends in a queue)
3. In and RCn are measured by the receiver
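The three timestamping steps above can be sketched in a single process. This is a hypothetical harness using in-process queues in place of real connections; a real test would run the broker on a separate box over an actual network:

```python
# Minimal single-process sketch of the three measurement steps.
# Hypothetical harness: queues stand in for network connections.
import threading
import time
from queue import Queue

N = 1000                  # number of messages in the test
O = [0.0] * N             # On: send timestamps
I = [0.0] * N             # In: receive timestamps

to_broker = Queue()
to_receiver = Queue()

def sender():
    for n in range(N):
        O[n] = time.monotonic()      # step 1: measure On, then send
        to_broker.put(n)

def broker():
    for _ in range(N):
        n = to_broker.get()          # step 2: forward to the receiver
        to_receiver.put(n)

def receiver():
    for _ in range(N):
        n = to_receiver.get()
        I[n] = time.monotonic()      # step 3: measure In

threads = [threading.Thread(target=f) for f in (sender, broker, receiver)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both timestamps were taken on the same clock, so Ln = In - On is valid
latencies = [I[n] - O[n] for n in range(N)]
```

Per criterion 5 above, the timestamps go into preallocated lists; per criterion 6, sender and receiver run in separate threads.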

To minimise the noise in the measurement, the following criteria should be met:

1. There is a single processor/core dedicated to the sender so that other processes cannot interfere with the measurement.
2. There is a single processor/core dedicated to the broker so that other processes cannot interfere with the measurement.
3. There is a single processor/core dedicated to the receiver so that other processes cannot interfere with the measurement.
4. There is a separate network dedicated to the test so that other network traffic cannot interfere with the measurement.
5. Measured values should be stored in preallocated memory rather than in a file so that I/O operations do not interfere with the measurement.
6. The sender and the receiver should have separate connections to the broker to minimise mutual blocking. Also, each should be executed in a separate thread for the same reason.
7. Any statistical computations should be executed after the test has ended so that they don't interfere with the measurement.

# Measuring Latency

Measuring latency is easy. By having both the sender and the receiver on the same box we don't have to care about clock synchronisation. Latency is the time interval between sending the message and receiving it:

(1)
$$L_n =I_n - O_n$$

Note that latency is measured for a particular message, not for a particular point of time.

# Measuring Throughput

Although there is an obvious way to compute throughput for a test that spans several thousand messages

(2)
\begin{align} I_N - O_1 \over N \end{align}

this metric becomes almost useless as the number of messages in the test declines towards one. The lower the number of messages, the more the value becomes latency-like and less throughput-like. Actually, for N=1, the figure is exactly equal to the latency.
The problem is that there is no way to aggregate both sending and receiving rate consistently into a single "throughput" metric. To solve this problem, we will treat sending rate (STn) and receiving rate (RTn) separately.

To get the actual numbers we will first compute the "sending overhead" (SOn) and the "receiving overhead" (ROn), the former meaning the time needed to send a message, the latter the time needed to receive one:

(3)
$$SO_n = O_n - O_{n - 1}$$
(4)
$$RO_n = I_n - I_{n -1}$$

The sender throughput can then be expressed as the inverse of the sending overhead, and the receiver throughput as the inverse of the receiving overhead:

(5)
\begin{align} ST_n = {1 \over SO_n} \end{align}
(6)
\begin{align} RT_n = {1 \over RO_n} \end{align}

Consider two messages, one sent 4.5 seconds after the beginning of the test, the second 4.7 seconds after the beginning of the test. The sending overhead is 0.2 seconds, meaning that the sender requires 0.2 seconds to fully process the first message before it starts processing the second one. To extrapolate this value (how many messages would we be able to send in one second if the overhead were constant?) we compute the inverse: 1 / 0.2 = 5. Therefore the sender throughput is 5 messages per second.

Note that both sender throughput and receiver throughput are measured for a particular message, not for a particular point of time.
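Equations (3) to (6) can be sketched directly on recorded timestamps. The data below is illustrative, reusing the worked example above (a message sent at 4.5 s and the next at 4.7 s):

```python
# Sketch of equations (3)-(6): per-message overheads and throughputs
# computed from timestamps (illustrative data, not real measurements).
O = [4.5, 4.7, 4.9]   # On: send times in seconds
I = [4.6, 4.8, 5.1]   # In: receive times in seconds

# Overheads: SOn = On - On-1, ROn = In - In-1
# (undefined for the first message, hence the range starts at 1)
SO = [O[n] - O[n - 1] for n in range(1, len(O))]
RO = [I[n] - I[n - 1] for n in range(1, len(I))]

# Throughputs: STn = 1/SOn, RTn = 1/ROn, in messages per second
ST = [1.0 / so for so in SO]
RT = [1.0 / ro for ro in RO]
# ST[0] is roughly 5 messages per second, matching the worked example
```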

However, this approach has a problem when the throughput is high. Consider the time being measured in microseconds, as is done on most Linux systems. In that case the timespan between two subsequent messages can be 1 microsecond, 2 microseconds, 3 microseconds, 4 microseconds, etc. Inverting these values we get possible throughput figures of 1,000,000 messages a second, 500,000 messages a second, 333,333 messages a second, 250,000 messages a second, etc.

This is too coarse-grained to be of any use. Moreover, if two messages are processed within the same microsecond, the throughput becomes undefined (division by zero).

The solution would be to measure the time in nanoseconds. That way we would be pretty precise even if throughput is over 1,000,000 messages a second. However, most operating systems don't provide that amount of precision when measuring time. Therefore we have to measure throughput over M messages instead of two. (Note that the simple throughput measurement described above is just the special case M=1 of this generalised metric.)

(7)
\begin{align} SO_n = {O_n - O_{n - M} \over M} \end{align}
(8)
\begin{align} RO_n = {I_n - I_{n - M} \over M} \end{align}
(9)
\begin{align} ST_n = {1 \over SO_n} \end{align}
(10)
\begin{align} RT_n = {1 \over RO_n} \end{align}

Be aware that throughput is undefined for the first M messages of the test; therefore, to get sensible results, N should be orders of magnitude greater than M. To understand the impact of the value of M on the throughput measurement, have a look at the graph below:

As can be seen on the graph, a small M means that the throughput curve oscillates wildly, whereas a large M means that the curve is quite insensitive to temporary peaks in the throughput. Values of M from the middle range (approx. 100) seem to produce the most comprehensible graphs.
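The generalised metric of equations (7) to (10) amounts to a sliding window of M messages. A minimal sketch on synthetic nanosecond timestamps (one message every 10 microseconds, a made-up rate):

```python
# Sketch of equations (7)-(10): throughput averaged over a window of
# M messages, on synthetic nanosecond timestamps.
M = 100
N = 1000
O = [n * 10_000 for n in range(N)]   # On in ns: one message per 10 us

def send_throughput(O, n, M):
    """STn in messages per second, averaged over the last M messages."""
    SO_n = (O[n] - O[n - M]) / M     # equation (7), in nanoseconds
    return 1e9 / SO_n                # equation (9), messages per second

# Undefined for the first M messages of the test, so start at n = M
ST = [send_throughput(O, n, M) for n in range(M, N)]
# With one message every 10 us, ST is 100,000 messages a second throughout
```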

# Relationship between Latency and Throughput

It is important to understand that latency and throughput are not independent metrics; rather, they are two ways of looking at the same data. This paragraph explains the relationship between the two. (We'll use the overhead metric instead of throughput because it makes the equations simpler. Overhead is the inverse of throughput anyway, so it makes no difference.)

Taking the definitions of latency, sending overhead and receiving overhead and doing a few simple algebraic operations, we get the following equation:

(11)
$$RO_n - SO_n = L_n - L_{n - 1}$$

We can think of the right-hand side of the equation as a discrete counterpart of the derivative. Rewriting the equation in function-like notation, we get:

(12)
$$RO (n) - SO (n) = L' (n)$$

In other words, the difference between the receiving overhead and the sending overhead at a particular point determines whether the latency curve is stable, rising or falling at that point.

In terms of throughputs:

(13)
\begin{align} {1 \over RT (n)} - {1 \over ST (n)} = L' (n) \end{align}

You can think of this equation as a generic model of a "queueing system". If the publisher sends messages faster than they are received by the consumer, messages have to be "queued" somewhere on the way, thus adding to the latency (time spent in a "queue" is part of the latency). On the other hand, if the publisher sends messages more slowly than the consumer consumes them, the "queue" size gradually drops and the latency improves. If the sending and receiving rates are the same, the size of the "queue" is static and the latency is stable.
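Equation (11) is an identity that holds for any timestamps, which can be verified numerically. A quick sanity sketch on made-up send and receive times:

```python
# Numerical sanity check of equation (11): ROn - SOn = Ln - Ln-1.
# The timestamps are arbitrary made-up values; the identity holds
# for any of them, since it follows algebraically from equation (1).
O = [0.0, 0.1, 0.25, 0.5]    # On: send times
I = [0.2, 0.35, 0.6, 1.0]    # In: receive times

L = [I[n] - O[n] for n in range(len(O))]     # equation (1): Ln = In - On

for n in range(1, len(O)):
    RO_n = I[n] - I[n - 1]                   # equation (4)
    SO_n = O[n] - O[n - 1]                   # equation (3)
    # equation (11): the two sides must agree for every n
    assert abs((RO_n - SO_n) - (L[n] - L[n - 1])) < 1e-12
```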

# Measuring CPU usage

CPU usage should be measured separately for the three participants of the test – the sender, the broker and the receiver – irrespective of the fact that the sender and the receiver are running on the same box. The CPU usage on the broker may be sent to the receiver in the message body to get all the measurements concentrated on the same box.

Like all the other measurements, CPU usage should be measured for a particular message rather than for a particular point of time. There is a problem with measuring CPU usage this way: if we measure the CPU usage at one specific moment of the message life-cycle, won't that mean that we never measure the CPU load in the other parts of the life-cycle?

To avoid this problem we should ensure that the window for measuring the CPU usage is at least as wide as the span of the part of the message life-cycle executed on the current box. That way we will leave no time slice unmeasured.

The ideal way to do this would be to measure the number of clock ticks the processor has already spent executing the process in question (Un) along with the current time (Tn) for each message. The CPU usage for a particular message (Cn) can then be computed using the following equation:

(14)
\begin{align} C_n = {U_n - U_{n - 1} \over T_n - T_{n - 1}} \end{align}

It may turn out either that the CPU usage computation is too resource-consuming and distorts the measurement considerably, or that clock ticks are too imprecise to yield any sensible value of CPU load for a single message.

Both these problems can be solved by measuring Cn only for every n-th message (say every 100th) and assuming that the load is evenly distributed over those 100 messages.
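A sketch of equation (14) with sampling every 100th message, using Python's standard-library clocks as stand-ins: `time.process_time` for the CPU time already consumed by the process (Un) and `time.monotonic` for the current time (Tn). The workload function is a hypothetical placeholder for per-message processing:

```python
# Sketch of equation (14): Cn = (Un - Un-1) / (Tn - Tn-1), sampled
# every SAMPLE-th message to keep the measurement cheap.
# process_time() stands in for clock ticks; burn() is a hypothetical
# stand-in for per-message work.
import time

SAMPLE = 100
U_prev = time.process_time()     # Un-1: CPU time consumed so far
T_prev = time.monotonic()        # Tn-1: current time
cpu_usage = []                   # Cn for every SAMPLE-th message

def burn(k):
    s = 0
    for i in range(k):
        s += i * i
    return s

for n in range(1, 1001):
    burn(200)                    # "process a message"
    if n % SAMPLE == 0:
        U_n = time.process_time()
        T_n = time.monotonic()
        # equation (14): CPU seconds per wall-clock second
        cpu_usage.append((U_n - U_prev) / (T_n - T_prev))
        U_prev, T_prev = U_n, T_n
```

The resulting values are fractions (0.0 to roughly 1.0 per core) rather than percentages; multiply by 100 to match the Cn units above.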

The reasoning above applies to any auxiliary measurement loosely tied to the test – say, network utilisation measurement, or the load on intermediary routers and other middle-boxes.

# Computing Aggregates

After executing the test with a sufficient number of messages, we can compute the following aggregates for every value measured (latency, sender throughput, receiver throughput, CPU usage, etc.):

average
Average is useful for expressing the overall performance of the system. For a user not doing any detailed analysis, this is how the messaging system's performance will appear in the long run. However, if the performance is peaky, the average may not correspond to any actually measured value. Some values may be way below the average, while others may be way above it. To compute the average we use the following equation:
(15)
\begin{align} X_{avg} = {\sum_{n = 1}^N X_n \over N} \end{align}
median
The median value is used to ignore occasional peaks and measure "normal performance". Note that although the median works well when the performance is more or less stable with occasional peaks, it becomes rather arbitrary when the performance tends to oscillate around several alternative values. The median is the N/2-th value in the sequence of measurements ordered in ascending order. If there is an even number of values in the sequence, there are in fact two medians; however, in our tests we will ignore the second one and consider only the lower one to be the median.
standard robust deviation
Standard robust deviation is a metric for the "peakiness" of the graph. Both peak height and peak density are taken into account. It can be thought of as the average peak height. To compute the standard robust deviation we use the equation below. An alternative "peakiness" metric would be to square the differences from the median and take the square root of the aggregate afterwards. We could also measure the median peak height to get a value for the "normal fluctuation of value X", disregarding any "abnormal" peaks.
(16)
\begin{align} X_{dev} = {\sum_{n = 1}^N |X_n - X_{med}| \over N} \end{align}
percentiles
Percentiles are values that divide the sample into two parts. Say, the 99th percentile is the value for which lower values form 99% of the sample and greater values form 1% of the sample. Specifically, the 50th percentile is equal to the median.

To understand the difference between the average and the median, have a look at the graph below:
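The aggregates above can be sketched in a few lines. The latency sample is illustrative; the median follows the lower-of-two convention chosen above, and the percentile uses the nearest-rank method (one of several common conventions):

```python
# Sketch of the aggregates: average (15), lower median, standard
# robust deviation (16) and nearest-rank percentile, on an
# illustrative latency sample (microseconds, made-up values).
import math

def average(xs):
    return sum(xs) / len(xs)                         # equation (15)

def lower_median(xs):
    # N/2-th value in ascending order; for an even N this picks
    # the lower of the two medians, as chosen above
    return sorted(xs)[(len(xs) - 1) // 2]

def robust_deviation(xs):
    med = lower_median(xs)
    return sum(abs(x - med) for x in xs) / len(xs)   # equation (16)

def percentile(xs, p):
    # nearest-rank: smallest value with at least p% of the sample below
    ordered = sorted(xs)
    k = max(0, min(len(ordered) - 1,
                   math.ceil(p / 100.0 * len(ordered)) - 1))
    return ordered[k]

latencies = [150, 152, 148, 340, 151, 149, 250, 150, 153, 147]
avg = average(latencies)           # 179.0: pulled up by the 340 peak
med = lower_median(latencies)      # 150: the "normal performance"
dev = robust_deviation(latencies)  # 30.2: average peak height
p99 = percentile(latencies, 99)    # 340: the worst-case tail
```

Note how the single 340-microsecond peak pulls the average well above the median, which is exactly the effect the graph below illustrates.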

# Parametrised Aggregates

Sometimes we are interested in how a certain aggregate value changes depending on a specific parameter. Say, we want to know how latency is affected by different sizes of the messages transferred: What's the latency for 6-byte messages? For 100-byte messages? For 1 kB messages?
Although from the point of view of the end user message size is the only interesting parameter, during development we may be interested in the impact of various buffer sizes, batch sizes, time-out values, etc.
For an example of a parametrised aggregate, have a look below. The graph shows the dependency of the throughput median (measured in both bytes and messages) on the message size:

It should be noted that message size is a bit different from the other applicable parameters. Whereas in tests with a constant message size the throughput in terms of messages per second and in terms of bytes per second are isomorphic – we can treat them as two representations of the same metric – in tests where message size is a parameter the two throughputs become different, even, in a way, inverse to one another. The number of messages per second decreases as the size of the message grows. However, the number of bytes per second increases as the message size grows, until it ultimately approaches the network bandwidth.

# Distributions

A distribution is a way to describe how often different values occur within the test. If we are interested in how often low latencies occur, how often the latency is in the middle of the range, and how dense the latency peaks are, a distribution graph helps.

In the ideal case the distribution would be normal – represented by the bell curve. Obviously, the bell curve should be as narrow as possible (a broad bell curve means numerous peaks).

However, sometimes the distribution may have several humps instead of a single one. In this case the measured value tends to oscillate around several different centres of gravity. In other words, it may mean that, say, the latency oscillates around 0.3 ms for an hour, but the next hour it oscillates around 1.6 ms. This kind of behaviour is highly undesirable in a messaging system, and therefore we should check for multi-hump distributions when testing our software.

A distribution can be displayed as a histogram. There's no ideal value for the number of bins in the histogram or for how much they should overlap. Different values can reveal different traits of the system. Some more research should be done in this area.
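A minimal fixed-width-bin histogram sketch, with the bin count as the tunable just discussed (the latency sample is illustrative):

```python
# Minimal fixed-width-bin histogram sketch for a latency sample.
# The number of bins is a tunable, as noted above; the data is made up.
def histogram(xs, bins):
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1          # guard against zero width
    counts = [0] * bins
    for x in xs:
        # clamp the maximum value into the last bin
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    return counts

latencies = [150, 151, 149, 150, 340, 152, 150, 250, 151, 150]
counts = histogram(latencies, bins=4)
# most of the mass falls into the first bin (the "normal" latencies),
# with the 250 and 340 peaks showing up as small humps in later bins
```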

Additionally, we may want to devise a “humpiness” metric that would allow us to measure humpiness automatically rather than by manually inspecting the histogram.

The output from our performance tests is formatted to be readable by the gnuplot tool. This way you can inspect the graphs in a quick and efficient manner.

## Latency

Let's have a look at latency graph:

There were 10,000 messages in the test; the graph shows the latency for each one of them. What we see is that the latency for the first 500 messages is quite poor (up to 1,500 microseconds); however, afterwards it stabilises at approximately 150 microseconds, with occasional peaks of up to 400 microseconds.

A different way to look at the same data is the histogram:

What we see in this graph is that most messages (over 4,000) were transferred with a latency of approximately 150 microseconds. There is a little hump at 340 microseconds, meaning that peaks this high happen not very often (160 occurrences) but still more often than peaks of 250 microseconds (75 occurrences), suggesting that there's some kind of problem in the application that occasionally raises the latency from 150 microseconds to 340 microseconds.

## Throughput

As explained earlier, there is no single throughput value – there are two of them. Therefore it makes sense to plot both throughput curves on a single graph. The sender throughput and the receiver throughput tend to be very close to each other most of the time anyway:

As can be seen, the sender and receiver throughputs differ at the beginning of the test. This seems to suggest that there is some kind of buffering involved in the lower layers of the stack. However, after the first 500 messages, both the sender and receiver throughput stabilise at approximately 75,000 messages a second, with occasional drops down to 50,000 messages a second and peaks up to 100,000 messages a second.

Note that gnuplot allows you to inspect interesting parts of the graph in detail:

It is of course possible to inspect the throughput histograms in the same way we've inspected the latency histogram.

# Conclusion

Our goal is to build a testing environment capable of measuring all the above-mentioned metrics. The environment should also be implementation-agnostic, i.e. it should be able to measure all kinds of messaging software, ranging from raw TCP/IP to the most sophisticated commercial implementations.

Written: 06 Aug 2007 07:00
Revised: 04 Aug 2010 20:01
