Introduction
The term "jitter" is used to express how much do individual latencies tend to differ from the mean. To say that there's jitter in your application is not a positive statement. It means that individual tasks require different times to be processed. Say, 99% of the tasks are processed in one millisecond, while 1% hangs up for a while causing processing time to exceed one second. Such a behaviour is unwelcome at best and a showstopper at worst. This article explores the methodology and techniques for measuring and analysing latency jitter.
Raw data
We've run the ØMQ latency test and measured the duration of each roundtrip to get the sample data. There were 200,000 roundtrips in the test. The test was intentionally run on a laptop with a GUI running at the same time to introduce various performance side effects and thus get a more "interesting" sample:
There's one important thing to point out. The latency is measured by sending a message back and forth between two nodes and measuring how long each roundtrip takes. The duration is then divided by two to get a one-way latency rather than a roundtrip latency. While this "pair-wise latency" doesn't affect the computation of the mean, it does affect the deviation. In general, it causes the deviation to decrease, i.e. the peaks don't seem to be as large as they truly are.
Given the actual distribution of messaging latencies (a large number of more or less evenly distributed low values with several extremely high peaks), the impact of "pair-wise latency" is likely to be negligible for results close to the median value; however, the actual peak latencies are probably twice the latencies measured. For example, a roundtrip consisting of two subsequent "normal" latencies like 74.99 us and 75.01 us will yield a latency of 75 us. However, if the roundtrip consists of a normal latency (75 us) and a peak latency (12000 us), the resulting latency will be 6037.5 us, roughly half of the actual peak. Two subsequent peak latencies would yield a reasonable estimate of the peak, however, such an event is extremely unlikely.
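To illustrate the effect, here's a small simulation sketch. The numbers are purely hypothetical (one-way latencies around 75 us with occasional 12000 us peaks), not the actual test data, and NumPy is assumed to be available. Pairing the latencies the same way the latency test does typically shrinks the reported peak to about half of its true size:

```python
import numpy as np

# Purely hypothetical numbers, not the actual test data: one-way latencies
# of roughly 75 us with an occasional 12000 us peak.
rng = np.random.default_rng(42)
one_way = rng.normal(75.0, 1.0, size=200_000)
one_way[rng.random(one_way.size) < 0.001] = 12_000.0

# The latency test reports half of each roundtrip, i.e. the average of
# two consecutive one-way latencies.
pairwise = (one_way[0::2] + one_way[1::2]) / 2.0

print("highest true one-way latency:     %.1f us" % one_way.max())
print("highest reported pair-wise value: %.1f us" % pairwise.max())
```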
The impact of measuring latencies pair-wise is subject to further research.
Deviations
The obvious way to measure jitter is to compute the standard deviation of the sample:
| mean | 77.06 us |
|---|---|
| standard deviation | 284.76 us |
However, as latency distributions are usually fat-tailed (an empirical observation; you can spot the fat tail in the histograms below yourself), metrics like standard deviation are of little practical value. It may be wiser to use a robust deviation metric like the median absolute deviation:
| median | 75 us |
|---|---|
| median absolute deviation | 26.69 us |
The interpretation of the median absolute deviation would then be "the most typical deviation from the most typical latency", the "most typical latency" being the median value.
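For reference, the figures above can be computed with a few lines of NumPy. This is just a sketch; latencies.txt is a hypothetical file with one measured latency per line, so substitute whatever raw data you have:

```python
import numpy as np

# latencies.txt is a hypothetical file holding one measured latency
# (in microseconds) per line.
latencies = np.loadtxt("latencies.txt")

mean = latencies.mean()
std = latencies.std()

median = np.median(latencies)
# Median absolute deviation: the median of absolute deviations from the median.
mad = np.median(np.abs(latencies - median))

print("mean:                      %.2f us" % mean)
print("standard deviation:        %.2f us" % std)
print("median:                    %.2f us" % median)
print("median absolute deviation: %.2f us" % mad)
```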
The problem with deviations is twofold:
- The distribution of latencies tends to be highly non-normal, with clustered outliers and a strong positive skew. With such a distribution it's difficult to interpret deviation figures.
- More importantly, jitter analysis is concerned primarily with the outliers. Deviation metrics don't say much about outliers. A more differentiated approach is required.
Histograms
The straightforward way to graph the peaks as well as the regular values would be to use a histogram. However, even a single attempt to use a histogram for jitter analysis will show that it's not the best possible way. All the values are tightly packed to the left of the histogram, while the slots to the right are so underpopulated that they are not even visible:
To stretch the values more to the right we may opt to take the logarithm of the input values before creating the histogram. The result looks better; however, the rightmost slots, the ones representing the peaks, i.e. the values we are most interested in, are still so low as to be invisible:
The next transformation would thus be to take the logarithm of the y-axis of the histogram so that even a single occurrence in a slot becomes visible. (Note that there's no logarithm of zero, thus we have to handle empty slots separately.) At last we get a readable graph:
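A minimal sketch of how such a double-log histogram might be produced, again assuming the latencies sit in a hypothetical latencies.txt and matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

latencies = np.loadtxt("latencies.txt")  # hypothetical input file

# First log: bin log10(latency) so that the low values don't all collapse
# into a single leftmost slot.
# Second log: logarithmic y axis so that slots holding only a handful of
# peak latencies remain visible (empty slots are simply left out, which
# sidesteps the "no logarithm of zero" problem).
plt.hist(np.log10(latencies), bins=50, log=True)
plt.xlabel("log10(latency in us)")
plt.ylabel("number of roundtrips (log scale)")
plt.show()
```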
However, there are still two problems with this representation of jitter:
- The slots for peak latencies are very wide because they are logarithms of the original values. One slot holds, say, all latencies from 1000 us to 10000 us. This gives us only a dim idea of the actual peak latencies.
- The histogram - after undergoing two log10(x) transformations - is pretty unintuitive and hard to interpret.
While such a histogram may still be of some use, there's a different way of presenting jitter that avoids the two drawbacks mentioned above.
Quantile graphs
This method is based on the notion of quantiles. There are different quantiles out there; however, in this article we'll use only a specific type of quantile called a percentile. A percentile is the value of a variable below which a certain percentage of observations fall.
Say we have a latency sample with a million values. The 99th percentile of the sample would be the latency that is greater than 990,000 of the latencies in the sample and lower than the remaining 10,000 latencies.
It is pretty obvious that the 0th percentile is the minimal value of the sample and the 100th percentile is the maximal value of the sample. The 50th percentile is the median value.
Now, the idea is to chart all the percentiles in a single graph. We'll get the minimal latency as the leftmost point of the graph and the maximal latency (the highest peak) as the rightmost point of the graph:
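A sketch of how such a quantile graph might be plotted (same hypothetical latencies.txt as before):

```python
import numpy as np
import matplotlib.pyplot as plt

latencies = np.loadtxt("latencies.txt")  # hypothetical input file

# One point per percentile, from the minimum (0th) to the maximum (100th).
percents = np.linspace(0, 100, 101)
values = np.percentile(latencies, percents)

plt.plot(percents, values)
plt.xlabel("percent of roundtrips")
plt.ylabel("latency (us)")
plt.show()
```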
There are two ways to read the graph:
- What portion of the sample has a better latency than N microseconds? Find N on the y axis, find the corresponding point on the graph, and look at its x coordinate. That's the percentage you are looking for.
- What's the worst latency to expect in N% of cases? Find N on the x axis and find the corresponding value on the y axis. That's the latency that you can guarantee for N% of cases.
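Both lookups translate directly into code. A small sketch, using 100 us and 99% as example values and the hypothetical latencies.txt again:

```python
import numpy as np

latencies = np.loadtxt("latencies.txt")  # hypothetical input file

# What portion of the sample has a better latency than 100 us?
print("below 100 us: %.2f %%" % (100.0 * np.mean(latencies < 100)))

# What's the worst latency to expect in 99% of cases?
print("99th percentile: %.1f us" % np.percentile(latencies, 99))
```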
It's quite easy to grasp the overall distribution of latencies from the graph. If it keeps low almost all the time and shoots up on the extreme right, the latencies are almost uniformly low, but there are some outliers in the sample. If the graph is a straight line, the latencies are evenly distributed between the minimum and maximum values. If the graph rises quickly and stays almost constant afterwards, the latencies are almost uniformly high with some downward peaks.
It's pretty clear from our graph that 99% of the sample consists of rather reasonable latencies, while the remaining 1% contains peaks as high as 60000 microseconds.
To have a better look at the lower part of the graph, let's ignore the peaky 1% of the graph:
The black lines show how a latency can be converted to the percentage of the sample with better latencies and vice versa.
While correlating percentages and latencies is extremely useful by itself, there's more to it. Looking at the graph you can determine whether the latencies are distributed more or less uniformly or whether there are several latency clusters in the sample. This kind of information is useful for analysis: two distinct latency clusters would indicate that there's an occasional source of latency in the system. Most messages don't hit the problem (the lower cluster), some do (the higher cluster).
The rule of thumb for identifying clusters is as follows: flat lines are clusters, steep lines are latency ranges with few or no samples. Thus if we have a graph that keeps flat, then goes up quickly, then keeps flat again, we know we have two clusters in the sample.
To get precise metrics for an individual cluster, get the x and y values of the leftmost and rightmost points of the flat line (the cluster). See the picture above. Black lines are used to delimit a single cluster. On the y axis it can be seen that latencies in the cluster range from 88 to 96 microseconds. On the x axis the cluster ranges from 56% to 95%. Thus it contains 39% of the sample (95-56).
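If you prefer exact numbers to eyeballing the axes, the same cluster metrics can be computed directly from the data. A sketch, using the 88-96 us boundaries read off the graph above and the hypothetical latencies.txt:

```python
import numpy as np

latencies = np.loadtxt("latencies.txt")  # hypothetical input file

# Cluster boundaries read off the y axis of the graph above (88-96 us);
# the corresponding percentages are computed from the data instead of
# being eyeballed from the x axis.
low, high = 88, 96
pct_low = 100.0 * np.mean(latencies < low)
pct_high = 100.0 * np.mean(latencies < high)

print("cluster %d-%d us covers %.0f%% to %.0f%% of the sample, i.e. %.0f%% of all roundtrips"
      % (low, high, pct_low, pct_high, pct_high - pct_low))
```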
For more exercise with reading quantile graphs try to interpret the following one (distribution of the highest peaks in our sample):
Raw data revisited
However useful the statistical data, there's still no substitute for visual inspection of the raw data. The human brain is much better at finding patterns than any statistical method.
Let's have a look at our raw data; however, let's ignore the outliers and look only at the low part of the graph containing 99.9% of the sample:
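A sketch of such a plot: drop everything above the 99.9th percentile and plot the remaining latencies in the order in which the roundtrips were performed (hypothetical latencies.txt again):

```python
import numpy as np
import matplotlib.pyplot as plt

latencies = np.loadtxt("latencies.txt")  # hypothetical input file

# Hide the top 0.1% so that the few huge peaks don't squash the rest
# of the graph; NaN values are simply not drawn.
cutoff = np.percentile(latencies, 99.9)
visible = np.where(latencies <= cutoff, latencies, np.nan)

plt.plot(visible, ",")  # one dot per roundtrip, in the order they were run
plt.xlabel("roundtrip number")
plt.ylabel("latency (us)")
plt.show()
```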
It's immediately obvious that the latencies are dependent on time. There are time intervals where latencies are higher and time intervals where latencies are lower. The assumption would be that there were different processes running on the laptop that caused the latency test to perform better or worse depending on whether they were running or idle at the moment.
Conclusion
This article describes several methods for identifying and measuring latency jitter. The methodology is indispensable when analysing high performance systems sensitive to latency jitter (algo-trading, stock order matching).