Tracking frequency anomalies across millions of events in realtime

Unomaly can detect and highlight change in your applications and infrastructure by separating the noise from the new data.

Why frequency anomalies?

Some of the use cases that frequency anomaly detection in Unomaly enables are:

  • Detecting when an hourly backup job didn’t run
  • Detecting when a known error goes from 20 occurrences per second to 2000
  • Detecting when a service is down

The power of the feature really shines when these newly detected anomalies are coupled with Unomaly’s detection of new data: the backup job didn’t run because someone mistakenly removed it from the crontab; errors are spiking because a newly deployed service has a parsing bug; the service is down because a service it depends on just crashed…

Finding the right algorithms

There are plenty of existing algorithms for anomaly detection. Each has its own strengths, but most require a fairly large sample size to detect anomalies accurately. That means high storage requirements and, often, throughput too low for the load that Unomaly on-premise instances can experience (50k events per second on midsize machines). To keep things simple, we chose to store a small time series with each of our event types in MongoDB and started experimenting with the z-score of those time series: the z-score is how many standard deviations an element is from the mean.

z = (X - μ) / σ

By defining a threshold s, we can say that a profile is spiking when z>s. This calculation is so cheap that it can be done for every event.
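
As a rough illustration (not Unomaly’s actual code), here is a minimal Python sketch of that check, assuming each profile keeps a short series of per-bucket event counts and using an arbitrary threshold of 3 standard deviations:

    # Minimal sketch of the z-score spike check (illustrative only).
    # Assumes each profile stores a short time series of event counts,
    # e.g. one bucket per minute.
    import statistics

    SPIKE_THRESHOLD = 3.0  # "s" in the text: how many standard deviations count as a spike

    def is_spiking(counts, current_count, threshold=SPIKE_THRESHOLD):
        """Return True if the current bucket's count is a frequency spike."""
        if len(counts) < 2:
            return False  # not enough history to estimate mean/stddev
        mean = statistics.mean(counts)
        stdev = statistics.pstdev(counts)
        if stdev == 0:
            # Perfectly flat history: any deviation from the mean is a change.
            return current_count != mean
        z = (current_count - mean) / stdev  # z = (X - μ) / σ
        return z > threshold

    # Example: a known error usually logs ~20 times per bucket, suddenly 2000.
    history = [18, 22, 19, 21, 20, 23, 20, 19]
    print(is_spiking(history, 2000))  # True
    print(is_spiking(history, 24))    # False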

Normal distribution and Z-score:

We apply this algorithm to a window of roughly two hours to detect profiles that are spiking. To compensate for the noise this creates, we run a second stage that we call “noise suppression”. Noise suppression looks at how often we’ve seen a profile spike and decreases the score we give to the anomaly if the profile has spiked a lot recently. This simple filtering step removes a lot of the noise and leaves us with the truly abnormal spikes, which we present in our UI.
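
A minimal sketch of what such a suppression step could look like; the look-back window and decay factor below are made-up values, not Unomaly’s actual parameters:

    # Illustrative sketch of noise suppression: the more often a profile has
    # spiked recently, the lower the score assigned to a new spike.
    import time
    from collections import defaultdict, deque

    RECENT_WINDOW_SECONDS = 6 * 3600   # assumed look-back window for "recent" spikes
    DECAY_PER_RECENT_SPIKE = 0.5       # assumed penalty per recent spike

    recent_spikes = defaultdict(deque)  # profile_id -> timestamps of recent spikes

    def suppressed_score(profile_id, raw_score, now=None):
        """Scale a spike's score down based on how noisy the profile has been."""
        now = now or time.time()
        spikes = recent_spikes[profile_id]
        # Drop spikes that fell out of the look-back window.
        while spikes and now - spikes[0] > RECENT_WINDOW_SECONDS:
            spikes.popleft()
        score = raw_score * (DECAY_PER_RECENT_SPIKE ** len(spikes))
        spikes.append(now)
        return score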

This algorithm is limited and cannot detect any complex patterns such as seasonality, but it’s good enough to detect big changes that might happen when an incident is starting.

A low frequency log started happening a lot more often:

Detecting stopped events

There are a lot of unimportant events that can stop happening and then happen again a few hours or days later. To avoid creating tons of false positives, we’ve limited our detection to events that we think are periodic, with the idea of detecting when something like a CRON job didn’t run. To do that, we collect the mean inter-arrival time of a log event and its standard deviation. If the standard deviation is very low, we can assume we’re dealing with a periodic task. For example, a typical daily CRON job would have a mean inter-arrival time of 24 hours and a standard deviation of less than 1 second. Using the 99.7% rule (three standard deviations), we can predict with high certainty when the next event is expected for each of these periodic profiles.
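
A sketch of that prediction; the periodicity cutoff on the relative standard deviation is an assumed value for illustration, not Unomaly’s internal threshold:

    # Illustrative sketch of the "stopped event" deadline prediction.
    # A profile is treated as periodic when its inter-arrival times barely vary;
    # the 99.7% (three-sigma) rule then gives an upper bound on when the next
    # event should arrive.
    MAX_RELATIVE_STDDEV = 0.01  # assumed cutoff: stddev must be <1% of the mean

    def is_periodic(mean_interval, stddev_interval):
        return mean_interval > 0 and stddev_interval / mean_interval < MAX_RELATIVE_STDDEV

    def next_deadline(last_seen, mean_interval, stddev_interval):
        """Timestamp after which the event is considered stopped (three-sigma rule)."""
        return last_seen + mean_interval + 3 * stddev_interval

    # Example: a daily CRON job (mean 24h, stddev 0.5s).
    last_seen = 1_552_953_600           # some Unix timestamp
    deadline = next_deadline(last_seen, 24 * 3600, 0.5)
    print(deadline - last_seen)         # 86401.5 seconds, i.e. 24h + 1.5s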

To calculate the inter-arrival time variance online, we used a parallel algorithm: two windows of 5000 samples are kept for each profile and rotated when full. This gives enough history to reliably compute a representative variance of the inter-arrival time. The algorithm then periodically queries our database to find all events that have missed their “deadline”. These events are reported as stopped. A similar suppression mechanism makes sure we don’t spam users with events that stop repeatedly.
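
For reference, the core of such a parallel variance algorithm (Welford’s online update per window, plus Chan et al.’s merge formula) looks roughly like this; window rotation and persistence are left out:

    # Sketch of the parallel variance algorithm: each window keeps
    # (count, mean, M2), updated online, and two windows can be merged in O(1).
    class Window:
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # sum of squared differences from the mean

        def add(self, x):
            """Welford's online update for one inter-arrival sample."""
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

    def merge(a, b):
        """Combine the statistics of two windows without revisiting samples."""
        merged = Window()
        merged.n = a.n + b.n
        if merged.n == 0:
            return merged
        delta = b.mean - a.mean
        merged.mean = a.mean + delta * b.n / merged.n
        merged.m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / merged.n
        return merged

    def variance(w):
        return w.m2 / w.n if w.n > 0 else 0.0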

Missing CRON jobs can now be detected within seconds and immediately reported to our users!

One of our services stopped logging its status every 5 seconds:

What’s next?

While this approach is sufficient to detect big shifts in frequency, it does not understand seasonal changes and can miss more subtle ones. We plan on looking into using a TSDB like Prometheus or InfluxDB to store longer-term data on how our profiles behave.

Written by Alexandre Pesant
on March 19, 2019