Mining for execution patterns in streaming log data
Many people can’t easily enable tracing in their environment, let alone cross-application tracing of user requests. Unomaly produces both anomalies and condensed common log events (“profiles”). With a new feature we call Workflows, Unomaly lets you look at common sequences of log events, giving you a window of visibility into the execution paths present in your regular log events. Think of it as a simple variant of tracing, but without the effort.
Workflows answer the question, “what are my systems doing?”, by giving you a simple way to gain a deeper understanding of your infrastructure.
In future iterations of Workflows, we will add more context to anomalies, improve incident investigations, and have workflows serve as a model for anomaly detection when a sequence of log events changes.
Extracting workflows from your logs
Let’s say you have an application that runs backup jobs of two different kinds (incremental and full). Your logs may look something like this:
2019-05-22T10:01:00 INFO - Full backup: started job_id=123
2019-05-22T10:02:00 INFO - Full backup: created new volume snapshot job_id=123
2019-05-22T10:03:00 INFO - Full backup: finished with success job_id=123
2019-05-22T11:01:00 INFO - Incremental backup: started job_id=456
2019-05-22T11:02:00 INFO - Incremental backup: updated existing volume snapshot job_id=456
2019-05-22T11:03:00 INFO - Incremental backup: finished with success job_id=456
Visual inspection tells us that there are two workflows here. The first three log events comprise the “full backup” workflow and the last three log events make up the “incremental backup” workflow. In a real-world scenario, different backup jobs will run concurrently and the log events may be interleaved.
Given these 6 kinds of log events, the output from the workflow extraction process would look like this:
Workflow 1:
Profile A: <timestamp> INFO - Full backup: started job_id=<number>
Profile B: <timestamp> INFO - Full backup: created new volume snapshot job_id=<number>
Profile C: <timestamp> INFO - Full backup: finished with success job_id=<number>
Workflow 2:
Profile D: <timestamp> INFO - Incremental backup: started job_id=<number>
Profile E: <timestamp> INFO - Incremental backup: updated existing volume snapshot job_id=<number>
Profile F: <timestamp> INFO - Incremental backup: finished with success job_id=<number>
How workflow extraction works
The starting point for workflow extraction came from the research paper DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. The paper describes two approaches to workflow extraction: one based on density-based clustering and one based on deep neural networks (specifically LSTMs). We chose to implement the density-based algorithm, mainly because it is less complex, requires less memory and CPU, and is easier to debug (the DNN approach is more of a black box). According to the paper’s authors, both approaches yielded similar results.
The main steps to extract workflows from streaming log events are:
- Tokenize log events
- Match events into “profiles” (i.e. message types/message templates)
- Run density-based clustering to find common workflows
- Store workflows in a database
Tokenization and matching are already part of Unomaly’s core algorithm. The main idea is that a log event is broken up into a list of tokens, which is then used to determine which tokens make up the message template (a “profile” in Unomaly) and which tokens are dynamic parameters.
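As an illustration, here is a minimal sketch of the idea (a hypothetical toy, not Unomaly’s actual algorithm): split each line into tokens and mask the ones that look dynamic, so that lines produced by the same log statement map to the same profile.

```python
import re

# Minimal, hypothetical tokenizer. The real core algorithm is more
# sophisticated, but the idea is the same: keep the constant tokens and
# mask the dynamic ones (timestamps, numbers, IDs).
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")
NUMBER = re.compile(r"^\d+$")

def to_profile(line: str) -> str:
    tokens = []
    for token in line.split():
        key, sep, value = token.partition("=")
        if TIMESTAMP.match(token):
            tokens.append("<timestamp>")
        elif sep and NUMBER.match(value):      # key=value pairs like job_id=123
            tokens.append(f"{key}=<number>")
        elif NUMBER.match(token):
            tokens.append("<number>")
        else:
            tokens.append(token)
    return " ".join(tokens)

print(to_profile("2019-05-22T10:01:00 INFO - Full backup: started job_id=123"))
# <timestamp> INFO - Full backup: started job_id=<number>
```

With this kind of masking, both example backup jobs collapse into the six profiles A to F above, regardless of timestamp or job_id.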
Recording data
When a log event comes in, we look up its profile ID (A to F in our example) and store it in a ring buffer. The ring buffer holds the profile IDs of the last 10 log events.
For our example, the ring buffer will look like [A, B, C, D, E, F, X, X, X, X] (X denotes log events outside our example that are not of importance here).
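In Python terms, a bounded deque behaves like this ring buffer (an illustrative sketch, not Unomaly internals):

```python
from collections import deque

# Per source system: the profile IDs of the ten most recent log events.
recent_profiles = deque(maxlen=10)

for profile_id in ["A", "B", "C", "D", "E", "F"]:
    recent_profiles.append(profile_id)

print(list(recent_profiles))  # ['A', 'B', 'C', 'D', 'E', 'F']
```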
The density-based clustering is based on co-occurrence matrices. A co-occurrence matrix stores the counts for log events following each other. The “1” at row A, column B means that we have seen the sequence [A, B] once.
Our example (without the X’s) would be a 6x6 matrix and it would look like this:
| Profile ID | A | B | C | D | E | F |
|------------|---|---|---|---|---|---|
| A          |   | 1 |   |   |   |   |
| B          |   |   | 1 |   |   |   |
| C          |   |   |   | 1 |   |   |
| D          |   |   |   |   | 1 |   |
| E          |   |   |   |   |   | 1 |
| F          |   |   |   |   |   |   |
To make the clustering more robust, we keep matrices for different “depths”. Depth refers to the number of steps between two profiles that co-occur. For example, at depth 1 our example’s co-occurrences are [A, B], [B, C], [C, D], [D, E], [E, F]. At depth 2, we get the following co-occurrences by skipping one element each time: [A, C], [B, D], [C, E], [D, F]. At depth 3, two elements are skipped, and so on.
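A rough sketch of this recording step, with sparse dictionaries standing in for the matrices (the function name and representation are ours, chosen for illustration):

```python
from collections import defaultdict

def count_cooccurrences(profile_ids, max_depth=3):
    """For each depth d, count how often profile j appears d steps after
    profile i in the event sequence. One sparse matrix (a nested dict of
    counts) per depth."""
    counts = {d: defaultdict(lambda: defaultdict(int))
              for d in range(1, max_depth + 1)}
    for depth in range(1, max_depth + 1):
        for i in range(len(profile_ids) - depth):
            a, b = profile_ids[i], profile_ids[i + depth]
            counts[depth][a][b] += 1
    return counts

counts = count_cooccurrences(["A", "B", "C", "D", "E", "F"])
print(counts[1]["A"]["B"])  # 1, the sequence [A, B] was seen once
print(counts[2]["A"]["C"])  # 1, A followed by C two steps later
```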
The matrices are recorded in memory on a per-source-system basis and are flushed to disk at regular intervals. Since they are plain counts, they can be aggregated when loaded at clustering time.
Clustering
A separate service runs the clustering job at regular intervals. It loads the last 24 hours of data, aggregates the matrices and starts the clustering.
First, the matrices are loaded and normalized by dividing each row by its sum. This produces a matrix of probabilities that the profile in column j follows the profile in row i.
The clustering procedure for a “system” is as follows:
- Iterate over all co-occurrence probabilities in the depth 1 matrix.
- If a probability is larger than a threshold (e.g. 0.9), create a new workflow (e.g. [A, B])
The output is written to a database as workflows, i.e. lists of profile IDs that often appear in sequence.
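A simplified sketch of this step, reusing the sparse count dictionaries from the recording sketch above (a hypothetical illustration, not the production code):

```python
from collections import defaultdict

def extract_workflows(matrices, threshold=0.9):
    """Aggregate depth-1 co-occurrence matrices, normalize each row into
    transition probabilities, and emit [i, j] pairs whose probability
    exceeds the threshold."""
    # Aggregation: the matrices hold plain counts, so they can simply be summed.
    total = defaultdict(lambda: defaultdict(int))
    for matrix in matrices:
        for row, cols in matrix.items():
            for col, count in cols.items():
                total[row][col] += count

    workflows = []
    for row, cols in total.items():
        row_sum = sum(cols.values())
        for col, count in cols.items():
            if count / row_sum > threshold:
                workflows.append([row, col])
    return workflows

# Toy depth-1 matrix matching the example table above (row -> column -> count).
depth1 = {"A": {"B": 1}, "B": {"C": 1}, "C": {"D": 1}, "D": {"E": 1}, "E": {"F": 1}}
print(extract_workflows([depth1]))
# [['A', 'B'], ['B', 'C'], ['C', 'D'], ['D', 'E'], ['E', 'F']]
```

Note that in this toy sequence every transition, including the cross-workflow step [C, D], has probability 1. In a real stream, concurrent jobs interleave their log events, so transitions that cross workflow boundaries tend to be diluted, while transitions inside a workflow keep co-occurring with high probability.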
Feedback & Next Steps
We developed Workflows because they provided value to us when looking at our own systems. In some cases we found workflows that we didn’t know about, and they taught us something about code execution paths in our own systems. We’d love feedback from others on whether this kind of visualization is useful and valuable for them as well.
Future work includes using anomaly detection to detect workflow deviations. We also think it is possible to improve the mining process by automatically identifying forks and concurrent execution inside workflows.
We’re researching adding tracing support to Unomaly, which would open up the possibility of running hybrid models that combine results from software instrumented with tracing with results from non-instrumented software, for example from vendors or libraries.
We’re really curious what you think of this first iteration of Workflows. If you have used it and found it useful (or completely useless), hop into our Unomaly-friends Slack channel and let us know about your experiences and what functionality you believe we should investigate further.