Observability on unstructured logs: it’s possible!
Unomaly extracts the structure of unstructured logs, matches them to what we call profiles, and detects never-before-seen events in your infrastructure.
While very heavily hyped and marketed by the new players, observability represents a significant shift in the way we try to make our software reliable. This shift needs to happen for a simple reason: we don’t (can’t?) understand our systems anymore.
Why observability is here to stay
Software used to be simple enough that one could monitor its state to know whether something was wrong and immediately know why it was broken. One could create a set of alerts based on known errors or known metrics and react appropriately as soon as one of those alerts was triggered. But in today’s systems, there are hundreds or thousands of potential causes, and they are not known in advance. No number of dashboards or alerts can solve this fundamental problem. We need to change our approach and gain the ability to ask any question of our systems, observing them while they’re running.
High cardinality is key
Observability is about being able to look at all the data a system produces and having the tools to slice and dice it in a meaningful way, whether that means checking the HTTP status codes our API returns (a classic low-cardinality problem that metrics and time series databases are built for) or asking for the top 100 users experiencing latencies above 100 ms. The latter instinctively sounds like a query we shouldn’t be “allowed” to ask, as we’ve become so used to thinking carefully about our metrics and to not creating too many Prometheus labels :-)
As Cindy Sridharan puts it, “Understanding complex systems becomes much easier when high cardinality isn't something you "curtail" or limit, but something you embrace.” If you aggregate data at write time, you’ve lost.
I have great respect for the Dropbox engineering team, but respectfully, this screams square peg in a round hole ☹️
— Cindy Sridharan (@copyconstruct) November 20, 2019
Understanding complex systems becomes much easier when high cardinality isn't something you "curtail" or limit, but something you embrace. https://t.co/K8V7C6ykh3
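To make the write-time aggregation point concrete, here is a minimal Python sketch. The events and field names (`user`, `status`, `latency_ms`) are made up for illustration. Once only a per-status counter is kept, the "top users above 100 ms" question can no longer be answered; with the raw events it is a few lines.

```python
from collections import Counter

# Hypothetical raw request events; the field names are illustrative only.
events = [
    {"user": "alice", "status": 200, "latency_ms": 35},
    {"user": "bob", "status": 200, "latency_ms": 180},
    {"user": "bob", "status": 500, "latency_ms": 240},
    {"user": "carol", "status": 200, "latency_ms": 90},
    {"user": "alice", "status": 200, "latency_ms": 130},
    {"user": "bob", "status": 200, "latency_ms": 20},
]

# Write-time aggregation: only a per-status counter survives.
# The user and latency dimensions are gone for good.
status_counts = Counter(e["status"] for e in events)


def top_slow_users(events, threshold_ms=100, limit=100):
    """Rank users by how many of their requests were slower than threshold_ms."""
    slow = Counter(e["user"] for e in events if e["latency_ms"] > threshold_ms)
    return slow.most_common(limit)


print(status_counts)           # answers the low-cardinality question
print(top_slow_users(events))  # only answerable because raw events were kept
```

The counter is cheap to store, but it fixes the questions you can ask at the moment you write the data. Keeping the events moves that decision to read time.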
Getting there with logs
Now that we’ve cleared up some definitions and shown the limits of write-time aggregation, let’s talk about what you should do: implement distributed tracing! Structure your logs… if you can. The truth is that the vast majority of the software we run is provided by third parties and that 99% of the signal it produces is… unstructured logs. So… what if I told you that you can get observability on your “old” logs? Meet the profile details:
Searching for all the error logs on one of our deploys. We can see directly that our metrics collection task fails periodically :-)
Want to see the top URLs your ingress proxy is serving by IP? You got it. Want to see all the calls from a specific IP? Just check your access log details. Want to see the trend of a specific error for users in France over the last 7 days? That works, too.
All of this with no parsing, no user input required, no new instrumentation. Just your existing logs, whether they’re structured or not (the right data needs to be in the logs though! Log more context!).
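For intuition about how structure can be recovered from free-text logs, here is a deliberately naive Python sketch. It is not Unomaly's algorithm. It assumes that anything looking like an IP address, a hex id, or a number is a parameter, and that the rest of the line is the template (a rough stand-in for a profile). The names `to_profile` and `NoveltyDetector` are invented for this example.

```python
import re

# Tokens treated as variable: IPv4 addresses, hex ids, plain numbers.
# These rules are illustrative assumptions, not a complete parser.
VARIABLE = re.compile(r"\b(?:\d{1,3}(?:\.\d{1,3}){3}|0x[0-9a-fA-F]+|\d+)\b")


def to_profile(line):
    """Split a raw log line into a template and its list of parameters."""
    params = VARIABLE.findall(line)
    template = VARIABLE.sub("<*>", line)
    return template, params


class NoveltyDetector:
    """Remembers every template seen so far and flags the first occurrence of each."""

    def __init__(self):
        self.known = set()

    def observe(self, line):
        template, params = to_profile(line)
        is_new = template not in self.known
        self.known.add(template)
        return template, params, is_new


detector = NoveltyDetector()
for line in [
    "Accepted connection from 10.0.0.7 port 52214",
    "Accepted connection from 192.168.1.20 port 4022",
    "Metrics collection task failed after 30 seconds",
]:
    print(detector.observe(line))
```

The two connection lines collapse into the same template with different parameters, so only the first one is flagged as new. The failure line has a template that was never seen before, so it is flagged as well.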
It’s also powered by our custom event datastore, and it’s very, very fast. Combined with our detection of new events, it makes investigation so much easier and more natural.
Breaking down our access logs by HTTP status code takes milliseconds.
From the group-by visualization we can isolate the 500s we get and then look at the raw events to see what happened.
The profile details view lets you query and see all the data for a given log type (what we call a profile). You can filter and group by any field, so that you can compare events and isolate the data you need. If you need to see the actual full events, expand the Events section and the logs will be there. It’s just like Google Analytics, but for your logs. And that makes a lot of sense, doesn’t it?
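The filter and group-by workflow can be pictured with a few lines of Python. This is a sketch over made-up access-log events whose fields (`ip`, `path`, `status`) are assumed to be already extracted. The helper names are hypothetical and say nothing about how the datastore works internally.

```python
from collections import Counter

# Hypothetical access-log events with fields already extracted.
events = [
    {"ip": "10.0.0.7", "path": "/login", "status": 200},
    {"ip": "10.0.0.7", "path": "/cart", "status": 500},
    {"ip": "10.0.0.9", "path": "/login", "status": 200},
    {"ip": "10.0.0.9", "path": "/cart", "status": 500},
    {"ip": "10.0.0.9", "path": "/search", "status": 404},
]


def filter_events(events, **conditions):
    """Keep the events whose fields equal every given condition."""
    return [
        e for e in events
        if all(e.get(field) == value for field, value in conditions.items())
    ]


def group_count(events, field):
    """Count events per distinct value of the given field."""
    return Counter(e.get(field) for e in events)


# Break down by status code, isolate the 500s, then look at the raw events.
print(group_count(events, "status"))
errors = filter_events(events, status=500)
print(group_count(errors, "path"))
for event in errors:
    print(event)
```

Because both helpers take any field name, the same two calls cover "top URLs by IP", "all calls from one IP", and "which paths return 500s" without defining anything up front.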
Web analytics have shown us for well over a decade that they’re golden when it comes to understanding our users. Why don’t we start applying the same patterns to our software? It’s almost as complex as people.