Even the most coordinated teams could use more insight
When your environment is successfully serving users and requests
You will want to see the patterns that make everything tick, down to every event on every dependency.
Then when a deployment causes a production issue
Identify which new errors and tracebacks were produced, when and where. Act before the problem gets worse.
You may find that a memory allocation issue started it all.
After a failure, go back in time and review how the anomalies happened across multiple dependencies over time.
Don’t forget to document your fix.
Tie the learnings from incidents, postmortems, etc. to the data by adding knowns and tags.
Finally, if the problem recurs, alert the right person.