Monitoring is one of the most important parts of maintaining a product. Knowing when something breaks or degrades performance for your users is incredibly important, and to investigate efficiently, you'll need to do some work upfront that will help you through hard times.
Observability, knowing what's happening in your systems, can be achieved through several methods. Usually, alerts will notify you that something is off, but the real work only starts then. Do you have logs in place that allow you to dig through what happened in your system leading up to the alert? Do you have detailed error reports and stack traces that pinpoint the failing component? Do you have distributed traces that cross service boundaries?
A couple of months ago I published a post on using New Relic to scale observability infrastructure quickly, so some concepts in this post might be influenced by how you deal with data in New Relic's telemetry data platform.
Logs: Constant streams of noise
Out of all the options, I've personally found logs to be the least useful when something goes wrong, and the most difficult to maintain. Logs are unstructured unless you conform to the format of the platform you're shipping them to, they're large in volume, and they need to be compacted or rotated so your infrastructure doesn't overflow. Put simply: logs are in the way when you don't need them, yet compared to other solutions, they don't help you nearly enough to debug a problem.
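Much of that pain comes from shipping free-form text. A minimal sketch of the alternative, structured logging, using only Python's standard library (the field names and logger name are made up for illustration):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, so a log
    platform can index fields instead of parsing free text."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Pick up structured context passed via `extra=...`
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context travels as fields, not as words buried in the message.
logger.info("payment accepted", extra={"user_id": "u-42", "request_id": "r-1"})
```

Structured output doesn't make logs cheap, but it at least makes them queryable when you do need them.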
Distributed Tracing: All your services, end-to-end
Adding traces to your web services lets you capture which tasks run within a given transaction (a web request, or a background task like a cron job or worker) and how long they take. Additionally, you can enrich traces with custom attributes for even more detail, like which user performed an action.
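In practice a vendor agent or OpenTelemetry SDK does this for you, but the core idea fits in a few lines. A toy sketch (the `span` helper and attribute names are hypothetical, not a real vendor API):

```python
import time
from contextlib import contextmanager

# Collected spans; a real agent would ship these to the telemetry platform.
spans = []

@contextmanager
def span(name, **attributes):
    """Record a named unit of work, how long it took, and custom attributes."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append({
            "name": name,
            "duration_ms": (time.monotonic() - start) * 1000,
            "attributes": attributes,
        })

# Nested spans model a transaction: the request wraps the database call.
with span("GET /orders", user_id="u-42"):
    with span("db.query", statement="SELECT * FROM orders"):
        time.sleep(0.01)  # stand-in for real work
```

The custom `user_id` attribute is what later lets you filter traces down to a single user's actions.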
When you deal with multiple services, even if you just have a frontend and a backend application, you can make use of distributed tracing: all of your traces are stitched together into one activity stream for the duration of the same transaction. You can instrument every service to attach its own spans to the ongoing transaction, and thus create a service graph, telling you instantly which parts of your infrastructure were involved in a specific action.
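The stitching works by passing a trace identifier along with every cross-service call. A minimal sketch of that propagation, with hypothetical helper names (real systems use the W3C `traceparent` header, which carries more than just an id):

```python
import uuid

def start_transaction():
    """The first service in the chain mints a new trace id."""
    return {"trace_id": uuid.uuid4().hex, "spans": []}

def outgoing_headers(txn):
    """Attach the trace id to any outgoing HTTP call."""
    return {"traceparent": txn["trace_id"]}

def handle_request(headers, service_name):
    """A downstream service continues the caller's trace
    instead of starting a fresh one."""
    return {"trace_id": headers["traceparent"], "spans": [service_name]}

# The frontend starts the transaction; the backend joins it.
frontend = start_transaction()
backend = handle_request(outgoing_headers(frontend), "backend-api")
# Both halves now share one trace_id, so the platform can stitch them together.
```

Because every service reports spans under the same id, the telemetry platform can reassemble the full end-to-end picture.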
Errors: Straight to the root cause
Last but not least, errors are among the most valuable data your applications can report: whenever something goes wrong, you should be able to investigate it quickly. Increased error rates should trigger alerts, notifying you of the situation.
When you open up your monitoring solution and look at the reported error, you should see a stack trace telling you where the error occurred, plus contextual information, for example, which user it happened for, or which arguments were passed to the failing function. Don't skimp on details: include everything you need to get to the root of the issue quickly.
Once you've found the perfect balance of which details to include, one error log can tell you the whole story. And if you're dealing with multiple intertwined services, you can simply include the transaction details from distributed tracing, which you can use to jump to the recorded trace.
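A sketch of what such an error report could contain, built with only the standard library (the `report_error` helper, `charge` function, and context fields are made-up examples, and a real agent would capture and send this automatically):

```python
import traceback

def report_error(exc, **context):
    """Bundle what a monitoring agent would send: the stack trace
    plus contextual attributes such as the user, the failing
    arguments, and the distributed-tracing transaction id."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "stack_trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "context": context,
    }

def charge(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")

try:
    charge(-5)
except ValueError as e:
    # Including the trace id links this error to the recorded trace,
    # so you can jump from the report straight to the full transaction.
    report = report_error(e, user_id="u-42", amount=-5, trace_id="4bf92f3577b34da6")
```

With the trace id in the context, one error report carries the whole story: what failed, for whom, with which inputs, and where to find the surrounding transaction.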
I think having a robust setup that not only lets you know when things go awry but also gives you all the direction you need to investigate is extremely valuable. While it might take some time to get right, it will save you a lot of time when it's most critical.