Monitoring and alerting tools overview
You’ve got online services used by your end-users, or maybe you have mission-critical internal infrastructure. You need them to be online almost all the time. In short, you need monitoring tools.
Setting up a monitoring system and tailoring it to your specific needs is not an easy task. There are numerous aspects you need to factor in, such as the complexity of the infrastructure you need to monitor and the malleability of all the tools you need in order to have an effective monitoring solution.
Luckily, there’s been a Cambrian explosion of all kinds of monitoring and alerting tools in recent years, and what’s even better, most of the modern tools are interoperable and allow for easy integration with one another. When it comes to designing an effective monitoring and alerting solution for a particular service, no one size fits all, so be prepared for a fair bit of planning ahead.
What to monitor?
The technical aspect is just one piece of the puzzle. The other (and perhaps more challenging) part is figuring out what you should monitor, at what intervals, and what the on-call schedule and alerting should look like. A general rule of thumb is to extract the key parameters of the service you’re responsible for, such as uptime, response time, and the number of transactions per second or minute, but don’t stop at the technical parameters. Often, it’s a smart idea to monitor business-related performance indicators as well, such as the number of new signups in a specific time frame, or the number of new incoming support tickets.
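To make the rule of thumb above a little more concrete, here is a minimal Python sketch of how uptime, average response time, and a per-minute rate could be derived from raw measurements. The `Check` record, the `summarize` and `rate_per_minute` helpers, and the sample values are all hypothetical and only serve as an illustration; a real setup would pull these numbers from whichever monitoring tool you end up using.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Check:
    """One probe result (hypothetical record shape, for illustration only)."""
    timestamp: float    # seconds since the start of the observation window
    ok: bool            # did the service answer successfully?
    response_ms: float  # measured response time in milliseconds


def summarize(checks):
    """Return uptime percentage and average response time of successful checks."""
    if not checks:
        raise ValueError("at least one check is required")
    successes = [c for c in checks if c.ok]
    uptime_pct = 100.0 * len(successes) / len(checks)
    # Failed checks have no meaningful response time, so they are left out.
    avg_ms = mean(c.response_ms for c in successes) if successes else None
    return {"uptime_pct": uptime_pct, "avg_response_ms": avg_ms}


def rate_per_minute(event_count, window_seconds):
    """Events per minute; works for transactions, signups, or support tickets."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return event_count * 60.0 / window_seconds


# Made-up sample data: four probes, one of which failed.
checks = [
    Check(0, True, 100.0),
    Check(60, True, 300.0),
    Check(120, False, 0.0),
    Check(180, True, 200.0),
]
summary = summarize(checks)
signups_per_minute = rate_per_minute(120, 3600)  # 120 signups in one hour
```

The same `rate_per_minute` helper covers both technical and business indicators, which is the point: a sudden drop in signups per minute can be as telling as a spike in response time.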
Who is responsible for incident response?
Once you’ve figured all of that out, the next step is to decide who gets paged and when. Again, this depends on the overall complexity of the service and is likely tied to your service level agreements, among other things. Sometimes, having a few employees receive notification emails during the day is sufficient. Other times, you might need precisely defined rotating on-call shifts in a follow-the-sun fashion, with multiple geographically distributed Ops teams getting alerts in real time.
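As a rough illustration of the two ends of that spectrum, the following Python sketch routes an alert either to email or to a page, and picks the team on shift using a simple follow-the-sun schedule. The team names, the eight-hour UTC shift boundaries, and the severity labels are assumptions made up for this example, not a recommendation for any particular organization.

```python
from datetime import datetime, timezone

# Hypothetical follow-the-sun schedule: (start hour, end hour, team), in UTC.
SHIFTS = [
    (0, 8, "apac-ops"),
    (8, 16, "emea-ops"),
    (16, 24, "amer-ops"),
]


def on_call_team(now):
    """Return the team whose shift covers the given timezone-aware datetime."""
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("schedule does not cover hour %d" % hour)


def route_alert(severity, now):
    """Decide how an alert is delivered and which team receives it.

    Only alerts labeled "critical" page someone in real time; everything
    else becomes an email (an assumption made for this sketch).
    """
    channel = "page" if severity == "critical" else "email"
    return channel, on_call_team(now)


# Example: a critical alert late in the UTC day and a warning early in it.
late = datetime(2024, 1, 1, 23, 30, tzinfo=timezone.utc)
early = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
critical_route = route_alert("critical", late)
warning_route = route_alert("warning", early)
```

Keeping the schedule in a plain data structure such as `SHIFTS` makes it easy to adjust as teams and service level agreements change, without touching the routing logic.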
Implementing effective monitoring and alerting systems and procedures can take a lot of time and money, likely with a few hiccups along the way. Also, most of the procedures and tools are not and should not be set in stone - be prepared to learn from your failures and adapt accordingly! New tools, industry best practices, and insights are popping up almost daily, so make sure you stay in the loop with recent developments. On the other hand, don’t jump ship and adopt new stuff the moment some new tool starts gaining significant popularity. By that point, you’ve likely already invested a lot of resources into establishing proper, well-functioning incident response teams and procedures. If the shiny new thing doesn’t add genuine value to your existing setup or doesn’t make some part of the daily operations significantly easier, chances are it’s not worth the risk of an unplanned outage due to a lack of training or familiarity with the cool new kid in town.