Tech · 3 minute read

Monitoring and alerting tools overview – pt2

Monitoring tools have been around for a couple of decades. What started as a loosely coupled collection of Bash and Perl scripts checking Linux system parameters through cron and emailing system administrators has morphed into a sophisticated ecosystem of monitoring, alerting and analytics behemoths.

It all started in the 90s, when large IT systems became complex enough that centralized monitoring solutions had to be implemented to keep critical infrastructure operational and as stable as possible. Many of those internal projects evolved into open-source, publicly available monitoring tools such as Zabbix and Nagios (with Icinga being the notable fork of Nagios). Nowadays, most of these projects have commercial enterprise counterparts as well.

These tools generally work as follows: a central server is hosted on dedicated instances, and monitoring agents are deployed to the devices being monitored. Typically those are servers running Unix- or Windows-based operating systems, but they can also be network devices (switches, routers) and various other kinds of appliances. The agents continuously collect information about the state of the production systems; some of the most commonly monitored parameters are CPU time consumption, RAM usage, available disk space, and network interface utilization. When a monitored parameter exceeds its configured threshold, an alert is dispatched to a designated person.
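The collect-compare-alert loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's agent: the threshold values are hypothetical, and it assumes a Unix-like host (`os.getloadavg` is not available on Windows).

```python
import os
import shutil

# Hypothetical thresholds; real monitoring tools make these configurable per host.
THRESHOLDS = {
    "disk_used_pct": 90.0,  # alert when disk usage exceeds 90%
    "load_per_cpu": 2.0,    # alert when 1-min load average per CPU exceeds 2.0
}

def collect_metrics(path="/"):
    """Collect a minimal set of system parameters, as a monitoring agent would."""
    usage = shutil.disk_usage(path)
    disk_used_pct = usage.used / usage.total * 100
    load_per_cpu = os.getloadavg()[0] / (os.cpu_count() or 1)
    return {"disk_used_pct": disk_used_pct, "load_per_cpu": load_per_cpu}

def check_thresholds(metrics, thresholds):
    """Return a (parameter, value, limit) tuple for every threshold violation."""
    return [
        (name, metrics[name], limit)
        for name, limit in thresholds.items()
        if metrics[name] > limit
    ]

if __name__ == "__main__":
    for name, value, limit in check_thresholds(collect_metrics(), THRESHOLDS):
        # A real agent would dispatch an email or a page here, not just print.
        print(f"ALERT: {name} = {value:.1f} exceeds threshold {limit}")
```

Run periodically from cron and pointed at an email command instead of `print`, this is essentially the 90s-era setup the article opens with.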

In the past, an alert was usually a simple email sent to a predefined set of addresses, containing basic information about the alert and the exact values captured by the agent; the violations were then dealt with by the system administrators responsible for the given infrastructure. Nowadays, alerts have moved from emails to phone notifications and are typically dispatched through commercial, cloud-based alerting services such as PagerDuty and OpsGenie, which allow for much more sophisticated alerting capabilities: predefined on-call schedules, repeated alerts, multiple levels of escalation, and so on. In addition to notifications popping up on engineers' phones, many modern environments employ Slack as an additional channel for alert tracking - the alerts simply appear as new messages in designated channels. This practice is also known as ChatOps.

Modern monitoring tools, such as Prometheus combined with its companion Alertmanager, have brought a whole new array of advanced capabilities, including alert grouping, alert inhibition, native support for time series data storage, push- and pull-based data collection, and extremely sophisticated graphing (a notable graphing tool by the name of Grafana has emerged as the de facto standard web application for analytics and interactive visualization).
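To make grouping and inhibition concrete, here is a sketch of what those features look like in an Alertmanager configuration file. The receiver name and email address are placeholders; a real setup would point at an actual team and often at a paging service instead.

```yaml
route:
  receiver: ops-team
  group_by: ['alertname', 'cluster']  # batch related alerts into one notification
  group_wait: 30s                     # wait briefly so related alerts group together
  group_interval: 5m
  repeat_interval: 4h                 # re-notify if the alert keeps firing

receivers:
  - name: ops-team
    email_configs:
      - to: 'ops@example.com'         # placeholder address

inhibit_rules:
  # Suppress warning-level alerts for a target that is already firing
  # a critical alert with the same name and instance labels.
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['alertname', 'instance']
```

Grouping keeps a cascading failure from generating hundreds of individual pages, while inhibition silences lower-severity noise once a more serious alert for the same component is already firing.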

Most of the cloud providers are joining the race with their cloud-based monitoring offerings as well, including Amazon’s CloudWatch and Microsoft’s Azure Monitor. Modern DevOps and SRE practices demand modern and often quite complex monitoring solutions, making the monitoring landscape an extremely competitive and rapidly evolving area of the operations world.

Check out our other tech blog posts!