Monitoring and Alerting tools – pt3 – Modern monitoring setup
Building a modern monitoring solution for your business or application is not an easy task. In fact, it’s a significant undertaking that should be treated as its own project! This article goes into the details of what a modern monitoring setup consists of and how the interdependencies should be managed. It focuses on modern tools and services and assumes no legacy monitoring solution exists.
The setup outlined below is usually sufficient for small to medium-sized production deployments. It is able to accommodate most monitoring needs in typical scenarios. More complex production systems will likely need additional services and significant redundancy across all node categories. Those setups are not the focus of this article.
Central instance - Prometheus server
The Prometheus server is the central component of the setup: it collects metrics from production instances and stores them as time series data. It also ships with PromQL, a powerful query language for selecting and aggregating the scraped datasets.
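To make the querying side more concrete, here is a minimal sketch of how a script might talk to the Prometheus HTTP API. It assumes a server reachable at http://localhost:9090 and uses the node_load1 metric as an example. The helper names and the sample response body are hypothetical and written by hand for illustration; only the /api/v1/query endpoint and the general response shape come from Prometheus itself.

```python
import json
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> dict:
    """Map each returned series' `instance` label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = {}
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, value = series["value"]
        results[series["metric"].get("instance", "")] = float(value)
    return results


# Hypothetical response body, hand-written for illustration only.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "web-1:9100"}, "value": [1700000000, "0.42"]},
            {"metric": {"instance": "web-2:9100"}, "value": [1700000000, "1.37"]},
        ],
    },
})

if __name__ == "__main__":
    print(build_instant_query_url("http://localhost:9090", "node_load1"))
    print(parse_instant_vector(SAMPLE_BODY))
```

In a real setup the body would come from an HTTP GET against the generated URL rather than from a hard-coded sample.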
Alertmanager - alert routing instance
While it’s not uncommon for Alertmanager to reside on the same host as Prometheus, Alertmanager should still be viewed as a separate service. Its job is to take alerts coming from Prometheus instances and route them to the appropriate receivers, whether those are emails, Slack notifications, or external commercial alerting services such as OpsGenie or PagerDuty. It also comes with advanced alert-related capabilities such as alert grouping and deduplication, which reduce alert noise and ensure that on-call teams get just the alerts they need.
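The idea behind grouping and deduplication can be sketched in a few lines. This is a simplified illustration, not Alertmanager's actual implementation: it assumes an alert is identified by its full label set, and the group_alerts helper and the sample alerts are hypothetical. The group_by parameter only loosely mirrors the setting of the same name in Alertmanager's routing configuration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts by selected labels, dropping exact duplicates."""
    groups = defaultdict(list)
    seen = set()
    for labels in alerts:
        # Treat the full label set as the alert's identity; identical sets are duplicates.
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Alerts sharing the same values for the group_by labels land in one group.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts, e.g. as sent by two redundant Prometheus servers.
ALERTS = [
    {"alertname": "HighLoad", "cluster": "prod", "instance": "web-1"},
    {"alertname": "HighLoad", "cluster": "prod", "instance": "web-1"},  # duplicate
    {"alertname": "HighLoad", "cluster": "prod", "instance": "web-2"},
    {"alertname": "DiskFull", "cluster": "prod", "instance": "db-1"},
]

if __name__ == "__main__":
    for key, members in group_alerts(ALERTS, ("alertname", "cluster")).items():
        print(key, [alert["instance"] for alert in members])
```

With this grouping, the on-call engineer would receive one notification per group (one for HighLoad, one for DiskFull) instead of one per individual alert.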
node_exporter - exporter agents
Every host that’s monitored needs to make its metrics available to the Prometheus instances, and that’s where node_exporter comes in. It runs on each monitored production host and exposes host-level metrics (CPU, memory, disk, network) over an HTTP endpoint. Prometheus works on a pull model: the Prometheus instance scrapes each node_exporter endpoint at a configured interval, and node_exporter itself does not push data. For cases where scraping is not practical, such as short-lived batch jobs, metrics can be pushed to the separate Pushgateway component, which Prometheus then scrapes like any other target. Both approaches have their pros and cons, and the exact scrape interval depends on the nature of hosts and services being monitored. It’s important to note that node_exporter is not the only agent that works with Prometheus. For example, Telegraf is a well-known alternative.
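What an exporter exposes is plain text, one sample per line. The sketch below parses a small hand-written excerpt of that text format to show what Prometheus receives on each scrape. The parse_metrics helper and the sample values are hypothetical, and the parser is deliberately simplified: it ignores timestamps and does not unescape label values. By default node_exporter serves this kind of output on port 9100 under /metrics.

```python
import re

# Matches `name{labels} value` or `name value`; anything after the value is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hand-written excerpt in the style of node_exporter output (values are made up).
SAMPLE_TEXT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
"""

if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE_TEXT):
        print(sample)
```

Prometheus does this parsing itself on every scrape; the example only illustrates the shape of the data that travels between the agent and the server.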
Grafana - web UI
While it’s not technically necessary to have a sophisticated web GUI for viewing metrics and incoming alerts (the built-in Prometheus web UI can handle both of them just fine), it’s usually a good idea to have a dedicated app for enhanced viewing and visualization capabilities. That’s where Grafana comes in. It comes with advanced graphing and visualization facilities and is able to query Prometheus instances directly, so the raw time series data can be presented in many different ways. Grafana does this through dedicated dashboards that can be adjusted in just about every way. They usually consist of various text-based lists, tables, dual-axis graphs, etc. By making good use of Grafana dashboards, modern SRE and DevOps teams can significantly reduce their Time To Resolve metrics and improve the overall stability of production systems.
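A graph panel ultimately needs a list of (timestamp, value) points per series. The sketch below shows roughly the kind of range query a dashboard tool could issue against Prometheus and how the result could be reshaped for plotting. It is an illustration under assumptions, not Grafana's internal code: the helper names, the server address, and the sample payload are hypothetical, while the /api/v1/query_range endpoint and its query, start, end, and step parameters belong to the Prometheus HTTP API.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start, end, step_seconds):
    """Build a URL for the Prometheus range-query endpoint (/api/v1/query_range)."""
    params = {"query": promql, "start": start, "end": end, "step": step_seconds}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def matrix_to_series(payload):
    """Convert a matrix result into {instance: [(timestamp, value), ...]}."""
    series = {}
    for item in payload["data"]["result"]:
        # Range queries return a list of [unix_timestamp, "value-as-string"] pairs.
        points = [(float(ts), float(value)) for ts, value in item["values"]]
        series[item["metric"].get("instance", "")] = points
    return series


# Hypothetical decoded response, hand-written for illustration only.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"instance": "web-1:9100"},
                "values": [[1700000000, "0.42"], [1700000030, "0.50"]],
            }
        ],
    },
}

if __name__ == "__main__":
    print(build_range_query_url(
        "http://localhost:9090", "node_load1", 1700000000, 1700000060, 30))
    print(matrix_to_series(SAMPLE_PAYLOAD))
```

The step parameter controls the resolution of the returned series, which is why zooming in or out on a dashboard typically changes how many points each graph contains.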