We decided to showcase an older but interesting case: a way to ensure that every server in a global fleet was either fully production-ready or cleanly out of rotation. It is an elegant use of Nagios that avoids over-engineering the solution.

The Challenge
When you run 5,000+ Linux servers across 60+ data centers, the margin for error is razor-thin. Moreover, the existing monitoring setup couldn’t capture the nuance of multiple server roles and service dependencies. Single-point checks were noisy, false positives were common, and ops teams wasted cycles troubleshooting healthy nodes.
Therefore, the client needed automation that could scale, adapt to role-specific checks, and deliver clear signals to Nagios.
Our Solution
We built check_cdn_server_heartbeat – a custom Nagios plugin designed to validate production readiness across the global infrastructure.
Key capabilities:
- Role-aware checks – Edge, frontend, backend, and DNS resolver nodes all validated against their unique service sets.
- Dependency validation – Routing (Bird), databases (MySQL), caches/proxies (Nginx, Varnish), and DNS (Unbound) checked in context.
- Smart state reporting – Service states aggregated into a binary string (e.g. 1111) and mapped to Nagios exit codes:
  - ✅ OK – all critical services online
  - ⚠️ Warning – services intentionally stopped for maintenance/removal
  - ❌ Critical – unexpected service failure or misalignment
- Straightforward integration – Written in Bash, built on standard Nagios tools (check_nrpe, check_http, fping).
The plugin shipped as open source under the MIT license, giving both internal teams and the broader community a reusable monitoring tool.
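The role-aware checks and state reporting described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the function names (services_for_role, bits_for_role, state_from_bits) and the role-to-service mapping are hypothetical, and the rule "all up is OK, all down is Warning, any mix is Critical" is an assumed reading of the states listed above. Only the exit codes (0, 1, 2, 3) are standard Nagios convention.

```bash
#!/usr/bin/env bash
# Sketch only: names and role mappings are hypothetical, not from the plugin.

# Standard Nagios plugin exit codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Hypothetical role-to-service mapping; the real service sets may differ.
services_for_role() {
    case "$1" in
        edge)     echo "bird nginx varnish unbound" ;;
        frontend) echo "nginx varnish" ;;
        backend)  echo "nginx mysql" ;;
        resolver) echo "bird unbound" ;;
        *)        return 1 ;;
    esac
}

# Build one bit per service for a role (1 = healthy, 0 = not healthy).
# probe_cmd is any command that exits 0 when the named service is healthy;
# in practice it would wrap a check_nrpe or check_http call.
bits_for_role() {
    local role="$1"
    local probe_cmd="$2"
    local services svc bits=""
    services=$(services_for_role "$role") || return 1
    for svc in $services; do
        if "$probe_cmd" "$svc" >/dev/null 2>&1; then
            bits="${bits}1"
        else
            bits="${bits}0"
        fi
    done
    printf '%s\n' "$bits"
}

# Map a bit string to a Nagios state via the return code.
# Assumed interpretation: all 1s -> OK, all 0s -> WARNING (node cleanly
# out of rotation), any mix of 0s and 1s -> CRITICAL (misaligned node).
state_from_bits() {
    case "$1" in
        ""|*[!01]*) return "$STATE_UNKNOWN" ;;   # empty or malformed input
    esac
    case "$1" in
        *0*1*|*1*0*) return "$STATE_CRITICAL" ;;
        *0*)         return "$STATE_WARNING" ;;
        *)           return "$STATE_OK" ;;
    esac
}
```

In a real plugin the last step would print a one-line status message and exit with the code returned by state_from_bits, so that Nagios treats a half-running node as a hard failure and a fully stopped node as a softer, maintenance-style signal.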
The Impact
By deploying the plugin, the client’s ops team gained:
- Higher accuracy – Service dependencies are validated together, which cuts down on misleading alerts.
- Less manual effort – Automated health checks replaced ad-hoc node validation.
- Faster incident detection – Misaligned nodes surfaced quickly.
- Simplified configs – One plugin replaced multiple role-specific scripts.
- Open source contribution – A reusable asset for anyone running a complex distributed infrastructure.
In short: a clearer monitoring signal, less noise, and healthier infrastructure across thousands of global servers.
Technology Snapshot
- Monitoring: Nagios / NRPE
- Language: Bash
- Dependencies: check_http, check_nrpe, fping
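Because the plugin relies on external tools, a wrapper of this kind would typically verify that they are present before running any checks. The sketch below is an assumption about how that could look, not the plugin's own logic: missing_tools is a hypothetical helper, and the plugin directory shown in the usage comment varies by distribution. Exit code 3 (UNKNOWN) is the standard Nagios convention for a check that could not run.

```bash
#!/usr/bin/env bash
# Sketch only: missing_tools is a hypothetical helper, not part of the plugin.

# Standard Nagios exit code for "the check itself could not run".
STATE_UNKNOWN=3

# Print the names of tools found neither in plugin_dir nor on PATH.
# Returns 0 when everything is present, 1 otherwise.
missing_tools() {
    local plugin_dir="$1"
    local tool missing=""
    shift
    for tool in "$@"; do
        if [ ! -x "$plugin_dir/$tool" ] && ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing:+$missing }$tool"
        fi
    done
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Example usage (the directory is an assumption; adjust for your system):
#   if ! missing=$(missing_tools /usr/lib/nagios/plugins check_http check_nrpe fping); then
#       echo "UNKNOWN - missing dependencies: $missing"
#       exit "$STATE_UNKNOWN"
#   fi
```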