Custom Nagios Health Check for Client’s Global Infrastructure

We decided to showcase an older but interesting case: a health check that ensured every server in a global fleet was either fully production-ready or cleanly out of rotation. It is an elegant way to use Nagios without over-engineering the solution.

The Challenge

When you run 5,000+ Linux servers across 60+ data centers, the margin for error is razor-thin. The client's existing monitoring setup couldn't capture the nuance of multiple server roles and service dependencies. Single-point checks were noisy, false positives were common, and ops teams wasted cycles troubleshooting healthy nodes.

The client needed automation that could scale, adapt to role-specific checks, and deliver clear signals to Nagios.

Our Solution

We built check_cdn_server_heartbeat – a custom Nagios plugin designed to validate production readiness across the global infrastructure.

Key capabilities:

Role-aware checks – Edge, frontend, backend, and DNS resolver nodes are each validated against their own service sets.
Dependency validation – Routing (Bird), databases (MySQL), caches/proxies (Nginx, Varnish), and DNS (Unbound) checked in context.
Smart state reporting – Per-service results are aggregated into a binary status string (for example, 1111) and mapped to Nagios exit codes:
- ✅ OK – all critical services online
- ⚠️ Warning – services intentionally stopped for maintenance/removal
- ❌ Critical – unexpected service failure or misalignment
Straightforward integration – Written in Bash, built on standard Nagios tools (check_nrpe, check_http, fping).
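
To make the state-reporting idea concrete, here is a minimal Bash sketch. It is not the plugin's actual source. The function names (services_for_role, probe_service, aggregate_state, state_to_exit_code, run_check), the role-to-service groupings, and the pgrep-based probe are all assumptions made for illustration, as is the reading that an all-zero string means a node was cleanly taken out of rotation. The real plugin relies on check_nrpe, check_http, and fping for its probes. Only the Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are standard.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT the real check_cdn_server_heartbeat source.

# Hypothetical role-to-service mapping. Role names come from the article;
# the groupings and process names are assumptions.
services_for_role() {
    case "$1" in
        edge)     echo "bird nginx varnishd" ;;
        frontend) echo "nginx varnishd" ;;
        backend)  echo "nginx mysqld" ;;
        resolver) echo "bird unbound" ;;
        *)        return 1 ;;
    esac
}

# Stand-in probe (assumption): succeeds if a process with this exact name
# is running locally.
probe_service() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Build a status string with one digit per service: 1 = up, 0 = down.
aggregate_state() {
    local bits="" svc
    for svc in "$@"; do
        if probe_service "$svc"; then
            bits="${bits}1"
        else
            bits="${bits}0"
        fi
    done
    printf '%s\n' "$bits"
}

# Print a status line and return the matching Nagios exit code.
# Assumed mapping: all 1s = OK, all 0s = WARNING, mixed = CRITICAL.
state_to_exit_code() {
    case "$1" in
        "")          echo "UNKNOWN - no services checked";        return 3 ;;
        *[!01]*)     echo "UNKNOWN - malformed state '$1'";       return 3 ;;
        *0*1*|*1*0*) echo "CRITICAL - partial service set ($1)";  return 2 ;;
        *0*)         echo "WARNING - all services stopped ($1)";  return 1 ;;
        *)           echo "OK - all services online ($1)";        return 0 ;;
    esac
}

# Check one role end to end; returns the Nagios code instead of exiting.
run_check() {
    local role="$1" services
    services=$(services_for_role "$role") || {
        echo "UNKNOWN - unsupported role: $role"
        return 3
    }
    # Word splitting is intentional: $services is a space-separated list.
    # shellcheck disable=SC2086
    state_to_exit_code "$(aggregate_state $services)"
}
```

A wrapper script would call run_check with the node's role (for example, run_check edge) and pass the return status to exit, so Nagios receives both the status line and the exit code. Treating any mixed string as critical is what separates a half-working node from one that is fully in or fully out of rotation.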

The plugin shipped as open source under the MIT license, giving both internal teams and the broader community a reusable monitoring tool.

The Impact

By deploying the plugin, the client's ops team gained:

Higher accuracy – Service dependencies are validated together, which significantly reduces misleading alerts.
Less manual effort – Automated health checks replaced ad-hoc node validation.
Faster incident detection – Misaligned nodes surfaced quickly.
Simplified configs – One plugin replaced multiple role-specific scripts.
Open source contribution – A reusable asset for anyone running a complex distributed infrastructure.

In short: a clearer monitoring signal, less noise, and healthier infrastructure across thousands of global servers.