Tech

Custom Nagios Health Check for Client’s Global Infrastructure


We decided to showcase one old but interesting case we had, highlighting a way to ensure proper health checks on a global fleet were either fully production-ready or cleanly out of rotation. In this way, it’s an elegant way to employ Nagios and not over-engineer the solution.

The Challenge

When you run 5,000+ Linux servers across 60+ data centers, the margin for error is razor-thin. Moreover, the existing monitoring setup couldn’t capture the nuance of multiple server roles and service dependencies. Single-point checks were noisy, false positives were common, and ops teams wasted cycles troubleshooting healthy nodes.

Therefore, the client needed automation that could scale, adapt to role-specific checks, and deliver clear signals to Nagios.

Our Solution

We built check_cdn_server_heartbeat – a custom Nagios plugin designed to validate production readiness across the global infrastructure.

Key capabilities:

  • Role-aware checks – Edge, frontend, backend, and DNS resolver nodes all validated against their unique service sets.
  • Dependency validation – Routing (Bird), databases (MySQL), caches/proxies (Nginx, Varnish), and DNS (Unbound) checked in context.
  • Smart state reporting – Binary aggregation (1111) mapped to Nagios exit codes:
    • OK – all critical services online
    • ⚠️ Warning – services intentionally stopped for maintenance/removal
    • Critical – unexpected service failure or misalignment
  • Straightforward integration – Written in Bash, built on standard Nagios tools (check_nrpe, check_http, fping).

The plugin shipped as open source under the MIT license, giving both internal teams and the broader community a reusable monitoring tool.

The Impact

By deploying the plugin, the Client’s ops team gained:

  • Higher accuracy – Service dependencies are validated together, which significantly cuts out misleading alerts.
  • Less manual effort – Automated health checks replaced ad-hoc node validation.
  • Faster incident detection – Misaligned nodes surfaced quickly.
  • Simplified configs – One plugin replaced multiple role-specific scripts.
  • Open source contribution – A reusable asset for anyone running a complex distributed infrastructure.

In short: a clearer monitoring signal, less noise, and healthier infrastructure across thousands of global servers.

Technology Snapshot

  • Monitoring: Nagios / NRPE
  • Language: Bash
  • Dependencies: check_http, check_nrpe, fping

Repo: nagios-plugins/check_cdn_server_heartbeat

BlueGrid.io Content Team

Three people pose together against a plain white background. The woman on the left is smiling with her hand on her hip, while the two men beside her stand closely, one in a hoodie and the other in a plaid shirt.

BlueGrid.io Content Team

BlueGrid.io Team is an editorial collective of engineers, practitioners, and contributors sharing insights across technology, operations, company culture, and the people behind the systems. Content is created through interviews, hands-on experience, internal collaboration, and editorial review, reflecting both how systems are built and how teams work together in real-world environments.

Share this post

Share this link via

Or copy link