How to Monitor Distributed Systems

This guide walks through setting up a complete monitoring stack for a distributed system: Node Exporter, Prometheus, Alertmanager, Blackbox Exporter, Grafana, and an OpenTelemetry Collector.

Assumptions:

  • Linux with systemd (Ubuntu/Debian style paths)
  • One monitoring host: mon-01
  • Two app nodes: app-01, app-02
  • DNS or /etc/hosts resolves those names
  • Ports open:
    • Node Exporter 9100 on app nodes
    • Prometheus 9090, Alertmanager 9093, Grafana 3000, Blackbox 9115, OTel Collector 4317/4318 on mon-01
  • Replace example domains, emails, and endpoints
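
Before installing anything, it can help to confirm these assumptions from mon-01. The following is a minimal sketch rather than part of the stack itself: the function names are illustrative, and it assumes bash (for /dev/tcp), getent, and the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# Preflight sketch: list the endpoints this guide expects and probe them.
# Function names are illustrative, not part of any tool in the stack.

# Print one "host port" pair per line, taken from the assumptions list.
expected_endpoints() {
  cat <<'EOF'
app-01 9100
app-02 9100
mon-01 9090
mon-01 9093
mon-01 3000
mon-01 9115
mon-01 4317
mon-01 4318
EOF
}

# Return 0 if the name resolves via DNS or /etc/hosts.
resolves() {
  getent hosts "$1" >/dev/null
}

# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Walk the list and report each endpoint.
preflight() {
  local host port
  while read -r host port; do
    if ! resolves "$host"; then
      echo "FAIL resolve $host"
    elif port_open "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL connect $host:$port"
    fi
  done < <(expected_endpoints)
}
```

Running preflight before the services exist will report connection failures for every port, which is expected; name resolution should already succeed. Run it again at the end to confirm each listener is reachable.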

1) Node Exporter on every Linux node (app-01, app-02)

1.1 Create user

sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true

1.2 Install binary

cd /tmp
# put node_exporter-*.linux-amd64.tar.gz on the server
tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo chmod 0755 /usr/local/bin/node_exporter

1.3 Systemd unit

Create /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target

1.4 Start

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head
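
Piping to head only shows that the exporter answers. To check that specific collectors are producing data, a small filter over the metrics text can help. This is a sketch: the helper names are made up for illustration, and the metric names listed for the systemd and processes collectors (node_systemd_unit_state, node_processes_pids) are assumptions that may differ between node_exporter versions.

```shell
# Return 0 if the Prometheus text exposition on stdin contains a sample
# for the named metric (a line starting with "name{" or "name ").
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Fetch the metrics page once and report which expected metrics are present.
# $1 = metrics URL (defaults to the local exporter from step 1.4).
check_node_exporter() {
  local url="${1:-http://127.0.0.1:9100/metrics}" body name
  body="$(curl -fsS "$url")" || { echo "cannot fetch $url"; return 1; }
  for name in node_cpu_seconds_total node_systemd_unit_state node_processes_pids; do
    if printf '%s\n' "$body" | has_metric "$name"; then
      echo "found   $name"
    else
      echo "missing $name"
    fi
  done
}
```

Call check_node_exporter on each app node after starting the service. A missing line usually means the corresponding --collector flag was not picked up or the metric has a different name in your version.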

2) Prometheus on mon-01

2.1 Create user and dirs

sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /etc/prometheus/rules /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

2.2 Install binaries

cd /tmp
# put prometheus-*.linux-amd64.tar.gz on mon-01
tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus prometheus-*/promtool /usr/local/bin/
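# newer Prometheus release archives may not ship consoles/ and console_libraries/; skip this line if they are missing
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/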
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus

2.3 Prometheus config

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "app-01:9100"
          - "app-02:9100"
        labels:
          env: "prod"
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service-a.example.com/health
          - https://service-b.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - db-01.example.com:5432
          - mq-01.example.com:5672
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: "blackbox-icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - app-01
          - app-02
          - db-01.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
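
The three relabel rules in each blackbox job do the same thing: copy the listed target into the target URL parameter, use that value as the instance label, and then point the actual scrape at the exporter on 127.0.0.1:9115. The sketch below prints the request that results for one target. The function name is illustrative, and the URL is left unencoded for readability, whereas Prometheus URL-encodes the parameter values.

```shell
# Build the URL that a blackbox scrape resolves to after relabeling.
# $1 = module, $2 = target, $3 = exporter address (defaults to the one
# used in the relabel rules of this guide).
blackbox_probe_url() {
  local module="$1" target="$2" exporter="${3:-127.0.0.1:9115}"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

# Example:
#   blackbox_probe_url http_2xx https://service-a.example.com/health
```

Passing the printed URL to curl on mon-01 reproduces what Prometheus requests, which is a quick way to debug a probe that shows as down.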

2.4 Prometheus systemd unit

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target

2.5 Start

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager

Verify by opening http://mon-01:9090/targets in a browser.
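
The same check can be scripted against the Prometheus HTTP API instead of the web page. This is a rough sketch: the function name is illustrative, and it assumes the /api/v1/targets response is compact JSON with no space after the colon. If jq is installed, parsing with it would be more robust.

```shell
# Count targets reported as down in the JSON from /api/v1/targets (stdin).
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage on mon-01:
#   curl -s http://mon-01:9090/api/v1/targets | count_down_targets
```

A result of 0 means every active target was scraped successfully on the last attempt. Until the Blackbox Exporter from section 5 is running, the blackbox jobs will count as down.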

3) Alertmanager on mon-01

3.1 Create user and dirs

sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

3.2 Install binaries

cd /tmp
# put alertmanager-*.linux-amd64.tar.gz on mon-01
tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool

3.3 Configure Alertmanager

Create /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
route:
  group_by: ["alertname","instance","job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "email-default"
receivers:
  - name: "email-default"
    email_configs:
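      - to: "oncall@example.com"                   # placeholder recipient, replace
        from: "alertmanager@example.com"           # placeholder sender, replace
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"  # placeholder SMTP login, replace
        auth_password: "REPLACE_ME"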
        send_resolved: true

3.4 Systemd unit

Create /etc/systemd/system/alertmanager.service:

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

3.5 Start

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager

Verify by opening http://mon-01:9093 in a browser.

4) Prometheus alert rules (host + probes)

Create /etc/prometheus/rules/distributed.yml:

groups:
- name: distributed

  rules:

  - alert: HostDown
    expr: up{job="node"} == 0
    for: 2m
    labels: {severity: critical}
    annotations:
      summary: "Host down: {{ $labels.instance }}"

  - alert: HostHighCPU
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "High CPU on {{ $labels.instance }}"

  - alert: HostLowMemAvailable
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "Low MemAvailable on {{ $labels.instance }}"

  - alert: HostDiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    for: 15m
    labels: {severity: warning}
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"

  - alert: HostDiskBusy
    expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
    for: 15m
    labels: {severity: warning}
    annotations:
      summary: "Disk busy on {{ $labels.instance }}"

  - alert: ProbeHttpDown
    expr: probe_success{job="blackbox-http"} == 0
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "HTTP probe failed: {{ $labels.instance }}"

  - alert: ProbeTcpDown
    expr: probe_success{job="blackbox-tcp"} == 0
    for: 1m
    labels: {severity: critical}
    annotations:
      summary: "TCP probe failed: {{ $labels.instance }}"

  - alert: ProbeIcmpDown
    expr: probe_success{job="blackbox-icmp"} == 0
    for: 1m
    labels: {severity: warning}
    annotations:
      summary: "ICMP probe failed: {{ $labels.instance }}"

Validate and reload Prometheus:

sudo promtool check rules /etc/prometheus/rules/distributed.yml
sudo promtool check config /etc/prometheus/prometheus.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus
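
To make the HostHighCPU threshold concrete: the expression takes the average idle fraction per instance over 5 minutes and turns it into a busy percentage. The sketch below repeats that arithmetic for a single value using awk. The function names are illustrative and nothing here talks to Prometheus.

```shell
# busy% = 100 - (average idle fraction * 100), as in the HostHighCPU rule.
# $1 = idle fraction between 0 and 1.
cpu_busy_percent() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# Return 0 if the busy percentage is above the rule's threshold of 90.
would_alert_high_cpu() {
  awk -v idle="$1" 'BEGIN { exit !((100 - idle * 100) > 90) }'
}

# Example: an idle fraction of 0.07 is 93.0% busy and crosses the threshold.
#   cpu_busy_percent 0.07
```

The rule also carries for: 10m, so the condition has to hold for ten minutes before the alert fires.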

5) Blackbox Exporter on mon-01

5.1 Create user and dirs

sudo useradd --no-create-home --shell /usr/sbin/nologin blackbox || true
sudo mkdir -p /etc/blackbox_exporter
sudo chown -R blackbox:blackbox /etc/blackbox_exporter

5.2 Install binary

cd /tmp
# put blackbox_exporter-*.linux-amd64.tar.gz on mon-01
tar -xzf blackbox_exporter-*.linux-amd64.tar.gz
sudo cp blackbox_exporter-*/blackbox_exporter /usr/local/bin/
sudo chown blackbox:blackbox /usr/local/bin/blackbox_exporter
sudo chmod 0755 /usr/local/bin/blackbox_exporter

5.3 Config

Create /etc/blackbox_exporter/blackbox.yml:

modules:

  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s

5.4 Systemd unit

Create /etc/systemd/system/blackbox_exporter.service:

[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target

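[Service]
User=blackbox
Group=blackbox
Type=simple
# ICMP probes need raw sockets; grant only that capability to the non-root user
AmbientCapabilities=CAP_NET_RAW
CapabilityBoundingSet=CAP_NET_RAW
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/blackbox_exporter/blackbox.yml \
  --web.listen-address=":9115"
Restart=always
RestartSec=3
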
[Install]
WantedBy=multi-user.target

5.5 Start

sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter
curl -s "http://127.0.0.1:9115/metrics" | head
Verify one probe:
curl -s "http://127.0.0.1:9115/probe?module=http_2xx&target=https://example.com" | head
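
The probe output is longer than ten lines, so head may cut off the probe_success line that actually reports the result. A small filter, sketched below with an illustrative function name, reads the probe output and succeeds only when the probe did.

```shell
# Read blackbox /probe output on stdin; return 0 only if probe_success is 1.
probe_ok() {
  grep -q '^probe_success 1$'
}

# Usage on mon-01:
#   curl -s "http://127.0.0.1:9115/probe?module=http_2xx&target=https://example.com" \
#     | probe_ok && echo "probe up" || echo "probe down"
```

When a probe fails, appending &debug=true to the probe URL makes the exporter print its log for that probe, which usually shows the failing step.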

6) Grafana on mon-01

6.1 Install (Debian/Ubuntu repo method)

Install via your normal package method, then:

sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server --no-pager

6.2 Add Prometheus datasource

In Grafana UI http://mon-01:3000:

  • Connections -> Data sources -> Add data source -> Prometheus
  • URL: http://localhost:9090
  • Save & test
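
As an alternative to clicking through the UI, Grafana can load the data source from a provisioning file at startup. The following is a sketch under assumptions: the function name is illustrative, the default directory is the one used by the Debian/Ubuntu package, and the file name prometheus.yaml is arbitrary.

```shell
# Write a Grafana data source provisioning file into the given directory.
# $1 = target directory (defaults to the Debian/Ubuntu package location).
write_prometheus_datasource() {
  local dir="${1:-/etc/grafana/provisioning/datasources}"
  cat > "$dir/prometheus.yaml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Run the function as root, then restart Grafana with sudo systemctl restart grafana-server so the file is read. The URL matches the one entered manually in the steps above.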

6.3 Import dashboards

Import a dashboard for each exporter:

  • Node Exporter full
  • Blackbox exporter

The easiest way to find them is to search the Grafana dashboard library for:

  • Node Exporter
  • Prometheus Blackbox Exporter

7) OpenTelemetry Collector on mon-01 (traces receiver)

This sets up a collector that receives OTLP traces over gRPC and HTTP. It also exposes a Prometheus scrape endpoint on port 8889.

7.1 Create user and dirs

sudo useradd --no-create-home --shell /usr/sbin/nologin otelcol || true
sudo mkdir -p /etc/otelcol /var/lib/otelcol
sudo chown -R otelcol:otelcol /etc/otelcol /var/lib/otelcol

7.2 Install binary

cd /tmp
# put otelcol-contrib_*_linux_amd64.tar.gz on mon-01
tar -xzf otelcol-contrib_*_linux_amd64.tar.gz
sudo cp otelcol-contrib /usr/local/bin/otelcol-contrib
sudo chown otelcol:otelcol /usr/local/bin/otelcol-contrib
sudo chmod 0755 /usr/local/bin/otelcol-contrib

7.3 Collector config

Create /etc/otelcol/config.yml:

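receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # Writes received telemetry to the collector's log. Recent releases call this
  # exporter "debug"; older ones used "logging" with a "loglevel" setting.
  debug:
    verbosity: basic

  # Prometheus scrape endpoint for metrics received over OTLP
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    # Without a pipeline that uses it, the prometheus exporter never starts
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

This writes traces to the collector log only. Replace debug with your tracing backend exporter when ready.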

7.4 Systemd unit

Create /etc/systemd/system/otelcol-contrib.service:

[Unit]
Description=OpenTelemetry Collector Contrib
Wants=network-online.target
After=network-online.target

[Service]
User=otelcol
Group=otelcol
Type=simple
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/config.yml
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

7.5 Start

sudo systemctl daemon-reload
sudo systemctl enable --now otelcol-contrib
sudo systemctl status otelcol-contrib --no-pager
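
A running service does not prove that the receiver accepts data. One way to check is to post a single hand-built span to the OTLP/HTTP endpoint and look for it in the collector log. This is a minimal sketch: the function names, the service name smoke-test, and the span name are made up for illustration, and it assumes curl, od, and /dev/urandom are available.

```shell
# Print N random bytes as lowercase hex (2*N characters).
rand_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Build a minimal OTLP/JSON trace payload containing one span.
# Trace IDs are 16 bytes and span IDs are 8 bytes, both hex-encoded.
otlp_test_payload() {
  local now trace_id span_id
  now="$(date +%s)"
  trace_id="$(rand_hex 16)"
  span_id="$(rand_hex 8)"
  cat <<EOF
{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"scope":{"name":"manual"},"spans":[{"traceId":"$trace_id","spanId":"$span_id","name":"smoke-span","kind":1,"startTimeUnixNano":"${now}000000000","endTimeUnixNano":"${now}500000000"}]}]}]}
EOF
}

# Send the payload to the collector's OTLP/HTTP traces endpoint.
# $1 = collector host (defaults to mon-01 from this guide).
send_test_trace() {
  local host="${1:-mon-01}"
  otlp_test_payload | curl -fsS -X POST "http://$host:4318/v1/traces" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

After calling send_test_trace, sudo journalctl -u otelcol-contrib -n 20 should show a line from the exporter that writes traces to the log.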

7.6 Add collector metrics to Prometheus

Add this job under scrape_configs in /etc/prometheus/prometheus.yml, at the same indentation as the other jobs:

  - job_name: "otelcol"
    static_configs:
      - targets: ["mon-01:8889"]

Reload Prometheus.

8) Validate everything end-to-end

8.1 Prometheus targets

Open:

  • http://mon-01:9090/targets

Confirm these targets are UP:

  • node (app-01, app-02)
  • blackbox jobs
  • otelcol (if added)

8.2 Alerts

Open:

  • http://mon-01:9090/alerts

Confirm the rules are loaded.

8.3 Alertmanager

Open:

  • http://mon-01:9093

Confirm it receives alerts when you intentionally trigger one.
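
One way to trigger an alert on purpose, without stopping a real service, is to post a synthetic alert straight to Alertmanager's v2 API. The sketch below assumes curl is installed. The function names, the alert name ManualTestAlert, and the label values are made up for illustration.

```shell
# Build a JSON array with one synthetic alert for Alertmanager's v2 API.
# $1 = alert name (defaults to ManualTestAlert).
test_alert_payload() {
  local name="${1:-ManualTestAlert}"
  printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"manual-test"},"annotations":{"summary":"Manual test alert"}}]\n' "$name"
}

# POST the alert to Alertmanager.
# $1 = Alertmanager host (defaults to mon-01 from this guide).
send_test_alert() {
  local host="${1:-mon-01}"
  test_alert_payload | curl -fsS -X POST "http://$host:9093/api/v2/alerts" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

After send_test_alert, the alert should appear in the Alertmanager UI and be routed to the default receiver, so expect a real notification. Because no end time is set, it resolves on its own once resolve_timeout has passed.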

8.4 Grafana

Open:

  • http://mon-01:3000

Confirm the Prometheus data source is working and the dashboards show data.

8.5 Node Exporter reachability

From mon-01:

curl -s http://app-01:9100/metrics | head
curl -s http://app-02:9100/metrics | head

8.6 Blackbox probe check

From mon-01:

curl -s "http://127.0.0.1:9115/probe?module=icmp&target=app-01" | head

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his secret Santa, go with a Star Wars Lego box!
