This how-to sets up a complete monitoring stack for a small distributed system:
- Node Exporter (host metrics)
- Prometheus (scrape, store, alert rules)
- Alertmanager (route alerts)
- Grafana (dashboards)
- Blackbox Exporter (HTTP, TCP, ICMP probes)
- OpenTelemetry Collector (receive traces and export them)
Assumptions:
- Linux with systemd (Ubuntu/Debian style paths)
- One monitoring host: mon-01
- Two app nodes: app-01, app-02
- DNS or /etc/hosts resolves those names
- Ports open:
  - Node Exporter 9100 on app nodes
  - Prometheus 9090, Alertmanager 9093, Grafana 3000, Blackbox 9115, OTel Collector 4317/4318 on mon-01
- Replace example domains, emails, and endpoints
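The name-resolution assumption is easy to check before installing anything. A minimal sketch, assuming getent is available (it normally is on Ubuntu and Debian); check_names is a hypothetical helper name, not part of any tool used in this guide:

```sh
#!/bin/sh
# check_names NAME...: print one line per name and return non-zero if any
# name cannot be resolved through the system resolver (DNS or /etc/hosts).
# check_names is a hypothetical helper used only for this sketch.
check_names() {
  rc=0
  for name in "$@"; do
    if getent hosts "$name" >/dev/null 2>&1; then
      echo "ok $name"
    else
      echo "unresolved $name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, run on mon-01 and on each app node:
#   check_names mon-01 app-01 app-02
```

Any "unresolved" line means the host needs a DNS record or an /etc/hosts entry before the scrape targets below will work.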
1) Node Exporter on every Linux node (app-01, app-02)
1.1 Create user
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true
1.2 Install binary
cd /tmp
# put node_exporter-*.linux-amd64.tar.gz on the server
tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo chmod 0755 /usr/local/bin/node_exporter
1.3 Systemd unit
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
1.4 Start
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head
2) Prometheus on mon-01
2.1 Create user and dirs
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /etc/prometheus/rules /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
2.2 Install binaries
cd /tmp
# put prometheus-*.linux-amd64.tar.gz on mon-01
tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus prometheus-*/promtool /usr/local/bin/
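# consoles/ and console_libraries/ are optional and may be absent from newer release tarballs; skip the next line if so
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus
2.3 Prometheus config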
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "app-01:9100"
          - "app-02:9100"
        labels:
          env: "prod"
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service-a.example.com/health
          - https://service-b.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: "blackbox-tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - db-01.example.com:5432
          - mq-01.example.com:5672
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: "blackbox-icmp"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - app-01
          - app-02
          - db-01.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
2.4 Prometheus systemd unit
Create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
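ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
# SIGHUP makes Prometheus re-read its config and rules, so "systemctl reload prometheus" works
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
2.5 Start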
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager
Verify:
http://mon-01:9090/targets
3) Alertmanager on mon-01
3.1 Create user and dirs
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
3.2 Install binaries
cd /tmp
# put alertmanager-*.linux-amd64.tar.gz on mon-01
tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
3.3 Configure Alertmanager
Create /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  group_by: ["alertname", "instance", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "email-default"
receivers:
  - name: "email-default"
    email_configs:
      - to: "REPLACE_ME@example.com"
        from: "REPLACE_ME@example.com"
        smarthost: "smtp.example.com:587"
        auth_username: "REPLACE_ME@example.com"
        auth_password: "REPLACE_ME"
        send_resolved: true
3.4 Systemd unit
Create /etc/systemd/system/alertmanager.service:
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
3.5 Start
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager
Verify:
http://mon-01:9093
4) Prometheus alert rules (host + probes)
Create /etc/prometheus/rules/distributed.yml:
groups:
  - name: distributed
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Host down: {{ $labels.instance }}"
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
      - alert: HostLowMemAvailable
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Low MemAvailable on {{ $labels.instance }}"
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
      - alert: HostDiskBusy
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: "Disk busy on {{ $labels.instance }}"
      - alert: ProbeHttpDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 1m
        labels: {severity: critical}
        annotations:
          summary: "HTTP probe failed: {{ $labels.instance }}"
      - alert: ProbeTcpDown
        expr: probe_success{job="blackbox-tcp"} == 0
        for: 1m
        labels: {severity: critical}
        annotations:
          summary: "TCP probe failed: {{ $labels.instance }}"
      - alert: ProbeIcmpDown
        expr: probe_success{job="blackbox-icmp"} == 0
        for: 1m
        labels: {severity: warning}
        annotations:
          summary: "ICMP probe failed: {{ $labels.instance }}"
Validate and reload Prometheus:
sudo promtool check rules /etc/prometheus/rules/distributed.yml
sudo promtool check config /etc/prometheus/prometheus.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus
5) Blackbox Exporter on mon-01
5.1 Create user and dirs
sudo useradd --no-create-home --shell /usr/sbin/nologin blackbox || true
sudo mkdir -p /etc/blackbox_exporter
sudo chown -R blackbox:blackbox /etc/blackbox_exporter
5.2 Install binary
cd /tmp
# put blackbox_exporter-*.linux-amd64.tar.gz on mon-01
tar -xzf blackbox_exporter-*.linux-amd64.tar.gz
sudo cp blackbox_exporter-*/blackbox_exporter /usr/local/bin/
sudo chown blackbox:blackbox /usr/local/bin/blackbox_exporter
sudo chmod 0755 /usr/local/bin/blackbox_exporter
5.3 Config
Create /etc/blackbox_exporter/blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
5.4 Systemd unit
Create /etc/systemd/system/blackbox_exporter.service:
[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=blackbox
Group=blackbox
Type=simple
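# The icmp prober needs raw sockets; grant only that capability to the non-root user
AmbientCapabilities=CAP_NET_RAW
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/blackbox_exporter/blackbox.yml \
  --web.listen-address=":9115"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
5.5 Start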
sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter
curl -s "http://127.0.0.1:9115/metrics" | head
Verify one probe (probe_success 1 means the probe passed):
curl -s "http://127.0.0.1:9115/probe?module=http_2xx&target=https://example.com" | grep '^probe_success'
6) Grafana on mon-01
6.1 Install (Debian/Ubuntu repo method)
Install the grafana package from the Grafana APT repository, or from your usual package source, then:
sudo systemctl enable --now grafana-server
sudo systemctl status grafana-server --no-pager
6.2 Add Prometheus datasource
In the Grafana UI at http://mon-01:3000:
- Connections -> Data sources -> Add data source -> Prometheus
- URL: http://localhost:9090
- Save & test
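The same data source can also be created without the UI through Grafana's HTTP API. The sketch below assumes a fresh install that still accepts the default admin:admin login; the helper name and the data source name "Prometheus" are illustrative choices, not requirements:

```sh
# prometheus_ds_json URL: print the JSON body for Grafana's POST /api/datasources.
# The helper name and the data source name are hypothetical.
prometheus_ds_json() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

# Usage (change the credentials if the admin password was already reset):
#   prometheus_ds_json http://localhost:9090 |
#     curl -s -u admin:admin -H 'Content-Type: application/json' \
#       -X POST --data @- http://mon-01:3000/api/datasources
```

Scripting this step is mainly useful when the monitoring host is rebuilt often; for a one-off setup the UI steps above are just as good.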
6.3 Import dashboards
Import a dashboard for each exporter:
- Node Exporter full
- Blackbox exporter
The easiest way to find them is to search the Grafana dashboard library for:
- Node Exporter
- Prometheus Blackbox Exporter
7) OpenTelemetry Collector on mon-01 (traces receiver)
This sets up a collector that receives OTLP traces over gRPC and HTTP. It also exposes its own Prometheus metrics.
7.1 Create user and dirs
sudo useradd --no-create-home --shell /usr/sbin/nologin otelcol || true
sudo mkdir -p /etc/otelcol /var/lib/otelcol
sudo chown -R otelcol:otelcol /etc/otelcol /var/lib/otelcol
7.2 Install binary
cd /tmp
# put otelcol-contrib_*_linux_amd64.tar.gz on mon-01
tar -xzf otelcol-contrib_*_linux_amd64.tar.gz
sudo cp otelcol-contrib /usr/local/bin/otelcol-contrib
sudo chown otelcol:otelcol /usr/local/bin/otelcol-contrib
sudo chmod 0755 /usr/local/bin/otelcol-contrib
7.3 Collector config
Create /etc/otelcol/config.yml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
exporters:
  # debug replaces the older logging exporter, which recent collector releases no longer ship
  debug:
    verbosity: basic
# The collector serves its own metrics in Prometheus format on port 8888 by default,
# so no prometheus exporter is needed for them.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
This writes trace summaries to the collector's log only. Replace debug with your tracing backend's exporter when ready.
7.4 Systemd unit
Create /etc/systemd/system/otelcol-contrib.service:
[Unit]
Description=OpenTelemetry Collector Contrib
Wants=network-online.target
After=network-online.target
[Service]
User=otelcol
Group=otelcol
Type=simple
ExecStart=/usr/local/bin/otelcol-contrib --config=/etc/otelcol/config.yml
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
7.5 Start
sudo systemctl daemon-reload
sudo systemctl enable --now otelcol-contrib
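sudo systemctl status otelcol-contrib --no-pager
7.6 Add collector metrics to Prometheus
Add under scrape_configs in /etc/prometheus/prometheus.yml (the collector serves its own metrics on port 8888 by default, and Prometheus runs on the same host):
  - job_name: "otelcol"
    static_configs:
      - targets: ["127.0.0.1:8888"]
Reload Prometheus.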
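To confirm that the traces pipeline from 7.3 accepts data, a hand-built span can be posted to the OTLP/HTTP receiver on port 4318. This is only a sketch: the function name, the service name, and the fixed trace and span IDs are arbitrary illustration values.

```sh
# otlp_span_json NAME START_SECONDS: print a one-span OTLP/JSON trace payload.
# Timestamps are nanoseconds since the epoch, written as strings; the span
# lasts half a second. The IDs are fixed hex strings (32 and 16 characters).
otlp_span_json() {
  printf '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},'
  printf '"scopeSpans":[{"scope":{"name":"manual"},"spans":[{'
  printf '"traceId":"5b8efff798038103d269b633813fc60c","spanId":"eee19b7ec3c1b174",'
  printf '"name":"%s","kind":1,' "$1"
  printf '"startTimeUnixNano":"%s000000000","endTimeUnixNano":"%s500000000"}]}]}]}\n' "$2" "$2"
}

# Usage:
#   otlp_span_json smoke-span "$(date +%s)" |
#     curl -s -X POST -H 'Content-Type: application/json' \
#       --data @- http://mon-01:4318/v1/traces
#   sudo journalctl -u otelcol-contrib -n 20 --no-pager
```

If the pipeline works, the exporter configured in 7.3 should log that it handled a span shortly after the request.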
8) Validate everything end-to-end
8.1 Prometheus targets
Open:
http://mon-01:9090/targets
Confirm these targets are UP:
- node (app-01, app-02)
- blackbox jobs
- otelcol (if added)
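The same check can be scripted against the Prometheus HTTP API. A rough sketch, assuming the API's compact JSON output, so that an empty instant-query result appears literally as "result":[]; the helper name is hypothetical:

```sh
# no_down_targets: read the JSON reply of an instant query for 'up == 0' on
# stdin; succeed only if the query succeeded and returned no series.
no_down_targets() {
  reply=$(cat)
  case "$reply" in
    *'"status":"success"'*'"result":[]'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   curl -s -G http://mon-01:9090/api/v1/query --data-urlencode 'query=up == 0' |
#     no_down_targets && echo "all targets up" || echo "some targets are down"
```

A target that is missing from prometheus.yml entirely will not show up in this query, so it is still worth comparing the targets page against the list above.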
8.2 Alerts
Open:
http://mon-01:9090/alerts
Confirm rules loaded.
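Whether a given rule is loaded can also be checked from the shell through the rules endpoint of the Prometheus API. A minimal sketch that relies on plain text matching against the compact JSON reply rather than a real JSON parser; rule_loaded is a hypothetical helper:

```sh
# rule_loaded NAME: read the reply of GET /api/v1/rules on stdin and succeed
# if a rule (or rule group) with exactly that name is present.
rule_loaded() {
  grep -q "\"name\":\"$1\""
}

# Usage:
#   for r in HostDown HostHighCPU ProbeHttpDown; do
#     curl -s http://mon-01:9090/api/v1/rules | rule_loaded "$r" || echo "missing: $r"
#   done
```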
8.3 Alertmanager
Open:
http://mon-01:9093
Confirm it receives alerts when you intentionally trigger one.
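One way to trigger an alert on purpose is to post a synthetic one straight to Alertmanager's v2 API, which exercises routing and email delivery without touching Prometheus or the hosts. A sketch with a hypothetical helper and made-up label values:

```sh
# test_alert_json NAME SEVERITY: print a one-element alert list suitable for
# POST /api/v2/alerts. The instance and job label values are placeholders.
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"manual-test","job":"manual"},' "$1" "$2"
  printf '"annotations":{"summary":"Synthetic alert, safe to ignore"}}]\n'
}

# Usage:
#   test_alert_json PipelineTest warning |
#     curl -s -X POST -H 'Content-Type: application/json' \
#       --data @- http://mon-01:9093/api/v2/alerts
```

The alert should appear in the Alertmanager UI right away and be sent to the receiver after group_wait. Because no end time is set, Alertmanager treats it as resolved once resolve_timeout passes without a repeat.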
8.4 Grafana
Open:
http://mon-01:3000
Confirm the Prometheus data source is working, and the dashboards show data.
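For a quick scripted check that Grafana itself is up, its health endpoint can be polled. This sketch only confirms that the server and its database respond; it says nothing about whether dashboards show data. The helper name is hypothetical:

```sh
# grafana_db_ok: read the reply of GET /api/health on stdin; succeed if the
# "database" field is "ok" (whitespace around the colon is tolerated).
grafana_db_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage:
#   curl -s http://mon-01:3000/api/health | grafana_db_ok && echo "grafana healthy"
```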
8.5 Node Exporter reachability
From mon-01:
curl -s http://app-01:9100/metrics | head
curl -s http://app-02:9100/metrics | head
8.6 Blackbox probe check
From mon-01 (probe_success 1 means the probe passed):
curl -s "http://127.0.0.1:9115/probe?module=icmp&target=app-01" | grep '^probe_success'
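To run the probe check across every ICMP target in one go, the result line can be tested in a small loop. A sketch, assuming the target names from the blackbox-icmp job above; probe_passed is a hypothetical helper:

```sh
# probe_passed: read Blackbox Exporter /probe output on stdin; succeed if it
# contains the sample line "probe_success 1".
probe_passed() {
  grep -q '^probe_success 1$'
}

# Usage, from mon-01:
#   for t in app-01 app-02 db-01.example.com; do
#     curl -s "http://127.0.0.1:9115/probe?module=icmp&target=$t" | probe_passed \
#       && echo "ok $t" || echo "FAILED $t"
#   done
```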