This is a concrete build guide for a microservices monitoring architecture on Linux, using a stack that fits the monitoring ecosystem we used earlier:
- Prometheus (metrics storage and querying)
- Alertmanager (alert routing and grouping)
- Grafana (dashboards)
- Node Exporter (Linux host metrics)
- cAdvisor (container level CPU, memory, network, filesystem)
- Blackbox Exporter (synthetic probes: HTTP, TCP, ICMP)
- OpenTelemetry Collector (optional but recommended) for traces and unified telemetry routing
If you run microservices directly on bare metal or VMs as systemd services, you can skip cAdvisor. If you run containers, keep it.
This guide is written for tech-savvy readers but stays step-by-step.
Target architecture
On each Linux node that runs microservices
- node_exporter on :9100
- cAdvisor on :8080 (containers only)
- optional OpenTelemetry Collector on :4317/:4318 if you will do tracing later
Central monitoring host
- Prometheus on :9090
- Alertmanager on :9093
- Grafana on :3000
- Blackbox exporter on :9115
Prometheus scrapes exporters and evaluates alerts. Grafana queries Prometheus. Alertmanager sends alerts.
0) Prereqs
- One monitoring server (VM is fine) running Linux
- Linux microservices nodes reachable from monitoring server
- Firewall rules:
  - monitoring host can reach nodes on 9100 and 8080 and any blackbox probe targets
  - your admin IP can reach Grafana 3000, Prometheus 9090, Alertmanager 9093
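Before installing anything, it can help to confirm the firewall rules from the monitoring host. The following is a minimal sketch, assuming bash and the coreutils timeout command are available; the function names port_open and check_nodes are made up for illustration, and srv-01/srv-02 are the example node names used later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: check TCP reachability of exporter ports from the monitoring host.
# port_open and check_nodes are illustrative helper names, not part of any tool.

# port_open HOST PORT: exit 0 if a TCP connection succeeds within 3 seconds.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_nodes "PORTS" HOST...: print "host:port open" or "host:port closed"
# for every host and port combination.
check_nodes() {
  local ports=$1
  shift
  local host port
  for host in "$@"; do
    for port in $ports; do
      if port_open "$host" "$port"; then
        echo "$host:$port open"
      else
        echo "$host:$port closed"
      fi
    done
  done
}
```

For example, running check_nodes "9100 8080" srv-01 srv-02 on the monitoring host prints one line per host and port. Until the exporters are installed, "closed" is expected even when the firewall is correct, so this check is most useful again after steps 1 and 2.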
1) Install Node Exporter on every Linux microservices node
1.1 Create user and install binary
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true
cd /tmp
# copy the tarball to the server, then:
tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
1.2 Systemd service
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Enable and start
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head
This covers CPU, memory, disk, and network at host level.
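Piping to head only shows that the endpoint answers. A slightly stronger check is to confirm that the metric families used by the alert rules in this guide are actually exported. This is a minimal sketch; missing_metrics is a hypothetical helper name, and it assumes the metrics were first saved to a file.

```shell
#!/usr/bin/env bash
# Sketch: list required metric names that have no sample line in a saved
# /metrics dump. missing_metrics is an illustrative helper name.

# missing_metrics FILE NAME...: print each NAME that has no sample in FILE.
# A sample line starts with the metric name followed by "{" or a space, so
# "# HELP" and "# TYPE" comment lines do not count.
missing_metrics() {
  local file=$1
  shift
  local name
  for name in "$@"; do
    grep -Eq "^${name}(\{| )" "$file" || echo "$name"
  done
}
```

One way to use it: save the output with curl -s http://127.0.0.1:9100/metrics > /tmp/node.prom, then run missing_metrics /tmp/node.prom node_cpu_seconds_total node_memory_MemAvailable_bytes node_filesystem_avail_bytes node_disk_io_time_seconds_total node_network_receive_errs_total. Empty output means every listed metric was found.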
2) If you run containers: install cAdvisor on each node
cAdvisor exposes container CPU, memory, network, and filesystem metrics.
2.1 Run cAdvisor with Docker
docker run -d --name=cadvisor --restart=always \
  --privileged \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest
Verify
curl -s http://127.0.0.1:8080/metrics | head
If you do not use Docker, you can run cAdvisor under containerd as well, but Docker is the simplest baseline.
3) Install Prometheus on the monitoring host
3.1 Create user and dirs
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
3.2 Install binaries
cd /tmp
tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus prometheus-*/promtool /usr/local/bin/
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus
3.3 Prometheus config with jobs for node_exporter and cAdvisor
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "srv-01:9100"
          - "srv-02:9100"
  - job_name: "cadvisor"
    static_configs:
      - targets:
          - "srv-01:8080"
          - "srv-02:8080"
3.4 Systemd service
Create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
Check targets: http://monitoring-01:9090/targets
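The targets page can also be checked from a terminal through the Prometheus HTTP API. The sketch below summarizes target health without extra tools. It assumes the API returns compact JSON (no spaces around colons), and health_summary is a made-up helper name; a proper JSON parser is the more robust choice if one is installed.

```shell
#!/usr/bin/env bash
# Sketch: count scrape targets per health state from /api/v1/targets output.
# health_summary is an illustrative helper name.

# health_summary: read the targets JSON on stdin and print "<state> <count>"
# for each health state found (for example "up 4").
health_summary() {
  grep -o '"health":"[a-z]*"' \
    | cut -d'"' -f4 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

On the monitoring host, curl -s http://127.0.0.1:9090/api/v1/targets | health_summary should print only an "up" line once all exporters are reachable. A "down" line usually points back to the firewall rules in the prereqs or to a mistyped target in prometheus.yml.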
4) Install Alertmanager on the monitoring host
4.1 Install
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
cd /tmp
tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
4.2 Basic config
Create /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  group_by: ["alertname", "instance", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
receivers:
  - name: "default"
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.example.com:587"
        auth_username: "[email protected]"
        auth_password: "REPLACE_ME"
        send_resolved: true
4.3 Systemd service
Create /etc/systemd/system/alertmanager.service:
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
5) Add microservices-ready alert rules
Create /etc/prometheus/rules/microservices-core.yml:
groups:
  - name: microservices-core
    rules:
      # Host CPU saturation
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage > 90% for 10m."
      # Host memory pressure indicator
      - alert: HostLowMemAvailable
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Low MemAvailable on {{ $labels.instance }}"
          description: "MemAvailable < 10% for 10m."
      # Disk space low
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Filesystem free < 10% for 15m."
      # Disk busy time is consistently high (contention signal)
      - alert: HostDiskBusy
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
        for: 15m
        labels: {severity: warning}
        annotations:
          summary: "Disk device busy on {{ $labels.instance }}"
          description: "Device busy > 60% of time for 15m."
      # Network errors and drops
      - alert: HostNetworkRxErrors
        expr: rate(node_network_receive_errs_total[5m]) > 0
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Network RX errors on {{ $labels.instance }}"
          description: "Receive errors increasing."
      - alert: HostNetworkRxDrops
        expr: rate(node_network_receive_drop_total[5m]) > 0
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Network RX drops on {{ $labels.instance }}"
          description: "Receive drops increasing."
Validate and reload:
sudo promtool check rules /etc/prometheus/rules/microservices-core.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus
This gives you sane baseline alerting for the infrastructure layer.
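Rules only help if notifications arrive, and the thresholds above may not fire for weeks. One way to exercise the path from Alertmanager to the receiver is to post a synthetic alert to Alertmanager's v2 API. This is a minimal sketch, assuming Alertmanager listens on 127.0.0.1:9093 as configured in step 4; the helper names, the alert name, and the manual-test instance label are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: push a hand-made alert into Alertmanager to test notification delivery.
# test_alert_payload and send_test_alert are illustrative helper names.

# test_alert_payload NAME: print a one-element JSON array in the shape the
# /api/v2/alerts endpoint accepts. Only labels are set; with no end time given,
# Alertmanager treats the alert as resolved after resolve_timeout.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"manual-test"}}]\n' "$1"
}

# send_test_alert NAME [BASE_URL]: POST the payload to Alertmanager.
send_test_alert() {
  test_alert_payload "$1" \
    | curl -sS -X POST -H 'Content-Type: application/json' --data @- \
        "${2:-http://127.0.0.1:9093}/api/v2/alerts"
}
```

Running send_test_alert PipelineTest on the monitoring host should produce an email after roughly the configured group_wait of 30s. If nothing arrives, journalctl -u alertmanager is the first place to look for SMTP errors.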
6) Add synthetic monitoring for service endpoints with Blackbox Exporter
Microservices issues are frequently dependency or network path related. Add synthetic probes.
6.1 Install and run Blackbox Exporter
On monitoring host:
sudo useradd --no-create-home --shell /usr/sbin/nologin blackbox || true
sudo mkdir -p /etc/blackbox_exporter
sudo chown -R blackbox:blackbox /etc/blackbox_exporter
cd /tmp
tar -xzf blackbox_exporter-*.linux-amd64.tar.gz
sudo cp blackbox_exporter-*/blackbox_exporter /usr/local/bin/
sudo chown blackbox:blackbox /usr/local/bin/blackbox_exporter
Create /etc/blackbox_exporter/blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
Systemd service /etc/systemd/system/blackbox_exporter.service:
[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=blackbox
Group=blackbox
Type=simple
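# The icmp prober needs raw sockets; grant CAP_NET_RAW to the unprivileged user.
AmbientCapabilities=CAP_NET_RAW
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/blackbox_exporter/blackbox.yml \
  --web.listen-address=":9115"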
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start:
sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter
6.2 Configure Prometheus to scrape Blackbox probes
Add to /etc/prometheus/prometheus.yml under scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service-a.example.com/health
          - https://service-b.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
Reload Prometheus.
This gives you:
- availability checks
- latency at the edge
- a clean signal that maps to user experience
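The relabeling above makes Prometheus call the exporter's /probe endpoint with the real URL passed as the target parameter. The same request can be made by hand to debug a failing probe. This is a minimal sketch, assuming the exporter listens on 127.0.0.1:9115 as configured in 6.1; run_probe and probe_success_value are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: run a single blackbox probe by hand and read its result.
# run_probe and probe_success_value are illustrative helper names.

# run_probe TARGET [MODULE] [EXPORTER_URL]: request /probe the same way
# Prometheus does after relabeling; curl URL-encodes the target for us.
run_probe() {
  curl -sS -G \
    --data-urlencode "target=$1" \
    --data-urlencode "module=${2:-http_2xx}" \
    "${3:-http://127.0.0.1:9115}/probe"
}

# probe_success_value: read /probe output on stdin and print the value of the
# probe_success sample (1 for success, 0 for failure).
probe_success_value() {
  awk '$1 == "probe_success" { print $2 }'
}
```

For example, run_probe https://service-a.example.com/health | probe_success_value prints 1 when the probe passed. When it prints 0, appending debug=true to the probe request returns the exporter's probe log, which usually names the failing step.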
7) Install Grafana and build a minimal dashboard set
Install Grafana on monitoring host, then:
7.1 Add Prometheus data source
- URL: http://localhost:9090
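As an alternative to clicking through the UI, the data source can be kept in a provisioning file so it survives a reinstall. The sketch below writes such a file. It assumes a package-based Grafana install, where provisioning files live under /etc/grafana/provisioning/datasources; the function name write_prometheus_datasource and the data source name "Prometheus" are chosen here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: write a Grafana data source provisioning file for Prometheus.
# write_prometheus_datasource is an illustrative helper name.

# write_prometheus_datasource [DIR]: create DIR/prometheus.yml. DIR defaults to
# the provisioning directory used by Grafana's Linux packages (an assumption;
# adjust it if Grafana was installed another way).
write_prometheus_datasource() {
  local dir=${1:-/etc/grafana/provisioning/datasources}
  mkdir -p "$dir"
  cat > "$dir/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Run the function as root, then restart Grafana (sudo systemctl restart grafana-server on package installs) so the file is picked up.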
7.2 Build three dashboards (keep it simple)
Dashboard A: Host overview
- CPU usage by mode
- Load average
- MemAvailable ratio
- Disk free per filesystem
- Network bytes in/out
- Network drops/errors
Dashboard B: Container overview (if using cAdvisor)
- Container CPU usage
- Container memory usage and limits
- Container restarts (if you also scrape kube metrics later)
- Container network throughput
Dashboard C: Synthetic checks
- Probe success
- Probe duration
- Probe DNS and connect timings (blackbox provides these)
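The panels above map to a handful of PromQL expressions. The sketch below collects starting points for some of them and shows how to try one against the Prometheus API before building the panel. The expressions are suggestions rather than the only reasonable choice, the panel keys and function names are made up for illustration, and the container query assumes cAdvisor's Docker-style name label.

```shell
#!/usr/bin/env bash
# Sketch: starter PromQL expressions for the three dashboards.
# panel_query and run_query are illustrative helper names.

# panel_query KEY: print a starting expression for a panel, or fail if KEY is unknown.
panel_query() {
  case "$1" in
    host_cpu_by_mode)     echo 'sum by (instance, mode) (rate(node_cpu_seconds_total[5m]))' ;;
    host_load)            echo 'node_load1' ;;
    host_mem_available)   echo 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes' ;;
    host_net_rx_bytes)    echo 'rate(node_network_receive_bytes_total[5m])' ;;
    container_cpu)        echo 'sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))' ;;
    container_memory)     echo 'container_memory_working_set_bytes{name!=""}' ;;
    probe_success)        echo 'probe_success' ;;
    probe_duration)       echo 'probe_duration_seconds' ;;
    *) return 1 ;;
  esac
}

# run_query EXPR [PROMETHEUS_URL]: evaluate EXPR as an instant query.
run_query() {
  curl -sS -G --data-urlencode "query=$1" "${2:-http://localhost:9090}/api/v1/query"
}
```

For example, run_query "$(panel_query host_mem_available)" on the monitoring host returns the current ratio per instance as JSON. Once the result looks right, the same expression can be pasted into a Grafana panel.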