Microservices Monitoring Architecture

This is a concrete build guide for a microservices monitoring architecture on Linux, using a stack that fits the monitoring ecosystem we used earlier: node_exporter, cAdvisor, Prometheus, Alertmanager, Grafana, and Blackbox exporter.

If you run microservices on bare metal or VMs as systemd services, you can skip cAdvisor. If you run containers, keep it.

This guide is written for tech-savvy readers but stays step-by-step.

Target architecture

On each Linux node that runs microservices

  • node_exporter on :9100
  • cAdvisor on :8080 (containers only)
  • optional OpenTelemetry Collector on :4317/:4318 if you plan to add tracing later

Central monitoring host

  • Prometheus on :9090
  • Alertmanager on :9093
  • Grafana on :3000
  • Blackbox exporter on :9115

Prometheus scrapes exporters and evaluates alerts. Grafana queries Prometheus. Alertmanager sends alerts.

0) Prereqs

  • One monitoring server (VM is fine) running Linux
  • Linux microservices nodes reachable from monitoring server
  • Firewall rules:
    • monitoring host can reach nodes on 9100 and 8080 and any blackbox probe targets
    • your admin IP can reach Grafana 3000, Prometheus 9090, Alertmanager 9093
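
The firewall rules above can be checked from the monitoring host with a small helper. This is a minimal sketch, not part of the original stack: the function names `required_endpoints` and `check_endpoints` are made up for illustration, and the check assumes `curl` is installed. Before steps 1 and 2 are done every endpoint will report FAIL, so it is most useful once the exporters are running.

```shell
# Print the host:port pairs the monitoring host must be able to reach.
# Usage: required_endpoints <with_containers: yes|no> <node>...
required_endpoints() {
  with_containers=$1
  shift
  for node in "$@"; do
    echo "${node}:9100"                # node_exporter
    if [ "$with_containers" = "yes" ]; then
      echo "${node}:8080"              # cAdvisor (container nodes only)
    fi
  done
}

# Read host:port pairs on stdin and try to fetch /metrics from each one,
# giving up on the connection attempt after 3 seconds.
check_endpoints() {
  while read -r endpoint; do
    if curl -s -o /dev/null --connect-timeout 3 "http://${endpoint}/metrics"; then
      echo "OK   ${endpoint}"
    else
      echo "FAIL ${endpoint}"
    fi
  done
}

# Example, run on the monitoring host:
#   required_endpoints yes srv-01 srv-02 | check_endpoints
```

A FAIL for a node whose exporter is confirmed running locally usually points at a firewall rule rather than at the exporter.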

1) Install Node Exporter on every Linux microservices node

1.1 Create user and install binary

sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true
cd /tmp
# copy the tarball to the server, then:
tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

1.2 Systemd service

Create /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Enable and start

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
curl -s http://127.0.0.1:9100/metrics | head

This covers CPU, memory, disk, and network at host level.
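
To confirm that the host-level metrics used later in this guide are actually exposed, the scrape output can be filtered for a few metric names. The sketch below is illustrative only: `has_metric` and `missing_metrics` are hypothetical helper names, and the list of metrics is simply the set this guide relies on for alerts and dashboards.

```shell
# Succeeds if the metrics text on stdin contains at least one sample
# for the given metric name (with or without labels).
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Read a scrape on stdin and report which expected metrics are absent.
missing_metrics() {
  scrape=$(cat)
  for m in node_cpu_seconds_total node_memory_MemAvailable_bytes \
           node_filesystem_avail_bytes node_network_receive_bytes_total; do
    printf '%s\n' "$scrape" | has_metric "$m" || echo "missing: $m"
  done
}

# Example, run on a microservices node:
#   curl -s http://127.0.0.1:9100/metrics | missing_metrics
```

No output means all four metric families are present.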

2) If you run containers: install cAdvisor on each node

cAdvisor exposes container CPU, memory, network, and filesystem metrics.

2.1 Run cAdvisor with Docker

docker run -d --name=cadvisor --restart=always \
  --privileged \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest

Verify

curl -s http://127.0.0.1:8080/metrics | head

If you do not use Docker, you can run cAdvisor under containerd as well, but Docker is the simplest baseline.

3) Install Prometheus on the monitoring host

3.1 Create user and dirs

sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

3.2 Install binaries

cd /tmp
tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus prometheus-*/promtool /usr/local/bin/
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo chown -R prometheus:prometheus /etc/prometheus

3.3 Prometheus config with jobs for node_exporter and cAdvisor

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "srv-01:9100"
          - "srv-02:9100"
  - job_name: "cadvisor"
    static_configs:
      - targets:
          - "srv-01:8080"
          - "srv-02:8080"
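
With more than a handful of nodes, typing target lists by hand gets error-prone. One option is to generate each job from a host list. The helper below is a sketch under that assumption: `render_job` is a hypothetical name, and it only prints a static job in the same shape as the config above.

```shell
# Print one static scrape job in the layout used by prometheus.yml.
# Usage: render_job <job_name> <port> <host>...
render_job() {
  job=$1
  port=$2
  shift 2
  printf '  - job_name: "%s"\n' "$job"
  printf '    static_configs:\n'
  printf '      - targets:\n'
  for host in "$@"; do
    printf '          - "%s:%s"\n' "$host" "$port"
  done
}

# Example:
#   render_job node 9100 srv-01 srv-02
#   render_job cadvisor 8080 srv-01 srv-02
```

Whichever way the file is produced, `promtool check config /etc/prometheus/prometheus.yml` validates it before Prometheus is started.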

3.4 Systemd service

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

Check targets: http://monitoring-01:9090/targets
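
The same check can be scripted against the Prometheus HTTP API, which is handy on a headless host. This is a rough sketch: `count_targets` is a hypothetical helper that counts `health` values in the JSON from `/api/v1/targets` with `grep` instead of a real JSON parser, so treat it as a quick sanity check only.

```shell
# Count active targets in a given health state ("up", "down" or "unknown")
# from the /api/v1/targets JSON supplied on stdin.
count_targets() {
  grep -oE "\"health\"[[:space:]]*:[[:space:]]*\"$1\"" | wc -l | tr -d ' '
}

# Example, run on the monitoring host:
#   curl -s http://127.0.0.1:9090/api/v1/targets | count_targets down
```

A non-zero count for `down` means at least one exporter is unreachable from the monitoring host.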

4) Install Alertmanager on the monitoring host

4.1 Install

sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
cd /tmp
tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool

4.2 Basic config

Create /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "instance", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"

receivers:
  - name: "default"
    email_configs:
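      - to: "oncall@example.com"                  # placeholder, replace with your recipient
        from: "alertmanager@example.com"          # placeholder sender address
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com" # placeholder SMTP user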
        auth_password: "REPLACE_ME"
        send_resolved: true

4.3 Systemd service

Create /etc/systemd/system/alertmanager.service:

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
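
Before relying on real alerts, it helps to push a manual test alert through Alertmanager and confirm the email arrives. The sketch below posts to the v2 API at `/api/v2/alerts`. The helper names `test_alert_payload` and `send_test_alert`, the alert name, and the `manual-test` job label are all made up for this example.

```shell
# Build a JSON payload for Alertmanager's POST /api/v2/alerts endpoint.
# Usage: test_alert_payload <alertname> <instance>
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","job":"manual-test","severity":"warning"},' "$1" "$2"
  printf '"annotations":{"summary":"Manual test alert, safe to ignore"}}]\n'
}

# Send the payload to the local Alertmanager (requires curl).
send_test_alert() {
  test_alert_payload "$1" "$2" |
    curl -s -X POST -H 'Content-Type: application/json' \
      --data-binary @- http://127.0.0.1:9093/api/v2/alerts
}

# Example, run on the monitoring host:
#   send_test_alert PipelineTest srv-01
```

With the config above, the notification should go out after the 30s `group_wait`. Because the payload carries no end time, the alert is treated as resolved once `resolve_timeout` (5m) passes without it being re-sent.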

5) Add microservices-ready alert rules

Create /etc/prometheus/rules/microservices-core.yml:

groups:
- name: microservices-core
  rules:

  # Host CPU saturation
  - alert: HostHighCPU
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage > 90% for 10m."

  # Host memory pressure indicator
  - alert: HostLowMemAvailable
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "Low MemAvailable on {{ $labels.instance }}"
      description: "MemAvailable < 10% for 10m."

  # Disk space low
  - alert: HostDiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    for: 15m
    labels: {severity: warning}
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
      description: "Filesystem free < 10% for 15m."

  # Disk busy time is consistently high (contention signal)
  - alert: HostDiskBusy
    expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
    for: 15m
    labels: {severity: warning}
    annotations:
      summary: "Disk device busy on {{ $labels.instance }}"
      description: "Device busy > 60% of time for 15m."

  # Network errors and drops
  - alert: HostNetworkRxErrors
    expr: rate(node_network_receive_errs_total[5m]) > 0
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "Network RX errors on {{ $labels.instance }}"
      description: "Receive errors increasing."

  - alert: HostNetworkRxDrops
    expr: rate(node_network_receive_drop_total[5m]) > 0
    for: 10m
    labels: {severity: warning}
    annotations:
      summary: "Network RX drops on {{ $labels.instance }}"
      description: "Receive drops increasing."

Validate and reload:

sudo promtool check rules /etc/prometheus/rules/microservices-core.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus

This gives you sane baseline alerting for the infrastructure layer.
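
The HostHighCPU expression can look opaque at first: it takes the per-second growth of the idle counter (the idle fraction) and subtracts it from 100. The worked example below reproduces that arithmetic for a single core from two counter samples. `cpu_usage_percent` is a hypothetical helper, and the result only approximates what Prometheus computes, since `rate()` also handles counter resets and extrapolates at the window edges.

```shell
# Reproduce the HostHighCPU arithmetic for one core.
# Usage: cpu_usage_percent <idle_seconds_at_start> <idle_seconds_at_end> <window_seconds>
cpu_usage_percent() {
  awk -v a="$1" -v b="$2" -v w="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((b - a) / w) * 100 }'
}

# Example: the idle counter grew by 15s over a 300s (5m) window,
# so the core was idle 5% of the time:
#   cpu_usage_percent 1000 1015 300    # prints 95.0
```

A value of 95.0 is above the 90 threshold, so the alert would fire once it has held for the full 10m `for` period.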

6) Add synthetic monitoring for service endpoints with Blackbox Exporter

Microservice issues are often caused by a failing dependency or a broken network path rather than by the host itself. Synthetic probes exercise those paths from the outside.

6.1 Install and run Blackbox Exporter

On monitoring host:

sudo useradd --no-create-home --shell /usr/sbin/nologin blackbox || true
sudo mkdir -p /etc/blackbox_exporter
sudo chown -R blackbox:blackbox /etc/blackbox_exporter
cd /tmp
tar -xzf blackbox_exporter-*.linux-amd64.tar.gz
sudo cp blackbox_exporter-*/blackbox_exporter /usr/local/bin/
sudo chown blackbox:blackbox /usr/local/bin/blackbox_exporter

Create /etc/blackbox_exporter/blackbox.yml:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"
      valid_http_versions: ["HTTP/1.1","HTTP/2"]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s

Systemd service /etc/systemd/system/blackbox_exporter.service:

[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=blackbox
Group=blackbox
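Type=simple
# The icmp module needs raw-socket access when running as a non-root user
AmbientCapabilities=CAP_NET_RAW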
ExecStart=/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox_exporter/blackbox.yml --web.listen-address=":9115"
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start:

sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter

6.2 Configure Prometheus to scrape Blackbox probes

Add to /etc/prometheus/prometheus.yml under scrape_configs:

  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]

    static_configs:
      - targets:
          - https://service-a.example.com/health
          - https://service-b.example.com/health

    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

Reload Prometheus.
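
The relabeling is easier to follow once you see the request it produces: each listed URL becomes the `target` query parameter, the `instance` label keeps the original URL, and the scrape itself goes to the exporter on 127.0.0.1:9115. The sketch below rebuilds that request by hand. `probe_url` and `probe_succeeded` are hypothetical helpers. Prometheus URL-encodes the parameter, while this sketch passes it as-is, which works for simple URLs like the ones above.

```shell
# Print the probe URL that the relabeling above results in.
# Usage: probe_url <module> <target>
probe_url() {
  printf 'http://127.0.0.1:9115/probe?module=%s&target=%s\n' "$1" "$2"
}

# Succeeds if the probe output on stdin reports probe_success 1.
probe_succeeded() {
  grep -q '^probe_success 1$'
}

# Example, run on the monitoring host:
#   curl -s "$(probe_url http_2xx https://service-a.example.com/health)" \
#     | probe_succeeded && echo "service-a is up"
```

Running the probe manually like this is a quick way to separate a failing service from a misconfigured scrape job.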

This gives you:

  • availability checks
  • latency at the edge
  • a clean signal that maps to user experience

7) Install Grafana and build a minimal dashboard set

Install Grafana on monitoring host, then:

7.1 Add Prometheus data source

  • URL: http://localhost:9090
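
If you prefer files over clicking through the UI, Grafana can also load the data source from a provisioning file. The following is a minimal sketch: `write_datasource` is a hypothetical helper, and the directory in the example is the usual default for package installs, which may differ on your system.

```shell
# Write a Grafana data source provisioning file into the given directory.
# Usage: write_datasource <dir>
write_datasource() {
  mkdir -p "$1"
  cat > "$1/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}

# Example, run as root on the monitoring host:
#   write_datasource /etc/grafana/provisioning/datasources
#   systemctl restart grafana-server
```

Keeping this file in version control makes the monitoring host easier to rebuild.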

7.2 Build three dashboards (keep it simple)

Dashboard A: Host overview

  • CPU usage by mode
  • Load average
  • MemAvailable ratio
  • Disk free per filesystem
  • Network bytes in/out
  • Network drops/errors

Dashboard B: Container overview (if using cAdvisor)

  • Container CPU usage
  • Container memory usage and limits
  • Container restarts (if you also scrape kube metrics later)
  • Container network throughput

Dashboard C: Synthetic checks

  • Probe success
  • Probe duration
  • Probe DNS and connect timings (blackbox provides these)
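
As a starting point for those panels, here are a few PromQL queries wrapped in a small lookup helper so they can be tried from the command line before being pasted into Grafana. The panel names and the functions `panel_query` and `run_panel_query` are made up for this sketch. The metric names are the ones exposed by node_exporter, cAdvisor, and Blackbox exporter, and the container query assumes cAdvisor's `name` label, which is set for Docker containers.

```shell
# Print a starting-point PromQL query for a named panel.
panel_query() {
  case "$1" in
    cpu_by_mode)    echo 'sum by (instance, mode) (rate(node_cpu_seconds_total[5m]))' ;;
    mem_available)  echo 'node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes' ;;
    container_cpu)  echo 'sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))' ;;
    probe_success)  echo 'probe_success' ;;
    probe_duration) echo 'probe_duration_seconds' ;;
    *) echo "unknown panel: $1" >&2; return 1 ;;
  esac
}

# Run a panel query against the Prometheus HTTP API (requires curl).
run_panel_query() {
  q=$(panel_query "$1") || return 1
  curl -s --get --data-urlencode "query=${q}" http://localhost:9090/api/v1/query
}

# Example, run on the monitoring host:
#   run_panel_query mem_available
```

If a query returns an empty result here, the matching Grafana panel will be empty too, which usually means the exporter behind it is not being scraped.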

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his Secret Santa, go with a Star Wars Lego box!