Linux Monitoring Stack How-To: Prometheus, Node Exporter, Alertmanager, Grafana

This guide sets up a practical monitoring ecosystem for CPU, memory, disk I/O, and network on Linux servers using:

  • Prometheus for metrics collection and storage
  • Node Exporter on each Linux server for OS and hardware metrics
  • Grafana for dashboards
  • Alertmanager for alert routing and grouping

You can run everything on one monitoring VM or split it later. Steps below assume Ubuntu/Debian, but the same approach works on most Linux distros with minor package and service path differences.

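Architecture

Node Exporter runs on every monitored server and exposes metrics on port 9100. Prometheus on monitoring-01 scrapes those endpoints, stores the series, evaluates the alert rules, and forwards firing alerts to Alertmanager, which groups and routes them. Grafana queries Prometheus to draw dashboards.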

0) Prereqs and conventions

  • Monitoring host: monitoring-01
  • Monitored servers: srv-01, srv-02, etc.
  • Ports:
    • Node Exporter: 9100
    • Prometheus: 9090
    • Alertmanager: 9093
    • Grafana: 3000

Firewall minimum:

  • Allow monitoring-01 to reach srv-* on TCP 9100
  • Allow your admin IPs to reach monitoring-01 on 9090, 9093, 3000
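
The two firewall rules above could be expressed with ufw roughly as follows. This is a sketch under assumptions: ufw is the firewall in use, and the IP addresses in the usage note are hypothetical. The functions only print commands so you can review them before running anything.

```shell
# Print (do not run) the ufw rule for a monitored server:
# only the monitoring host may reach Node Exporter on TCP 9100.
# $1 = IP of monitoring-01 (hypothetical value supplied by the caller)
node_exporter_fw_rules() {
  printf 'ufw allow from %s to any port 9100 proto tcp\n' "$1"
}

# Print the ufw rules for monitoring-01 itself:
# each admin IP may reach Prometheus, Alertmanager, and Grafana.
# $@ = one or more admin IPs (hypothetical values supplied by the caller)
monitoring_fw_rules() {
  for admin_ip in "$@"; do
    for port in 9090 9093 3000; do
      printf 'ufw allow from %s to any port %s proto tcp\n' "$admin_ip" "$port"
    done
  done
}
```

For example, `node_exporter_fw_rules 10.0.0.5` prints one rule for a srv-* host, and `monitoring_fw_rules 192.0.2.10` prints three rules for monitoring-01. Once the output looks right, pipe it to `sudo sh`.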

Run each component as its own dedicated, non-login system user (recommended). The steps below create these users.

1) Install Node Exporter on each Linux server

1.1 Create a user and directories

sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true
sudo mkdir -p /opt/node_exporter
cd /opt/node_exporter

1.2 Download and install Node Exporter

Download the latest Linux amd64 release from a source you trust, copy the tarball into /opt/node_exporter on the server, then unpack and install it:

sudo tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
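
If you fetch the tarball straight from the project's GitHub releases, a checksum check before unpacking is cheap insurance. A minimal sketch: the function names are made up for illustration, and the version in the usage note is only an example, not necessarily the latest release.

```shell
# Build the GitHub release URL for a Prometheus-project tarball.
# $1 = project (node_exporter, prometheus, alertmanager)
# $2 = version without the leading "v" (caller's choice)
release_url() {
  printf 'https://github.com/prometheus/%s/releases/download/v%s/%s-%s.linux-amd64.tar.gz\n' \
    "$1" "$2" "$1" "$2"
}

# Verify a tarball against the sha256sums.txt file published with the release.
# Both files must be in the current directory. $1 = tarball file name.
# Returns non-zero if the hash differs or the file is not listed.
verify_tarball() {
  grep " $1\$" sha256sums.txt | sha256sum -c -
}
```

Usage, assuming version 1.8.2 purely as an example: `curl -fsSLO "$(release_url node_exporter 1.8.2)"`, download sha256sums.txt from the same release page, then `verify_tarball node_exporter-1.8.2.linux-amd64.tar.gz`. The same helper works for the Prometheus and Alertmanager tarballs used later.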

1.3 Systemd service

Create /etc/systemd/system/node_exporter.service:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter --no-pager

Verify locally:

curl -s http://127.0.0.1:9100/metrics | head
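
Looking at `head` only proves the endpoint answers. To confirm that the metric families used later in this guide are actually present, something like the following sketch can help. The function name is illustrative, and the list covers only the metrics referenced by the alert rules in section 4.

```shell
# Read Node Exporter output on stdin and report any expected metric
# family that is missing. Returns 0 if all are present, 1 otherwise.
check_node_metrics() {
  metrics=$(cat)
  missing=0
  for name in node_cpu_seconds_total node_memory_MemAvailable_bytes \
              node_filesystem_avail_bytes node_disk_io_time_seconds_total \
              node_network_receive_bytes_total; do
    # Sample lines start with the metric name; "# HELP" lines do not.
    if ! printf '%s\n' "$metrics" | grep -q "^$name"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage: `curl -s http://127.0.0.1:9100/metrics | check_node_metrics && echo "all core metrics present"`.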

That is enough to expose:

  • CPU: usage by mode, load averages, context switches
  • Memory: MemAvailable, swap, paging stats
  • Disk: filesystem space, diskstats (I/O), iowait context via CPU metrics
  • Network: interface bytes, errors, drops, TCP stats

2) Install Prometheus on the monitoring host

2.1 Create user and directories

sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

2.2 Download Prometheus and install binaries

cd /tmp
sudo tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus /usr/local/bin/
sudo cp prometheus-*/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
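# Prometheus 2.x tarballs only: 3.x releases no longer ship these example directories, so skip this line there
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/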
sudo chown -R prometheus:prometheus /etc/prometheus

2.3 Prometheus config

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "srv-01:9100"
          - "srv-02:9100"
        labels:
          env: "prod"

Create rules directory:

sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules

2.4 Systemd service for Prometheus

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start:

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager

Verify:

  • Open Prometheus UI on http://monitoring-01:9090
  • Check Status -> Targets and confirm nodes are UP
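
The same check can be scripted against the Prometheus HTTP API, which is handy over SSH or in a smoke test. A small sketch: the function name is illustrative, and it relies on the compact JSON that the targets endpoint returns rather than on a JSON parser.

```shell
# Count scrape targets by health in the JSON from /api/v1/targets (stdin).
# $1 = "up" or "down"
count_targets() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}
```

Usage: `curl -s http://127.0.0.1:9090/api/v1/targets | count_targets down` should print 0 once every node is being scraped successfully.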

3) Add alerting with Alertmanager

3.1 Install Alertmanager

sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
cd /tmp
sudo tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager /usr/local/bin/
sudo cp alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool

3.2 Alertmanager config

Create /etc/alertmanager/alertmanager.yml (example routes to email, replace with your receiver of choice):

global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "instance", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"

receivers:
  - name: "default"
    email_configs:
      - to: "oncall@example.com"           # placeholder, use your own address
        from: "alertmanager@example.com"   # placeholder, use your own address
        smarthost: "smtp.example.com:587"
        auth_username: "alertmanager@example.com"
        auth_password: "REPLACE_ME"
        send_resolved: true

3.3 Systemd service

Create /etc/systemd/system/alertmanager.service:

[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Start:

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager
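
Before relying on real rules, it helps to push one hand-made alert through Alertmanager and confirm the notification arrives. A minimal sketch using the v2 alerts endpoint: the function name, alert name, and label values are made up for the test.

```shell
# Build a JSON payload for Alertmanager's POST /api/v2/alerts endpoint.
# $1 = alert name, $2 = instance label (both arbitrary test values)
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"warning"},"annotations":{"summary":"Manual test alert"}}]\n' \
    "$1" "$2"
}
```

Usage: `curl -s -XPOST -H 'Content-Type: application/json' -d "$(test_alert_payload PipelineTest srv-01:9100)" http://127.0.0.1:9093/api/v2/alerts`. With the route above, the email should go out after group_wait (30s). Running `amtool check-config /etc/alertmanager/alertmanager.yml` first catches syntax errors in the config.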

4) Create alert rules for CPU, memory, disk I/O, network

Create /etc/prometheus/rules/linux-core.yml:

groups:
- name: linux-core
  rules:

  # CPU: sustained high usage (every non-idle mode, including iowait, counts as "usage" here)
  - alert: HostHighCPU
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage > 90% for 10m."

  # CPU: high iowait indicates storage bottleneck symptoms
  - alert: HostHighIOWait
    expr: (avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100) > 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU iowait on {{ $labels.instance }}"
      description: "IO wait > 10% for 10m (often disk contention or slow storage)."

  # Memory: low MemAvailable
  - alert: HostLowMemoryAvailable
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low available memory on {{ $labels.instance }}"
      description: "MemAvailable < 10% for 10m."

  # Swap: sustained swap in use (this rule does not look at paging activity)
  - alert: HostSwapInUse
    expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Swap usage high on {{ $labels.instance }}"
      description: "Swap usage > 25% for 15m."

  # Disk space: filesystem almost full (excludes tmpfs and overlay)
  - alert: HostDiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Disk space low on {{ $labels.instance }}"
      description: "Filesystem free < 10% for 15m."

  # Disk I/O: elevated disk time (busy) per device, rough contention signal
  - alert: HostDiskDeviceBusy
    expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Disk device busy on {{ $labels.instance }}"
      description: "Disk io_time > 0.6s/s for 15m (device is busy most of the time)."

  # Network: interface receive errors or drops increasing
  - alert: HostNetworkRxErrors
    expr: rate(node_network_receive_errs_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Network receive errors on {{ $labels.instance }}"
      description: "Network RX errors increasing."

  - alert: HostNetworkRxDrops
    expr: rate(node_network_receive_drop_total[5m]) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Network receive drops on {{ $labels.instance }}"
      description: "Network RX drops increasing."

Validate and reload:

sudo promtool check config /etc/prometheus/prometheus.yml
sudo promtool check rules /etc/prometheus/rules/linux-core.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus

Notes for sane alerting:

  • Start with warning thresholds and longer for: durations
  • Alert on symptoms that require action, not on every spike
  • Add labels like team, service, env once you have ownership mapping
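
When tuning thresholds, it helps to see what the CPU and disk expressions compute from raw counters. The sketch below mirrors the arithmetic of HostHighCPU and HostDiskDeviceBusy on two counter samples. It is an approximation: Prometheus's rate() also handles counter resets and extrapolates to the window edges. The function names and sample numbers are illustrative.

```shell
# Mirrors: 100 - (avg rate of idle seconds) * 100
# $1 = idle seconds at window start, $2 = at window end
#      (both averaged across cores), $3 = window length in seconds
cpu_usage_pct() {
  awk -v a="$1" -v b="$2" -v w="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((b - a) / w) * 100 }'
}

# Mirrors: rate(node_disk_io_time_seconds_total[5m])
# $1 = io_time seconds at window start, $2 = at window end, $3 = window seconds
# The result is the fraction of the window the device spent doing I/O.
disk_busy_fraction() {
  awk -v a="$1" -v b="$2" -v w="$3" \
    'BEGIN { printf "%.2f\n", (b - a) / w }'
}
```

For example, if the idle counter advances by 30s over a 300s window, `cpu_usage_pct 1000 1030 300` prints 90.0, right at the HostHighCPU threshold. If io_time advances by 210s over the same window, `disk_busy_fraction 500 710 300` prints 0.70, which is above the 0.6 threshold.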

5) Install Grafana and build dashboards

Install Grafana from your distro repo or Grafana’s packages. Then:

  1. Open Grafana UI on http://monitoring-01:3000
  2. Add Prometheus data source:
    • URL: http://localhost:9090
  3. Import a Node Exporter dashboard:
    • In Grafana, go to Dashboards -> Import
    • Search for a Node Exporter dashboard in Grafana’s dashboard library
    • Select your Prometheus data source
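
Step 2 can also be done without the UI by using Grafana's file-based provisioning, which keeps the data source reproducible. A sketch under assumptions: Grafana was installed from a package that reads /etc/grafana/provisioning/datasources, and the function name and output file name are arbitrary.

```shell
# Write a Grafana data source provisioning file for the local Prometheus.
# $1 = target directory (created if missing)
write_grafana_datasource() {
  dir="$1"
  mkdir -p "$dir" || return 1
  cat > "$dir/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
}
```

Usage: generate the file in a scratch directory with `write_grafana_datasource /tmp/ds`, copy it into place with `sudo cp /tmp/ds/prometheus.yml /etc/grafana/provisioning/datasources/`, then run `sudo systemctl restart grafana-server`.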

Recommended panels to keep front-and-center:

  • CPU usage by mode (user, system, iowait, idle)
  • Load average vs CPU cores
  • Memory: MemAvailable, cache, swap used, major page faults if available
  • Disk: filesystem free %, disk busy time, read/write throughput
  • Network: bytes in/out, drops, errors, retransmits, if you collect them

6) Disk I/O and network depth

Node Exporter covers most needs. Two practical additions are worth considering when you want more clarity:

A) Add network reachability and latency checks (optional)

Use Blackbox Exporter to probe:

  • ICMP ping latency and packet loss to critical endpoints
  • HTTP checks for service health

This helps with true network symptoms, not just interface counters.
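
If you go this route, each probe type becomes one extra scrape job that points Prometheus at the exporter's /probe endpoint. The helper below prints such a job, indented to paste under scrape_configs in prometheus.yml. It is a sketch that assumes Blackbox Exporter runs on monitoring-01 on its default port 9115 with the stock http_2xx and icmp modules. The function name and the target URL in the usage note are illustrative.

```shell
# Print a Blackbox Exporter scrape job for prometheus.yml.
# $1 = blackbox module name (e.g. http_2xx or icmp)
# remaining args = endpoints to probe
blackbox_scrape_job() {
  module="$1"
  shift
  cat <<EOF
  - job_name: "blackbox_${module}"
    metrics_path: /probe
    params:
      module: [${module}]
    static_configs:
      - targets:
EOF
  for t in "$@"; do
    printf '          - "%s"\n' "$t"
  done
  # Relabeling passes each target as ?target=..., keeps it as the instance
  # label, and sends the actual scrape to the exporter itself.
  cat <<'EOF'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "127.0.0.1:9115"
EOF
}
```

Usage: `blackbox_scrape_job http_2xx https://example.com`, then paste the output into prometheus.yml and validate with promtool before reloading. ICMP probes usually need extra privileges for the exporter process, so HTTP checks are the easier place to start.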

B) Add per-service metrics later

For databases and apps, add their exporters once the base host layer is stable. Do not start there; get the node metrics working first.

7) What “good” looks like after setup

After you finish, you should be able to answer these quickly:

  • CPU: Is the box actually CPU-bound, or just waiting on I/O?
  • Memory: Is it real pressure, or just cache usage?
  • Disk: Is storage getting slower before it fills up?
  • Network: Are there drops and errors, or rising latency between dependencies?

If an incident happens, you can correlate:

  • latency and errors at the service level (later)
  • with CPU iowait, disk device busy time, memory pressure, and network drops
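
During an incident, the host-side half of that correlation comes down to a handful of queries against one instance. A sketch that reuses the expressions from the alert rules in section 4: the function names are illustrative, and the Prometheus address assumes you run it on monitoring-01.

```shell
# Print the PromQL queries worth checking first for one host.
# $1 = instance label value, e.g. srv-01:9100
triage_queries() {
  inst="$1"
  cat <<EOF
avg(rate(node_cpu_seconds_total{mode="iowait",instance="$inst"}[5m])) * 100
rate(node_disk_io_time_seconds_total{instance="$inst"}[5m])
node_memory_MemAvailable_bytes{instance="$inst"} / node_memory_MemTotal_bytes{instance="$inst"}
rate(node_network_receive_drop_total{instance="$inst"}[5m])
EOF
}

# Run each query against the Prometheus instant-query API and print raw JSON.
run_triage() {
  triage_queries "$1" | while IFS= read -r q; do
    curl -s -G --data-urlencode "query=$q" http://127.0.0.1:9090/api/v1/query
    echo
  done
}
```

Usage: `run_triage srv-01:9100`. The four results correspond to iowait percent, per-device busy fraction, available memory ratio, and receive drops per second.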

That is the core ecosystem.

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his Secret Santa, go with a Star Wars Lego box!
