This guide sets up a practical monitoring ecosystem for CPU, memory, disk I/O, and network on Linux servers using:
- Prometheus for metrics collection and storage
- Node Exporter on each Linux server for OS and hardware metrics
- Grafana for dashboards
- Alertmanager for alert routing and grouping
You can run everything on one monitoring VM or split it later. Steps below assume Ubuntu/Debian, but the same approach works on most Linux distros with minor package and service path differences.
Architecture
- Each Linux server runs node_exporter, which exposes metrics over HTTP
- A central Prometheus scrapes the node_exporter endpoints (pull model)
- Prometheus evaluates alert rules
- Alertmanager receives alerts and sends notifications (email, Slack, PagerDuty, OpsGenie, etc.)
- Grafana queries Prometheus and renders dashboards
0) Prereqs and conventions
- Monitoring host: monitoring-01
- Monitored servers: srv-01, srv-02, etc.
- Ports:
  - Node Exporter: 9100
  - Prometheus: 9090
  - Alertmanager: 9093
  - Grafana: 3000
Firewall minimum:
- Allow monitoring-01 to reach srv-* on TCP 9100
- Allow your admin IPs to reach monitoring-01 on 9090, 9093, 3000
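As a rough sketch of the two firewall rules above, the helper functions below only print candidate rules so they can be reviewed before anything is applied. They assume ufw is the firewall front end, which this guide does not prescribe; the function names and the example IP addresses are hypothetical.

```shell
#!/bin/sh
# Print (do not apply) ufw rules matching the port layout in this guide.
# Assumption: ufw is in use. The IPs passed in are placeholders you supply.

# For a monitored server: only the monitoring host may reach node_exporter.
node_rules() {
  monitoring_ip="$1"
  echo "ufw allow proto tcp from ${monitoring_ip} to any port 9100"
}

# For the monitoring host: an admin IP may reach Prometheus,
# Alertmanager, and Grafana.
monitoring_rules() {
  admin_ip="$1"
  for port in 9090 9093 3000; do
    echo "ufw allow proto tcp from ${admin_ip} to any port ${port}"
  done
}
```

For example, `node_rules 10.0.0.10` prints a single rule for port 9100. Once the output looks right, it can be piped to `sudo sh` on the relevant host.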
Run each component under its own dedicated, unprivileged user (recommended). The install steps below create these users.
1) Install Node Exporter on each Linux server
1.1 Create a user and directories
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter || true
sudo mkdir -p /opt/node_exporter
cd /opt/node_exporter
1.2 Download and install Node Exporter
Download the latest Linux amd64 release tarball from a source you trust and copy it to /opt/node_exporter on the server. Example:
sudo tar -xzf node_exporter-*.linux-amd64.tar.gz
sudo cp node_exporter-*.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
1.3 Systemd service
Create /etc/systemd/system/node_exporter.service:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter --no-pager
Verify locally:
curl -s http://127.0.0.1:9100/metrics | head
That is enough to expose:
- CPU: usage by mode, load averages, context switches
- Memory: MemAvailable, swap, paging stats
- Disk: filesystem space, diskstats (I/O), iowait context via CPU metrics
- Network: interface bytes, errors, drops, TCP stats
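To spot-check that the four areas listed above are actually being exported, a small filter like the following can help. It is a minimal sketch: the function name is hypothetical, and the five metric names are one representative per area, not a complete list.

```shell
#!/bin/sh
# Report which expected metric families are absent from node_exporter output.
# Reads the /metrics text on stdin and prints one missing name per line;
# no output means every listed family was found.
missing_metrics() {
  metrics="$(cat)"
  for name in \
    node_cpu_seconds_total \
    node_memory_MemAvailable_bytes \
    node_filesystem_avail_bytes \
    node_disk_io_time_seconds_total \
    node_network_receive_bytes_total
  do
    # A sample line starts with the metric name followed by "{" or a space.
    printf '%s\n' "$metrics" | grep -Eq "^${name}[{ ]" || echo "$name"
  done
}
```

Usage on a monitored server: `curl -s http://127.0.0.1:9100/metrics | missing_metrics`.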
2) Install Prometheus on the monitoring host
2.1 Create user and directories
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
2.2 Download Prometheus and install binaries
cd /tmp
sudo tar -xzf prometheus-*.linux-amd64.tar.gz
sudo cp prometheus-*/prometheus /usr/local/bin/
sudo cp prometheus-*/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Skip the next line if your release tarball does not contain these directories
sudo cp -r prometheus-*/consoles prometheus-*/console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
2.3 Prometheus config
Create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets:
          - "srv-01:9100"
          - "srv-02:9100"
        labels:
          env: "prod"
Create rules directory:
sudo mkdir -p /etc/prometheus/rules
sudo chown -R prometheus:prometheus /etc/prometheus/rules
2.4 Systemd service for Prometheus
Create /etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=":9090"
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager
Verify:
- Open Prometheus UI on http://monitoring-01:9090
- Check Status -> Targets and confirm nodes are UP
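The same target check can be done from a terminal through Prometheus' HTTP API, which returns target state as JSON from /api/v1/targets. The filter below is a rough sketch with a hypothetical function name. It matches the compact "health":"<state>" text that Prometheus emits and is not a real JSON parser, so treat the counts as a quick sanity check only.

```shell
#!/bin/sh
# Summarise target health from the JSON returned by /api/v1/targets.
# Reads JSON on stdin and prints one line per state, e.g. "up 2".
# Assumption: the response uses the compact "health":"<state>" form.
target_health_summary() {
  grep -o '"health":"[a-z]*"' \
    | sed 's/"health":"\([a-z]*\)"/\1/' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Usage: `curl -s http://monitoring-01:9090/api/v1/targets | target_health_summary`. Any line starting with `down` or `unknown` means a target needs attention.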
3) Add alerting with Alertmanager
3.1 Install Alertmanager
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
cd /tmp
sudo tar -xzf alertmanager-*.linux-amd64.tar.gz
sudo cp alertmanager-*/alertmanager /usr/local/bin/
sudo cp alertmanager-*/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
3.2 Alertmanager config
Create /etc/alertmanager/alertmanager.yml (example routes to email, replace with your receiver of choice):
global:
  resolve_timeout: 5m
route:
  group_by: ["alertname", "instance", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
receivers:
  - name: "default"
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.example.com:587"
        auth_username: "[email protected]"
        auth_password: "REPLACE_ME"
        send_resolved: true
3.3 Systemd service
Create /etc/systemd/system/alertmanager.service:
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=":9093"
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Start:
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager
4) Create alert rules for CPU, memory, disk I/O, network
Create /etc/prometheus/rules/linux-core.yml:
groups:
  - name: linux-core
    rules:
      # CPU: sustained high usage (everything except idle, so iowait counts as "usage" here)
      - alert: HostHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage > 90% for 10m."
      # CPU: high iowait indicates storage bottleneck symptoms
      - alert: HostHighIOWait
        expr: (avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU iowait on {{ $labels.instance }}"
          description: "IO wait > 10% for 10m (often disk contention or slow storage)."
      # Memory: low MemAvailable
      - alert: HostLowMemoryAvailable
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low available memory on {{ $labels.instance }}"
          description: "MemAvailable < 10% for 10m."
      # Swap: sustained swap in use (this does not measure paging activity)
      - alert: HostSwapInUse
        expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage high on {{ $labels.instance }}"
          description: "Swap usage > 25% for 15m."
      # Disk space: filesystem almost full (exclude tmpfs and overlay)
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Filesystem free < 10% for 15m."
      # Disk I/O: elevated disk time (busy) per device, rough contention signal
      - alert: HostDiskDeviceBusy
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk device busy on {{ $labels.instance }}"
          description: "Disk io_time > 0.6s/s for 15m (device is busy most of the time)."
      # Network: interface receive errors or drops increasing
      - alert: HostNetworkRxErrors
        expr: rate(node_network_receive_errs_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Network receive errors on {{ $labels.instance }}"
          description: "Network RX errors increasing."
      - alert: HostNetworkRxDrops
        expr: rate(node_network_receive_drop_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Network receive drops on {{ $labels.instance }}"
          description: "Network RX drops increasing."
Validate and reload:
sudo promtool check config /etc/prometheus/prometheus.yml
sudo promtool check rules /etc/prometheus/rules/linux-core.yml
sudo systemctl reload prometheus || sudo systemctl restart prometheus
Notes for sane alerting:
- Start with warning thresholds and longer for: durations
- Alert on symptoms that require action, not on every spike
- Add labels like team, service, env once you have ownership mapping
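When tuning thresholds, it helps to see the arithmetic behind the HostHighCPU expression. The sketch below reproduces it for a single CPU from two readings of the idle-seconds counter. The function name and the sample numbers are illustrative; Prometheus itself averages this rate across all CPUs of an instance.

```shell
#!/bin/sh
# Worked arithmetic behind: 100 - avg(rate(idle seconds)) * 100
# Arguments: idle-seconds counter at t1, at t2, and elapsed seconds.
# For one CPU, (idle_t2 - idle_t1) / elapsed is the idle fraction,
# so everything else (user, system, iowait, ...) counts as usage.
cpu_usage_percent() {
  awk -v a="$1" -v b="$2" -v dt="$3" \
    'BEGIN { printf "%.1f\n", 100 - ((b - a) / dt) * 100 }'
}
```

For example, `cpu_usage_percent 1000 1030 300` prints `90.0`: 30 idle seconds in a 300-second window is 10% idle, which sits exactly on the 90% threshold. The rule fires only once usage has stayed above that threshold for the full 10-minute `for:` duration.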
5) Install Grafana and build dashboards
Install Grafana from your distro repo or Grafana’s packages. Then:
- Open Grafana UI on http://monitoring-01:3000
- Add Prometheus data source:
  - URL: http://localhost:9090
- Import a Node Exporter dashboard:
  - In Grafana, go to Dashboards -> Import
  - Search for a Node Exporter dashboard in Grafana’s dashboard library
  - Select your Prometheus data source
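If you would rather not click through the UI, the data source step can also be expressed as a Grafana provisioning file. The helper below is a minimal sketch that writes such a file into a directory you choose. The function name, the file name prometheus.yaml, and the data source name "Prometheus" are illustrative choices, not part of this guide.

```shell
#!/bin/sh
# Write a Grafana data source provisioning file for Prometheus.
# Arguments: target directory, Prometheus URL.
# Assumption: Grafana is configured to read provisioning files from
# the directory this file is eventually copied to.
write_prometheus_datasource() {
  dir="$1"
  url="$2"
  mkdir -p "$dir"
  cat > "$dir/prometheus.yaml" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${url}
    isDefault: true
EOF
}
```

For example, run `write_prometheus_datasource /tmp/ds http://localhost:9090` and review the file. On package-based installs the provisioning directory is typically /etc/grafana/provisioning/datasources; copy the file there with sudo and restart Grafana.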
Recommended panels to keep front-and-center:
- CPU usage by mode (user, system, iowait, idle)
- Load average vs CPU cores
- Memory: MemAvailable, cache, swap used, major page faults if available
- Disk: filesystem free %, disk busy time, read/write throughput
- Network: bytes in/out, drops, errors, and retransmits if you collect them
6) Disk I/O and network depth
Node Exporter covers most needs. Two practical additions are worth considering when you want more clarity:
A) Add network reachability and latency checks (optional)
Use Blackbox Exporter to probe:
- ICMP ping latency and packet loss to critical endpoints
- HTTP checks for service health
This helps with true network symptoms, not just interface counters.
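Before deploying Blackbox Exporter, the same two signals (packet loss and latency) can be spot-checked by hand. The filter below is a rough sketch that pulls them out of the summary printed by Linux (iputils) ping. The function name is hypothetical, and other ping implementations may format the summary differently.

```shell
#!/bin/sh
# Extract packet loss and average RTT from Linux (iputils) ping output.
# Reads ping output on stdin; prints "loss=<pct> avg_ms=<ms>".
# avg_ms is empty when no replies were received (no rtt line printed).
ping_summary() {
  awk '
    /packet loss/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /%$/) { v = $i; sub(/%/, "", v); loss = v }
    }
    /^rtt / {
      split($4, parts, "/")   # $4 holds min/avg/max/mdev values
      avg = parts[2]
    }
    END { printf "loss=%s avg_ms=%s\n", loss, avg }
  '
}
```

Usage: `ping -c 4 srv-01 | ping_summary`.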
B) Add per-service metrics later
For databases and apps, add their exporters once the base host layer is stable. Do not start there; get the node metrics working first.
7) What “good” looks like after setup
After you finish, you should be able to answer these quickly:
- CPU: Is the box actually CPU-bound or just waiting on I/O?
- Memory: Is it real pressure or just cache usage?
- Disk: Is storage getting slower before filling up?
- Network: Are there drops/errors or rising latency between dependencies?
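The memory question is the one most often misread, so here is a minimal sketch of how to answer it directly on a host. It reads /proc/meminfo and reports MemAvailable and page cache as shares of total memory. The function name is hypothetical, and "Cached" is only a rough proxy for reclaimable cache.

```shell
#!/bin/sh
# Separate real memory pressure from cache usage using /proc/meminfo.
# Reads meminfo text on stdin; prints available and cached share of total.
mem_breakdown() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "Cached:"       { cached = $2 }
    END {
      if (total > 0)
        printf "available=%.1f%% cached=%.1f%%\n",
               avail / total * 100, cached / total * 100
    }
  '
}
```

Usage: `mem_breakdown < /proc/meminfo`. A host with little free memory but a high available percentage is mostly holding cache. A low available percentage is the condition the HostLowMemoryAvailable rule watches for.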
If an incident happens, you can correlate:
- latency and errors at the service level (later)
- with CPU iowait, disk device busy time, memory pressure, and network drops
That is the core ecosystem.
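As a closing smoke test, the sketch below probes the health endpoints of the three services on the monitoring host, using the ports from section 0. The function names are hypothetical, curl is assumed to be installed, and the endpoints are the standard health paths of Prometheus, Alertmanager, and Grafana.

```shell
#!/bin/sh
# Health endpoints for the stack, using the ports from section 0.
# Prints one URL per line for the given monitoring host name.
stack_health_urls() {
  host="$1"
  echo "http://${host}:9090/-/healthy"   # Prometheus
  echo "http://${host}:9093/-/healthy"   # Alertmanager
  echo "http://${host}:3000/api/health"  # Grafana
}

# Probe each URL and print an OK or FAIL line for it. Requires curl.
check_stack() {
  stack_health_urls "$1" | while read -r url; do
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}
```

Usage: `check_stack monitoring-01`. A FAIL line usually points at a stopped service or a firewall rule blocking your admin IP.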