Stack
- Ubuntu 24.04 LTS
- Ollama · Hermes 3 8B
- CPU-only
Table of Contents
- Overview and architecture
- Prerequisites and hardware requirements
- Installing Ollama
- Verifying the service
- Pulling the Hermes 3 model
- Configuring Ollama
- Creating a custom Modelfile
- Testing the installation
- Troubleshooting
Overview and architecture
This guide walks through running Hermes 3, NousResearch’s flagship instruction-following and agentic LLM, locally on Ubuntu 24.04 using Ollama as the runtime layer. No GPU is required.
Understanding what each piece does helps when things go wrong:
| Component | Role | Installed to |
|---|---|---|
| Ollama | Downloads, serves, and manages GGUF model files. Exposes a REST API on 127.0.0.1:11434 and a CLI wrapper. Runs as a systemd service. | /usr/local/bin/ollama |
| Hermes 3 8B | The actual model weights (Q4_K_M GGUF, ~5 GB on disk). Built on Llama 3.1 8B, fine-tuned by NousResearch for stronger instruction following, multi-turn conversation, and agentic tool use. | /usr/share/ollama/.ollama/models/ when Ollama runs as a service; ~/.ollama/models/ when run as your own user |
| systemd unit | Keeps the Ollama server process alive across reboots. Created automatically by the installer. | /etc/systemd/system/ollama.service |
Why Hermes 3 8B specifically? Hermes 3 is the current default hermes3 tag in the Ollama library. The 8B variant runs on CPU-only hardware with 8–16 GB RAM, requires no driver setup, and outperforms Hermes 2 on reasoning, role adherence, and long-context coherence. A 70B variant exists for systems with 48 GB+ RAM or a capable GPU.
Prerequisites and hardware requirements
Before starting, confirm the following:
- Ubuntu 24.04 LTS (fresh install or existing), amd64 or arm64
- At least 8 GB RAM (16 GB recommended for comfortable headroom alongside the OS)
- At least 10 GB free disk space (the model download is ~5 GB; leave room for the binary and temp files)
- sudo access on the machine
- Internet connectivity (to download Ollama and the model)
RAM note: The 8B model at Q4_K_M quantization uses roughly 5 GB for weights. The KV cache adds 2–4 GB more depending on context length. On a system with exactly 8 GB RAM, close other memory-heavy processes before running inference.
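The thresholds in this section can be checked before installing anything. The following is a minimal pre-flight sketch, not part of Ollama: meets_gb is a hypothetical helper, the 7 GB RAM figure is an assumption taken from the lower end of the estimate above (5 GB of weights plus 2 GB of KV cache), and the disk check looks at the root filesystem, which may need adjusting if /usr/share sits on a separate mount.

```shell
# Hypothetical helper: succeeds when a value in kB is at least N decimal GB.
meets_gb() {
    [ "$1" -ge $(( $2 * 1000 * 1000 )) ]
}

# MemAvailable is reported in kB on Linux; stay at 0 if it cannot be read.
mem_kb=0
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
fi

# Free space on the root filesystem in kB (column 4 of POSIX-format df output).
disk_kb=$(df -Pk / | awk 'NR==2 {print $4}')

# Assumed thresholds: 7 GB RAM (5 GB weights + 2 GB KV cache), 10 GB disk.
if meets_gb "${mem_kb:-0}" 7; then echo "RAM:  ok"; else echo "RAM:  under 7 GB available"; fi
if meets_gb "${disk_kb:-0}" 10; then echo "Disk: ok"; else echo "Disk: under 10 GB free"; fi
```

A "RAM: under 7 GB available" result does not mean the model cannot run, only that other processes may need to be closed first.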
Install curl and zstd
The Ollama installer script requires both. zstd is used to extract the compressed binary archive.
sudo apt update
sudo apt install -y curl zstd
Installing Ollama
Ollama provides an official installer script. It detects your system architecture, downloads the correct binary, places it in /usr/local/bin, creates a dedicated ollama system user, and registers a systemd service.
Security practice: If you prefer to review scripts before running them, download first: curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then run sh install.sh when satisfied.
curl -fsSL https://ollama.com/install.sh | sh
The installer prints its progress to stdout. A successful run ends with a line confirming the service has been enabled and started.
Confirm the binary is in your path:
ollama --version
ollama version is 0.x.x
What the installer creates: A system user named ollama (no login shell) owns the service process. Models are stored under /usr/share/ollama/.ollama/models/ when run as a service, or under ~/.ollama/models/ when run as your own user. The installer adds your current user to the ollama group so you can interact with the CLI without sudo.
Verifying the service
Check that the systemd unit is active and set to start on boot:
sudo systemctl status ollama
The output should show active (running) and enabled:
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
Active: active (running) since ...
Main PID: 12345 (ollama)
Tasks: 9
Memory: 30.2M
CPU: 210ms
CGroup: /system.slice/ollama.service
└─12345 /usr/local/bin/ollama serve
Confirm the API is responding on localhost:
curl -fsS http://127.0.0.1:11434/api/version
{"version":"0.x.x"}
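The API check can also be scripted so that it reports a clear result instead of raw JSON. This is a sketch under assumptions: extract_version is a hypothetical helper that pulls the version field out of the response with sed (a real JSON parser would be more robust), and the 5-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: read JSON like {"version":"0.x.x"} on stdin, print the version.
extract_version() {
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

api_url="http://127.0.0.1:11434/api/version"

# Probe the endpoint only when curl is installed; give up after 5 seconds.
if command -v curl >/dev/null 2>&1; then
    if body=$(curl -fsS --max-time 5 "$api_url" 2>/dev/null); then
        echo "API up, version: $(printf '%s\n' "$body" | extract_version)"
    else
        echo "API not reachable at $api_url"
    fi
else
    echo "curl is not installed"
fi
```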
If either check fails, restart the service and check logs:
sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager
Pulling the Hermes 3 model
Ollama’s pull command downloads and caches the model. The default hermes3:8b tag fetches Hermes 3 (built on Llama 3.1 8B) at Q4_K_M quantization, a good balance of output quality and memory footprint for CPU-only systems.
ollama pull hermes3:8b
The download is approximately 4.9 GB. Progress is shown inline.
Verify the model is listed locally:
ollama list
NAME          ID              SIZE      MODIFIED
hermes3:8b    a1b2c3d4e5f6    4.9 GB    X seconds ago
Alternative model sizes:
hermes3:8b is the recommended starting point. If your system has significantly more RAM and you want higher quality at the cost of speed, hermes3:70b is available but requires 48+ GB of RAM for CPU-only inference and is impractically slow on most hardware without a GPU.
Configuring Ollama
Ollama reads configuration from environment variables. On systemd installations, the correct place to set these is a service override file, not your shell profile.
Common environment variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Interface and port the API listens on |
| OLLAMA_MODELS | /usr/share/ollama/.ollama/models | Directory where model files are stored |
| OLLAMA_NUM_PARALLEL | 1 | Number of parallel requests to handle |
| OLLAMA_MAX_LOADED_MODELS | 1 | Maximum models to keep in memory simultaneously |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model loaded after the last request |
Setting configuration via systemd override
Create an override directory for the Ollama service unit:
sudo mkdir -p /etc/systemd/system/ollama.service.d
Create the override file:
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model loaded for 10 minutes between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Restrict to localhost (default, but explicit is better)
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF
Reload systemd and restart Ollama to apply changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Exposing Ollama on the network:
By default, Ollama only listens on 127.0.0.1 and is not accessible from other machines. If you need network access, set OLLAMA_HOST=0.0.0.0:11434 and restrict access at the firewall level with ufw. Never expose port 11434 publicly without authentication in front of it.
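If network access is needed, the firewall restriction might look like the following. The sketch prints the ufw commands for review instead of running them; print_ufw_rules is a hypothetical helper and 192.168.1.0/24 is a placeholder subnet, not a value from this guide. ufw evaluates rules in order, so the allow rule is listed before the deny rule.

```shell
# Hypothetical helper: print ufw commands that admit one subnet to the Ollama
# port and reject everyone else. Nothing is executed; review, then run by hand.
print_ufw_rules() {
    subnet=$1
    port=${2:-11434}
    echo "sudo ufw allow from $subnet to any port $port proto tcp"
    echo "sudo ufw deny $port/tcp"
}

# Placeholder subnet; substitute your own LAN range.
print_ufw_rules 192.168.1.0/24
```

These rules only limit who can reach the port. They do not add authentication, so the warning above about public exposure still applies.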
Creating a custom Modelfile
A Modelfile lets you customise how Hermes 3 behaves: inject a persistent system prompt, adjust inference parameters, or create a named variant. This is optional but useful in practice.
Create a working directory and write a Modelfile:
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Base model to extend
FROM hermes3:8b
# System prompt - sets persistent context for all conversations
SYSTEM """
You are a helpful and precise AI assistant. Answer questions directly
and thoroughly. When writing code, prefer readability over brevity.
When unsure, say so rather than guessing.
"""
# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
EOF
Build the custom variant under a local name:
ollama create hermes3-custom -f ~/ollama-models/Modelfile.hermes3
Confirm it appears in the model list alongside the base:
ollama list
Key Modelfile parameters
| Parameter | Typical range | Effect |
|---|---|---|
| temperature | 0.0 – 1.0 | Higher values produce more varied output; lower values make it more deterministic |
| top_p | 0.7 – 0.95 | Controls nucleus sampling; lower values restrict to higher-probability tokens |
| num_ctx | 2048 – 131072 | Context window in tokens; larger values use more RAM |
| repeat_penalty | 1.0 – 1.3 | Penalises repeating the same tokens; values above 1.15 can hurt coherence |
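The same parameters can also be overridden for a single request through the options field of the REST API, which is handy for experimenting before committing values to a Modelfile. The sketch below only builds and prints the request body; build_payload is a hypothetical helper, and it assumes the prompt contains no double quotes or backslashes because it does no JSON escaping.

```shell
# Hypothetical helper: build a /api/generate request body with per-request options.
# Assumes the prompt needs no JSON escaping (no double quotes or backslashes).
build_payload() {
    model=$1
    prompt=$2
    temperature=$3
    num_ctx=$4
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":%s,"num_ctx":%s}}\n' \
        "$model" "$prompt" "$temperature" "$num_ctx"
}

payload=$(build_payload hermes3:8b "Name three sorting algorithms." 0.2 4096)
echo "$payload"

# To send it (requires the Ollama service to be running):
# curl -fsS http://127.0.0.1:11434/api/generate -d "$payload"
```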
Testing the installation
The following tests move from basic connectivity through to a meaningful generation check. Run them in order; each one builds confidence before the next.
Test 1 -> Service health
Verify Ollama is running and the API responds:
sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version
Test 2 -> Model is listed
Confirm the model was downloaded and is registered:
ollama list
Expected: at least hermes3:8b appears in the output.
Test 3 -> Single-shot CLI inference
Send a one-shot prompt from the command line. This confirms the model loads and generates output without hanging:
ollama run hermes3:8b "Explain what a transformer architecture is in two sentences."
On a CPU-only system, expect the first token in 5–30 seconds as the model loads into RAM, then streaming output at ~5–15 tokens per second depending on hardware. A response arriving at all means the model is functional.
Check memory usage during inference: Run free -h in a second terminal while inference is happening to confirm the model fits comfortably in RAM. If the system starts swapping, responses will slow dramatically and you may need to close other processes.
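To compare your hardware against the throughput range quoted for Test 3, the generation rate can be computed from the eval_count and eval_duration fields that the /api/generate response reports (the duration is in nanoseconds). The helper name tokens_per_sec is hypothetical, and the numbers in the example call are made up for illustration.

```shell
# Hypothetical helper: tokens per second from eval_count and eval_duration (ns).
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" \
        'BEGIN { if (ns > 0) printf "%.1f\n", n * 1e9 / ns; else print "n/a" }'
}

# Illustrative values only: 120 tokens generated in 12 seconds.
tokens_per_sec 120 12000000000
```

Running ollama run with the --verbose flag prints similar timing statistics after each response, which avoids the arithmetic altogether.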
Test 4 -> Interactive session
Open a multi-turn interactive session to test conversational context retention:
ollama run hermes3:8b
At the prompt, enter the following two messages in sequence:
>>> My name is Ivan.
Nice to meet you, Ivan. How can I help you today?
>>> What is my name?
Your name is Ivan.
Exit with /bye or Ctrl+D.
Test 5 -> Custom model variant
If you created the custom Modelfile in the Creating a custom Modelfile section, verify it runs and the system prompt is active:
ollama run hermes3-custom "Write a Python function to reverse a string."
Expect clean, readable code with comments, reflecting the system prompt instruction to prefer readability over brevity.
Test 6 -> Check model process state
After a successful inference, confirm the model is loaded in memory:
ollama ps
The PROCESSOR column will read 100% CPU on a CPU-only system. UNTIL reflects the OLLAMA_KEEP_ALIVE value set earlier. When this timer expires, the model is unloaded from RAM and reloads on the next request.
All tests passed?
If the service is active, the model is listed, single-shot and interactive prompts both produce sensible output, and ollama ps shows the model loaded, your installation is complete and working correctly.
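The non-interactive checks from Tests 1 and 2 can be bundled into a small smoke test for later re-runs, for example after an upgrade. This is a sketch, not an official tool: run_check is a hypothetical helper, and the three commands simply repeat the ones used earlier in this section with a 5-second curl timeout added as an assumption.

```shell
failures=0

# Hypothetical helper: run a command quietly and report PASS or FAIL.
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $name"
    else
        echo "FAIL  $name"
        failures=$((failures + 1))
    fi
}

run_check "service active" systemctl is-active --quiet ollama
run_check "API responding" curl -fsS --max-time 5 http://127.0.0.1:11434/api/version
run_check "model listed"   sh -c 'ollama list | grep -q "hermes3:8b"'

echo "$failures check(s) failed"
```

Generation quality still needs a human eye, so Tests 3 to 5 are deliberately left out of the script.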
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| curl to :11434 returns connection refused | Service not running | sudo systemctl start ollama then check journalctl -u ollama -n 30 |
| ollama command not found after install | Shell PATH not updated | Log out and back in, or run source ~/.bashrc. The binary is at /usr/local/bin/ollama. |
| Model pull stalls or fails mid-download | Network interruption | Re-run ollama pull hermes3:8b – it resumes from where it stopped |
| Inference extremely slow (1 token/30s+) | System swapping; not enough free RAM | Close other applications; check free -h during inference. Consider a smaller context window in the Modelfile: PARAMETER num_ctx 2048 |
| ollama create fails with “model not found” | Base model not pulled yet | Run ollama pull hermes3:8b before ollama create |
| Service unit changes not taking effect | Forgot to reload systemd | sudo systemctl daemon-reload && sudo systemctl restart ollama |
| ollama ps shows model unloaded after each request | OLLAMA_KEEP_ALIVE too short or set to 0 | Set Environment="OLLAMA_KEEP_ALIVE=10m" in the systemd override (see Configuring Ollama) |
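When an override does not seem to take effect, it helps to ask systemd which environment it actually passes to the service. The sketch below reads the Environment property of the unit and extracts one variable; env_value is a hypothetical helper, and it assumes values contain no spaces, which holds for the settings used in this guide.

```shell
# Hypothetical helper: given a line such as
#   Environment=OLLAMA_KEEP_ALIVE=10m OLLAMA_HOST=127.0.0.1:11434
# on stdin, print the value of the variable named in $1.
env_value() {
    sed 's/^Environment=//' | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Query the running configuration; prints nothing if the variable is unset.
if unit_env=$(systemctl show ollama --property=Environment 2>/dev/null); then
    printf '%s\n' "$unit_env" | env_value OLLAMA_KEEP_ALIVE
else
    echo "could not query systemd"
fi
```

An empty result after editing the override usually points at the missing daemon-reload step from the table above.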
Useful diagnostic commands
# Follow the service log in real time
sudo journalctl -u ollama -f
# Check available disk space for model storage
df -h /usr/share/ollama
# Check available RAM
free -h
# Inspect the installed service unit
cat /etc/systemd/system/ollama.service
# Remove a model if you need to free disk space
ollama rm hermes3:8b