Installing and Configuring Ollama with Hermes 3 on Ubuntu 24.04


Table of Contents

  1. Overview and architecture
  2. Prerequisites and hardware requirements
  3. Installing Ollama
  4. Verifying the service
  5. Pulling the Hermes 3 model
  6. Configuring Ollama
  7. Creating a custom Modelfile
  8. Testing the installation
  9. Troubleshooting

Overview and architecture

This guide walks through running Hermes 3, NousResearch’s flagship instruction-following and agentic LLM, locally on Ubuntu 24.04 using Ollama as the runtime layer. No GPU is required.

Understanding what each piece does helps when things go wrong:

Component | Role | Installed to
Ollama | Downloads, serves, and manages GGUF model files. Exposes a REST API on 127.0.0.1:11434 and a CLI wrapper. Runs as a systemd service. | /usr/local/bin/ollama
Hermes 3 8B | The actual model weights (Q4_K_M GGUF, ~5 GB on disk). Built on Llama 3.1 8B, fine-tuned by NousResearch for stronger instruction following, multi-turn conversation, and agentic tool use. | /usr/share/ollama/.ollama/models/ when pulled through the service (~/.ollama/models/ when Ollama runs as your own user)
systemd unit | Keeps the Ollama server process alive across reboots. Created automatically by the installer. | /etc/systemd/system/ollama.service

Why Hermes 3 8B specifically? The 8B variant is what the default hermes3 tag in the Ollama library points to. It runs on CPU-only hardware with 8–16 GB RAM, requires no driver setup, and outperforms Hermes 2 on reasoning, role adherence, and long-context coherence. A 70B variant exists for systems with 48 GB+ RAM or a capable GPU.

Prerequisites and hardware requirements

Before starting, confirm the following:

  • Ubuntu 24.04 LTS (fresh install or existing), amd64 or arm64
  • At least 8 GB RAM (16 GB recommended for comfortable headroom alongside the OS)
  • At least 10 GB free disk space (the model download is ~5 GB; leave room for the binary and temp files)
  • sudo access on the machine
  • Internet connectivity (to download Ollama and the model)
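
The RAM and disk minimums above can be checked before installing. The following is a minimal sketch, not part of the Ollama tooling: the function names are made up for illustration, and it assumes a Linux system with /proc/meminfo and a POSIX df.

```shell
# Preflight sketch; function names are illustrative, thresholds come from the list above.

# Installed RAM in GiB, rounded to nearest (MemTotal is in KiB and slightly
# below the nominal size, so an "8 GB" machine reports about 7.6).
mem_total_gib() (
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo
)

# Free space in whole GiB on the filesystem that holds the given path.
free_disk_gib() (
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
)

# usage: check_min LABEL ACTUAL REQUIRED  (returns non-zero when below the minimum)
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "ok: $1 $2 >= $3"
  else
    echo "LOW: $1 $2 < $3"
    return 1
  fi
}

# Example:
#   check_min "RAM (GiB)" "$(mem_total_gib)" 8
#   check_min "free disk (GiB)" "$(free_disk_gib /usr/share)" 10
```

The disk check points at /usr/share because that is where the service stores models; adjust the path if you relocate them.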

RAM note: The 8B model at Q4_K_M quantization uses roughly 5 GB for weights. The KV cache adds 2–4 GB more depending on context length. On a system with exactly 8 GB RAM, close other memory-heavy processes before running inference.
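
As a rough way to reason about how the KV cache grows with context length, the sketch below estimates the cache size alone. It assumes the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; none of these numbers come from this guide, and the estimate ignores compute buffers and other runtime overhead, so real usage will be higher.

```shell
# Rough KV cache estimate in MiB for a given context length (in tokens).
# Assumptions (not from this guide): 32 layers, 8 KV heads, head dim 128,
# 2 bytes per value (f16), one K and one V tensor per layer.
kv_cache_mib() (
  ctx="$1"
  bytes_per_token=$(( 2 * 32 * 8 * 128 * 2 ))   # K and V across all layers
  echo $(( bytes_per_token * ctx / 1024 / 1024 ))
)

kv_cache_mib 8192    # prints 1024
kv_cache_mib 32768   # prints 4096
```

Under those assumptions the cache scales linearly with context: about 1 GiB at 8192 tokens and about 4 GiB at 32768 tokens.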

Install curl and zstd

The Ollama installer script requires both. zstd is used to extract the compressed binary archive.

sudo apt update
sudo apt install -y curl zstd

Installing Ollama

Ollama provides an official installer script. It detects your system architecture, downloads the correct binary, places it in /usr/local/bin, creates a dedicated ollama system user, and registers a systemd service.

Security practice: If you prefer to review scripts before running them, download first: curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then run sh install.sh when satisfied.

curl -fsSL https://ollama.com/install.sh | sh

The installer prints its progress to stdout. A successful run ends with a line confirming the service has been enabled and started.

Confirm the binary is in your path:

ollama --version
ollama version is 0.x.x

What the installer creates: A system user named ollama (no login shell) owns the service process. Models are stored under /usr/share/ollama/.ollama/models/ when run as a service, or under ~/.ollama/models/ when run as your own user. The installer adds your current user to the ollama group so you can interact with the CLI without sudo.

Verifying the service

Check that the systemd unit is active and set to start on boot:

sudo systemctl status ollama

The output should show active (running) and enabled:

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since ...
   Main PID: 12345 (ollama)
      Tasks: 9
     Memory: 30.2M
        CPU: 210ms
     CGroup: /system.slice/ollama.service
             └─12345 /usr/local/bin/ollama serve

Confirm the API is responding on localhost:

curl -fsS http://127.0.0.1:11434/api/version
{"version":"0.x.x"}

If either check fails, restart the service and check logs:

sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager
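
Right after a restart the API can take a moment to start answering, so a script that restarts the service and immediately calls curl may see a spurious failure. A minimal polling sketch follows; wait_for_ollama is an illustrative helper, not an Ollama command, and the attempt count and delay are arbitrary defaults.

```shell
# Poll the version endpoint until it answers or the attempts run out.
# usage: wait_for_ollama [BASE_URL] [ATTEMPTS] [DELAY_SECONDS]
wait_for_ollama() (
  base="${1:-http://127.0.0.1:11434}"
  attempts="${2:-30}"
  delay="${3:-1}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS --max-time 2 "$base/api/version" >/dev/null 2>&1; then
      echo "ollama is up at $base"
      exit 0   # leaves only the function's subshell
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  echo "ollama did not answer at $base after $attempts attempts" >&2
  exit 1
)

# Example:
#   sudo systemctl restart ollama && wait_for_ollama
```

The function body runs in a subshell, so its exit calls end the function without closing your terminal session.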

Pulling the Hermes 3 model

Ollama’s pull command downloads and caches the model. The hermes3:8b tag fetches Hermes 3 built on Llama 3.1 8B at Q4_K_M quantization, a good balance of output quality and memory footprint for CPU-only systems.

ollama pull hermes3:8b

The download is approximately 4.9 GB. Progress is shown inline.

Verify the model is listed locally:

ollama list
NAME          ID              SIZE      MODIFIED
hermes3:8b    a1b2c3d4e5f6    4.9 GB    X seconds ago

Alternative model sizes:

  • hermes3:8b is the recommended starting point.
  • hermes3:70b is an option if your system has significantly more RAM and you want higher quality at the cost of speed. It requires 48+ GB of RAM for CPU-only inference and is impractically slow on most hardware without a GPU.

Configuring Ollama

Ollama reads configuration from environment variables. On systemd installations, the correct place to set these is a service override file, not your shell profile.

Common environment variables

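Variable | Default | Purpose
OLLAMA_HOST | 127.0.0.1:11434 | Interface and port the API listens on
OLLAMA_MODELS | /usr/share/ollama/.ollama/models | Directory where model files are stored
OLLAMA_NUM_PARALLEL | 1 | Number of parallel requests to handle
OLLAMA_MAX_LOADED_MODELS | 1 | Maximum models to keep in memory simultaneously
OLLAMA_KEEP_ALIVE | 5m | How long to keep a model loaded after the last request

Defaults for OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS have changed between Ollama releases, so confirm them against the documentation for your installed version.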

Setting configuration via systemd override

Create an override directory for the Ollama service unit:

sudo mkdir -p /etc/systemd/system/ollama.service.d

Create the override file:

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model loaded for 10 minutes between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Restrict to localhost (default, but explicit is better)
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF

Reload systemd and restart Ollama to apply changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama
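
To confirm the override was picked up, one option is to ask systemd for the unit's effective environment. The helper below is a hypothetical convenience, not an Ollama or systemd command; it assumes systemctl show prints the variables as a single space-separated Environment= line and that no value contains spaces.

```shell
# Extract one variable from `systemctl show ... --property=Environment` output.
# unit_env_value is an illustrative helper; it reads that line from stdin.
unit_env_value() (
  key="$1"
  sed 's/^Environment=//' | tr ' ' '\n' | sed -n "s/^${key}=//p"
)

# Example (on the machine running the service):
#   systemctl show ollama --property=Environment | unit_env_value OLLAMA_KEEP_ALIVE
```

If the command prints nothing, the variable is not set on the unit, which usually means the override file was not read or daemon-reload was skipped.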

Exposing Ollama on the network:

By default, Ollama only listens on 127.0.0.1 and is not accessible from other machines. If you need network access, set OLLAMA_HOST=0.0.0.0:11434 and restrict access at the firewall level with ufw. Never expose port 11434 publicly without authentication in front of it.
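
As one possible shape for the firewall restriction, the sketch below prints (rather than runs) a pair of ufw rules that admit a single subnet to the Ollama port and reject everyone else. The subnet is a placeholder and the helper name is illustrative; review the printed commands against your own network before executing them.

```shell
# Print, but do not run, ufw rules limiting the Ollama port to one subnet.
# usage: ollama_ufw_rules SUBNET [PORT]
ollama_ufw_rules() (
  subnet="$1"
  port="${2:-11434}"
  # ufw applies the first matching rule, so the allow must come before the deny
  echo "sudo ufw allow from $subnet to any port $port proto tcp"
  echo "sudo ufw deny $port/tcp"
)

ollama_ufw_rules 192.168.1.0/24
```

Printing the commands keeps the example safe to paste; once they look right, run them by hand in the order shown.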

Creating a custom Modelfile

A Modelfile lets you customise how Hermes 3 behaves: inject a persistent system prompt, adjust inference parameters, or create a named variant. This is optional but useful in practice.

Create a working directory and write a Modelfile:

mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Base model to extend
FROM hermes3:8b

# System prompt - sets persistent context for all conversations
SYSTEM """
You are a helpful and precise AI assistant. Answer questions directly
and thoroughly. When writing code, prefer readability over brevity.
When unsure, say so rather than guessing.
"""

# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
EOF

Build the custom variant under a local name:

ollama create hermes3-custom -f ~/ollama-models/Modelfile.hermes3

Confirm it appears in the model list alongside the base:

ollama list

Key Modelfile parameters

Parameter | Typical range | Effect
temperature | 0.0 – 1.0 | Higher values produce more varied output; lower values make it more deterministic
top_p | 0.7 – 0.95 | Controls nucleus sampling; lower values restrict to higher-probability tokens
num_ctx | 2048 – 131072 | Context window in tokens; larger values use more RAM
repeat_penalty | 1.0 – 1.3 | Penalises repeating the same tokens; values above 1.15 can hurt coherence
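
The same parameters can also be supplied per request through the options field of the REST API, which is handy for trying values before baking them into a Modelfile. A minimal sketch follows; the helper name is illustrative, and the prompt is inserted without JSON escaping, so keep it free of double quotes and backslashes.

```shell
# Build an /api/generate body that overrides temperature and num_ctx for one request.
# usage: options_payload MODEL PROMPT TEMPERATURE NUM_CTX
options_payload() (
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"temperature":%s,"num_ctx":%s}}\n' \
    "$1" "$2" "$3" "$4"
)

# Example (requires the service to be running):
#   options_payload hermes3:8b "Name three Linux distributions." 0.2 4096 |
#     curl -fsS http://127.0.0.1:11434/api/generate -H 'Content-Type: application/json' -d @-
```

Options sent this way apply to that request only; the Modelfile values remain the defaults for the named variant.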

Testing the installation

The following tests move from basic connectivity through to a meaningful generation check. Run them in order; each one builds confidence before the next.

Test 1 -> Service health

Verify Ollama is running and the API responds:

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Test 2 -> Model is listed

Confirm the model was downloaded and is registered:

ollama list

Expected: at least hermes3:8b appears in the output.

Test 3 -> Single-shot CLI inference

Send a one-shot prompt from the command line. This confirms the model loads and generates output without hanging:

ollama run hermes3:8b "Explain what a transformer architecture is in two sentences."

On a CPU-only system expect the first token in 5–30 seconds as the model loads into RAM, then streaming output at ~5–15 tokens per second depending on hardware. A response arriving at all means the model is functional.

Check memory usage during inference: Run free -h in a second terminal while inference is happening to confirm the model fits comfortably in RAM. If the system starts swapping, responses will slow dramatically and you may need to close other processes.
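
Because Ollama also exposes a REST API, the same single-shot check can be made over HTTP, which is closer to how an application would call the model. The sketch below is one way to do it with curl; both helper names are illustrative, and the escaping covers only backslashes and double quotes, so multi-line prompts are out of scope.

```shell
# Build a non-streaming /api/generate request body.
# Escapes backslashes and double quotes in the prompt; nothing else.
generate_payload() (
  model="$1"
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}\n' "$model" "$prompt"
)

# Send the request to the local API (requires the service to be running).
ollama_generate() (
  generate_payload "$1" "$2" |
    curl -fsS http://127.0.0.1:11434/api/generate \
      -H 'Content-Type: application/json' -d @-
)

# Example:
#   ollama_generate hermes3:8b "Explain what a transformer architecture is in two sentences."
```

With stream set to false the server replies with a single JSON object, and the generated text is in its response field.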

Test 4 -> Interactive session

Open a multi-turn interactive session to test conversational context retention:

ollama run hermes3:8b

At the prompt, enter the following two messages in sequence:

>>> My name is Ivan. 
Nice to meet you, Ivan. How can I help you today?

>>> What is my name? 
Your name is Ivan.

Exit with /bye or Ctrl+D.

Test 5 -> Custom model variant

If you created the custom Modelfile in section 7, verify it runs and the system prompt is active:

ollama run hermes3-custom "Write a Python function to reverse a string."

Expect clean, readable code with comments – reflecting the system prompt instruction to prefer readability over brevity.

Test 6 -> Check model process state

After a successful inference, confirm the model is loaded in memory:

ollama ps

The PROCESSOR column will read 100% CPU on a CPU-only system. UNTIL reflects the OLLAMA_KEEP_ALIVE value set earlier. When this timer expires, the model is unloaded from RAM and reloads on the next request.
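
To free RAM without waiting for the timer, the API accepts a keep_alive value on an individual request; the Ollama FAQ describes sending a request with keep_alive set to 0 and no prompt to unload a model. A minimal sketch, with an illustrative helper name:

```shell
# Build a body that asks the server to unload MODEL immediately
# (keep_alive 0, no prompt).
unload_payload() (
  printf '{"model":"%s","keep_alive":0}\n' "$1"
)

# Example (requires the service to be running):
#   unload_payload hermes3:8b | curl -fsS http://127.0.0.1:11434/api/generate -d @-
#   ollama ps    # the model should no longer be listed
```

This affects only the currently loaded copy; the next request reloads the model and the OLLAMA_KEEP_ALIVE timer applies again.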

All tests passed?
If the service is active, the model is listed, single-shot and interactive prompts both produce sensible output, and ollama ps shows the model loaded, your installation is complete and working correctly.

Troubleshooting

Symptom | Likely cause | Fix
curl to :11434 returns connection refused | Service not running | sudo systemctl start ollama then check journalctl -u ollama -n 30
ollama command not found after install | Shell PATH not updated | Log out and back in, or run source ~/.bashrc. The binary is at /usr/local/bin/ollama.
Model pull stalls or fails mid-download | Network interruption | Re-run ollama pull hermes3:8b – it resumes from where it stopped
Inference extremely slow (1 token/30s+) | System swapping; not enough free RAM | Close other applications; check free -h during inference. Consider a smaller context window in the Modelfile: PARAMETER num_ctx 2048
ollama create fails with “model not found” | Base model not pulled yet | Run ollama pull hermes3:8b before ollama create
Service unit changes not taking effect | Forgot to reload systemd | sudo systemctl daemon-reload && sudo systemctl restart ollama
ollama ps shows model unloaded after each request | OLLAMA_KEEP_ALIVE too short or set to 0 | Set Environment="OLLAMA_KEEP_ALIVE=10m" in the systemd override (see section 6)

Useful diagnostic commands

# Follow the service log in real time
sudo journalctl -u ollama -f

# Check available disk space for model storage
df -h /usr/share/ollama

# Check available RAM
free -h

# Inspect the installed service unit
cat /etc/systemd/system/ollama.service

# Remove a model if you need to free disk space
ollama rm hermes3:8b
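
For the slow-inference symptom in the table above, it can help to watch swap use as a single number rather than reading free -h by eye. The helper below is an illustrative sketch that assumes the standard SwapTotal and SwapFree fields of /proc/meminfo; it is not an Ollama command.

```shell
# Swap currently in use, in MiB. Reads /proc/meminfo unless another file is given.
swap_used_mib() (
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END { printf "%d\n", (total - free) / 1024 }' "${1:-/proc/meminfo}"
)

# Example: sample once per second while a prompt is running (Ctrl+C to stop)
#   while sleep 1; do swap_used_mib; done
```

A value that climbs steadily during generation suggests the model and its cache do not fit in RAM.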

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his Secret Santa, go with a Star Wars Lego box!
