Installing and Configuring Ollama with Hermes 3 on Ubuntu 24.04

Overview and architecture
Prerequisites and hardware requirements
Installing Ollama
Verifying the service
Pulling the Hermes 3 model
Configuring Ollama
Creating a custom Modelfile
Testing the installation
Troubleshooting

Overview and architecture

This guide walks through running Hermes 3, NousResearch’s flagship instruction-following and agentic LLM, locally on Ubuntu 24.04 using Ollama as the runtime layer. No GPU is required.

Understanding what each piece does helps when things go wrong:

Component	Role	Installed to
Ollama	Downloads, serves, and manages GGUF model files. Exposes a REST API on `127.0.0.1:11434` and a CLI wrapper. Runs as a systemd service.	`/usr/local/bin/ollama`
Hermes 3 8B	The actual model weights (Q4_K_M GGUF, ~5 GB on disk). Built on Llama 3.1 8B, fine-tuned by NousResearch for stronger instruction following, multi-turn conversation, and agentic tool use.	`/usr/share/ollama/.ollama/models/` (service install) or `~/.ollama/models/` (run as your own user)
systemd unit	Keeps the Ollama server process alive across reboots. Created automatically by the installer.	`/etc/systemd/system/ollama.service`

Why Hermes 3 8B specifically? The 8B variant is the default for the hermes3 tag in the Ollama library. It runs on CPU-only hardware with 8–16 GB RAM, requires no driver setup, and outperforms Hermes 2 on reasoning, role adherence, and long-context coherence. A 70B variant exists for systems with 48 GB+ RAM or a capable GPU.

Prerequisites and hardware requirements

Before starting, confirm the following:

Ubuntu 24.04 LTS (fresh install or existing), amd64 or arm64
At least 8 GB RAM (16 GB recommended for comfortable headroom alongside the OS)
At least 10 GB free disk space (the model download is ~5 GB; leave room for the binary and temp files)
sudo access on the machine
Internet connectivity (to download Ollama and the model)

RAM note: The 8B model at Q4_K_M quantization uses roughly 5 GB for weights. The KV cache adds 2–4 GB more depending on context length. On a system with exactly 8 GB RAM, close other memory-heavy processes before running inference and keep the context window small.
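
The RAM and disk figures above can be checked with a short preflight script before installing anything. This is a minimal sketch, not part of Ollama: the function names are made up for illustration, the 7 GiB floor is an assumption (a nominal 8 GB machine usually reports slightly less than 8 GiB), and /usr is used as the disk location only because the binary and the service's model directory both live under it.

```shell
#!/bin/sh
# Preflight sketch: thresholds mirror the prerequisites list.
# All function names here are illustrative, not Ollama commands.

# Convert a kB figure (the unit used by /proc/meminfo and `df -k`) to whole GiB, rounding down.
kb_to_gib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Total RAM in kB as reported by the kernel.
mem_total_kb() {
  awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Free space in kB on the filesystem that holds the given path.
disk_free_kb() {
  df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Print PASS or WARN for a measured value (GiB) against a minimum (GiB).
report() {
  if [ "$2" -ge "$3" ]; then
    echo "PASS $1: $2 GiB (minimum $3)"
  else
    echo "WARN $1: $2 GiB (minimum $3)"
  fi
}

# Run both checks. The 7 GiB RAM floor is an assumption, see the note above.
preflight() {
  report "RAM" "$(kb_to_gib "$(mem_total_kb)")" 7
  report "disk under /usr" "$(kb_to_gib "$(disk_free_kb /usr)")" 10
}
```

Save the block to a file, source it, and call preflight. A WARN line is a prompt to free up resources first, not a hard failure.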

Install curl and zstd

The Ollama installer script requires both. zstd is used to extract the compressed binary archive.

sudo apt update
sudo apt install -y curl zstd

Installing Ollama

Ollama provides an official installer script. It detects your system architecture, downloads the correct binary, places it in /usr/local/bin, creates a dedicated ollama system user, and registers a systemd service.

Security practice: If you prefer to review scripts before running them, download the installer first with curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then run sh install.sh when satisfied.

curl -fsSL https://ollama.com/install.sh | sh

The installer prints its progress to stdout. A successful run ends with a line confirming the service has been enabled and started.

Confirm the binary is in your path:

ollama --version

ollama version is 0.x.x

What the installer creates: A system user named ollama (no login shell) owns the service process. Models are stored under /usr/share/ollama/.ollama/models/ when run as a service, or under ~/.ollama/models/ when run as your own user. The installer adds your current user to the ollama group so you can interact with the CLI without sudo.

Verifying the service

Check that the systemd unit is active and set to start on boot:

sudo systemctl status ollama

The output should show active (running) and enabled:

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since ...
   Main PID: 12345 (ollama)
      Tasks: 9
     Memory: 30.2M
        CPU: 210ms
     CGroup: /system.slice/ollama.service
             └─12345 /usr/local/bin/ollama serve

Confirm the API is responding on localhost:

curl -fsS http://127.0.0.1:11434/api/version

{"version":"0.x.x"}

If either check fails, restart the service and check logs:

sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager
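
Right after a restart the API can take a moment to come up, so a single curl may fail even though the service is healthy. A small retry loop covers that gap. This is a sketch: the function name, attempt count, and delay are illustrative choices, and only the /api/version endpoint shown above is used.

```shell
# Poll an Ollama endpoint until it answers or the attempts run out.
# Arguments (all optional): URL, number of attempts, seconds between attempts.
wait_for_api() {
  url=${1:-http://127.0.0.1:11434/api/version}
  attempts=${2:-10}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each try.
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "API answered after $i attempt(s)"
      return 0
    fi
    # Pause only if another attempt follows.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "API did not answer after $attempts attempt(s)" >&2
  return 1
}
```

A typical use is sudo systemctl restart ollama && wait_for_api, which returns non-zero if the API never responds, making it safe to chain in scripts.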

Pulling the Hermes 3 model

Ollama’s pull command downloads and caches the model. The hermes3:8b tag fetches the Hermes 3 build based on Llama 3.1 8B at Q4_K_M quantization, a good balance of output quality and memory footprint for CPU-only systems.

ollama pull hermes3:8b

The download is approximately 4.9 GB. Progress is shown inline.

Verify the model is listed locally:

ollama list

NAME          ID              SIZE      MODIFIED
hermes3:8b    a1b2c3d4e5f6    4.9 GB    X seconds ago

Alternative model sizes:

hermes3:8b is the recommended starting point. If your system has significantly more RAM and you want higher quality at the cost of speed, hermes3:70b is available, but it requires 48+ GB of RAM for CPU-only inference and is impractically slow on most hardware without a GPU.

Configuring Ollama

Ollama reads configuration from environment variables. On systemd installations, the correct place to set these is a service override file, not your shell profile.

Common environment variables

Variable	Default	Purpose
`OLLAMA_HOST`	`127.0.0.1:11434`	Interface and port the API listens on
`OLLAMA_MODELS`	`/usr/share/ollama/.ollama/models`	Directory where model files are stored
`OLLAMA_NUM_PARALLEL`	`1`	Number of parallel requests to handle
`OLLAMA_MAX_LOADED_MODELS`	`1`	Maximum models to keep in memory simultaneously
`OLLAMA_KEEP_ALIVE`	`5m`	How long to keep a model loaded after the last request

Setting configuration via systemd override

Create an override directory for the Ollama service unit:

sudo mkdir -p /etc/systemd/system/ollama.service.d

Create the override file:

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model loaded for 10 minutes between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Restrict to localhost (default, but explicit is better)
Environment="OLLAMA_HOST=127.0.0.1:11434"
EOF

Reload systemd and restart Ollama to apply changes:

sudo systemctl daemon-reload
sudo systemctl restart ollama
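
To confirm the override actually reached the service, you can read back the environment systemd has recorded for the unit. The helper below is a sketch with an illustrative name. It assumes the single-line Environment=KEY=value KEY=value output that systemctl show prints, and it does not handle values that contain spaces.

```shell
# Print the value of one variable from a `systemctl show --property=Environment`
# line read on stdin. Values containing spaces are not handled.
unit_env_value() {
  sed 's/^Environment=//' |
    tr ' ' '\n' |
    awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}
```

Usage: systemctl show ollama --property=Environment | unit_env_value OLLAMA_KEEP_ALIVE should print 10m once the override is active. Running systemctl cat ollama is a quick alternative that shows the unit together with any drop-in files.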

Exposing Ollama on the network:

By default, Ollama only listens on 127.0.0.1 and is not accessible from other machines. If you need network access, set OLLAMA_HOST=0.0.0.0:11434 and restrict access at the firewall level with ufw. Never expose port 11434 publicly without authentication in front of it.
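
As an illustration of the firewall step, the helper below prints the ufw rule that admits a single subnet to the Ollama port. It is a sketch: it only prints the command so you can review it before running it, the function name is made up, and 192.168.1.0/24 in the usage note is a placeholder for your own LAN range.

```shell
# Print (do not run) the ufw command that lets one subnet reach the Ollama port.
# Arguments: subnet in CIDR form, optional port (defaults to 11434).
ufw_allow_cmd() {
  subnet=$1
  port=${2:-11434}
  echo "sudo ufw allow from ${subnet} to any port ${port} proto tcp"
}
```

For example, ufw_allow_cmd 192.168.1.0/24 prints sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp. This assumes ufw is enabled with its default deny-incoming policy, so every other source stays blocked. Check the result with sudo ufw status numbered.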

Creating a custom Modelfile

A Modelfile lets you customise how Hermes 3 behaves: inject a persistent system prompt, adjust inference parameters, or create a named variant. This is optional but useful in practice.

Create a working directory and write a Modelfile:

mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Base model to extend
FROM hermes3:8b

# System prompt - sets persistent context for all conversations
SYSTEM """
You are a helpful and precise AI assistant. Answer questions directly
and thoroughly. When writing code, prefer readability over brevity.
When unsure, say so rather than guessing.
"""

# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
EOF

Build the custom variant under a local name:

ollama create hermes3-custom -f ~/ollama-models/Modelfile.hermes3

Confirm it appears in the model list alongside the base:

ollama list
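
Because ollama create needs the base model to be present locally, it can help to read the FROM line out of a Modelfile and check it against ollama list before building. The helper below is a sketch with an illustrative name. It looks only at the first FROM line and matches the keyword case-insensitively.

```shell
# Print the base model named on the first FROM line of a Modelfile.
# Argument: path to the Modelfile.
modelfile_base() {
  awk 'toupper($1) == "FROM" { print $2; exit }' "$1"
}
```

Usage: base=$(modelfile_base ~/ollama-models/Modelfile.hermes3), then ollama list | awk '{print $1}' | grep -qx "$base" || ollama pull "$base" pulls the base only when it is missing.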

Key Modelfile parameters

Parameter	Typical range	Effect
`temperature`	0.0 – 1.0	Higher values produce more varied output; lower values make it more deterministic
`top_p`	0.7 – 0.95	Controls nucleus sampling; lower values restrict to higher-probability tokens
`num_ctx`	2048 – 131072	Context window in tokens; larger values use more RAM
`repeat_penalty`	1.0 – 1.3	Penalises repeating the same tokens; values above 1.15 can hurt coherence
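
To get a feel for how num_ctx translates into RAM, the KV cache can be estimated with simple arithmetic. The sketch below assumes the Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. These figures are assumptions about the base model rather than anything Ollama reports, and real usage is somewhat higher because of compute buffers.

```shell
# Rough KV-cache size in MiB for a given context length.
# Assumed shape (Llama 3.1 8B): 32 layers, 8 KV heads, head dimension 128,
# 2 bytes per cached value (16-bit cache).
kv_cache_mib() {
  ctx=$1
  layers=32
  kv_heads=8
  head_dim=128
  bytes=2
  # Keys and values are both cached, hence the leading factor of 2.
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}
```

Under these assumptions the cache costs 128 KiB per token: kv_cache_mib 8192 gives 1024 (1 GiB) and kv_cache_mib 32768 gives 4096 (4 GiB), on top of the roughly 5 GB of weights.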

Testing the installation

The following tests move from basic connectivity through to a meaningful generation check. Run them in order; each one builds confidence before the next.

Test 1 -> Service health

Verify Ollama is running and the API responds:

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Test 2 -> Model is listed

Confirm the model was downloaded and is registered:

ollama list

Expected: at least hermes3:8b appears in the output.
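
If you want this check to be scriptable rather than visual, the output of ollama list can be filtered on its first column. A minimal sketch follows. The function name is illustrative, and it assumes the header-plus-rows layout shown earlier, with the full name:tag in the NAME column.

```shell
# Succeed if the named model appears in the NAME column of `ollama list`
# output read from stdin. The header row is skipped.
has_model() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit !found }'
}
```

Usage: ollama list | has_model hermes3:8b && echo "model present". The match is exact, so pass the tag as well as the name.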

Test 3 -> Single-shot CLI inference

Send a one-shot prompt from the command line. This confirms the model loads and generates output without hanging:

ollama run hermes3:8b "Explain what a transformer architecture is in two sentences."

On a CPU-only system expect the first token in 5–30 seconds as the model loads into RAM, then streaming output at ~5–15 tokens per second depending on hardware. A response arriving at all means the model is functional.
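
To put a number on the tokens-per-second figure, the REST API returns eval_count (tokens generated) and eval_duration (nanoseconds) with each non-streamed response. The sketch below computes the rate from those two fields. The function names are illustrative, and measure_rate assumes jq is installed, which this guide does not otherwise require.

```shell
# Tokens per second from a token count and a duration in nanoseconds.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Request one non-streamed completion from the local API and print the
# generation rate. Requires jq (an assumption, not installed by this guide).
measure_rate() {
  model=${1:-hermes3:8b}
  curl -fsS http://127.0.0.1:11434/api/generate \
    -d "{\"model\": \"$model\", \"prompt\": \"Say hello.\", \"stream\": false}" |
    jq -r '"\(.eval_count) \(.eval_duration)"' |
    {
      read -r count duration
      tokens_per_sec "$count" "$duration"
    }
}
```

Calling measure_rate after the model has loaded gives a figure to compare against the 5–15 tokens per second range. Running ollama run hermes3:8b --verbose prints similar timing statistics without needing jq.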

Check memory usage during inference: Run free -h in a second terminal while inference is happening to confirm the model fits comfortably in RAM. If the system starts swapping, responses will slow dramatically and you may need to close other processes.
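
Swap activity can also be read directly from /proc/meminfo, which is convenient for logging during a long prompt. The helper below is a sketch with an illustrative name. It reports swap currently in use, in kB, from meminfo-formatted text on stdin.

```shell
# Print swap currently in use (kB) from /proc/meminfo-formatted text on stdin.
swap_used_kb() {
  awk '/^SwapTotal:/ { total = $2 } /^SwapFree:/ { free = $2 } END { print total - free }'
}
```

Usage: swap_used_kb < /proc/meminfo before and during inference. A figure that keeps climbing while the model generates suggests the weights and cache do not fit in RAM.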

Test 4 -> Interactive session

Open a multi-turn interactive session to test conversational context retention:

ollama run hermes3:8b

At the prompt, enter the following two messages in sequence:

>>> My name is Ivan. 
Nice to meet you, Ivan. How can I help you today?

>>> What is my name? 
Your name is Ivan.

Exit with /bye or Ctrl+D.

Test 5 -> Custom model variant

If you created the custom Modelfile in the "Creating a custom Modelfile" section, verify it runs and the system prompt is active:

ollama run hermes3-custom "Write a Python function to reverse a string."

Expect clean, readable code with comments – reflecting the system prompt instruction to prefer readability over brevity.

Test 6 -> Check model process state

After a successful inference, confirm the model is loaded in memory:

ollama ps

The PROCESSOR column will read 100% CPU on a CPU-only system. UNTIL reflects the OLLAMA_KEEP_ALIVE value set earlier. When this timer expires, the model is unloaded from RAM and reloads on the next request.

All tests passed?
If the service is active, the model is listed, single-shot and interactive prompts both produce sensible output, and ollama ps shows the model loaded, your installation is complete and working correctly.

Troubleshooting

Symptom	Likely cause	Fix
`curl` to `:11434` returns connection refused	Service not running	`sudo systemctl start ollama` then check `journalctl -u ollama -n 30`
`ollama` command not found after install	Shell PATH not updated	Log out and back in, or run `source ~/.bashrc`. The binary is at `/usr/local/bin/ollama`.
Model pull stalls or fails mid-download	Network interruption	Re-run `ollama pull hermes3:8b` – it resumes from where it stopped
Inference extremely slow (1 token/30s+)	System swapping; not enough free RAM	Close other applications; check `free -h` during inference. Consider a smaller context window in the Modelfile: `PARAMETER num_ctx 2048`
`ollama create` fails with “model not found”	Base model not pulled yet	Run `ollama pull hermes3:8b` before `ollama create`
Service unit changes not taking effect	Forgot to reload systemd	`sudo systemctl daemon-reload && sudo systemctl restart ollama`
`ollama ps` shows model unloaded after each request	`OLLAMA_KEEP_ALIVE` too short or set to `0`	Set `Environment="OLLAMA_KEEP_ALIVE=10m"` in the systemd override (see "Setting configuration via systemd override")

Useful diagnostic commands

# Follow the service log in real time
sudo journalctl -u ollama -f

# Check available disk space for model storage
df -h /usr/share/ollama

# Check available RAM
free -h

# Inspect the installed service unit
cat /etc/systemd/system/ollama.service

# Remove a model if you need to free disk space
ollama rm hermes3:8b

Ivan Dabić
