Installing and Configuring Ollama with Hermes 3 on Ubuntu 24.04 (NVIDIA GPU)

Prerequisites

Ubuntu 24.04 LTS, amd64
NVIDIA GPU (Maxwell architecture or newer – GTX 750 Ti and GTX 900 series onwards). Ollama requires CUDA compute capability 5.0 or higher; Kepler cards (most of the GTX 600/700 series) fall below that.
At least 16 GB system RAM (separate from VRAM) – the OS, Ollama process, and KV cache overflow need headroom
Disk space: 10 GB minimum; 50 GB if pulling the 70B model
sudo access and internet connectivity

Table of Contents

Overview and architecture
VRAM tiers and model selection
Prerequisites
Installing NVIDIA drivers
Installing Ollama
Verifying GPU detection
Pulling Hermes 3
Configuring Ollama
Layer offloading for VRAM-constrained cards
Creating a custom Modelfile
Testing the installation
Troubleshooting

Overview and architecture

This guide covers running Hermes 3 locally on Ubuntu 24.04 with NVIDIA GPU acceleration via Ollama. Compared with a CPU-only setup, the difference is not incremental: 40-80 tokens per second on a mid-range consumer GPU versus 5-15 on CPU is a fundamentally different user experience.

Component	Role	Location
NVIDIA driver	Kernel module exposing the GPU to userspace. Required before anything else. Ollama will not detect the GPU without it.	`/usr/lib/modules/`
CUDA runtime	Ollama bundles its own CUDA runtime libraries. No separate CUDA toolkit installation is needed.	Bundled inside Ollama binary
Ollama binary	Downloads and serves models. Detects the NVIDIA driver at startup and offloads inference to the GPU automatically.	`/usr/local/bin/ollama`
Hermes 3 weights	GGUF model file. Loaded into VRAM on first inference request. Size and quantization vary by VRAM tier.	`/usr/share/ollama/.ollama/models/`

No CUDA toolkit needed. A common mistake is installing the full CUDA toolkit (several GB) before Ollama. This is unnecessary. Ollama ships with the CUDA runtime it needs. The only dependency is the NVIDIA kernel driver, which is covered in the “Installing NVIDIA drivers” section.

VRAM tiers and model selection

On GPU, the bottleneck shifts from system RAM to VRAM. The model weights must fit in VRAM for full GPU acceleration. If they do not, Ollama silently splits layers between GPU and CPU (functional but significantly slower than pure GPU inference).

Hermes 3 model recommendations by VRAM tier:

VRAM	Example cards	Recommended model	Notes
6 GB	RTX 2060, GTX 1660 Ti	hermes3:8b (Q4_K_M)	Fits with ~1 GB headroom for KV cache. Keep num_ctx at 4096 or below.
8 GB	RTX 3070, RTX 4060, RTX 4060 Ti 8GB	hermes3:8b (Q5_K_M)	Higher quality quantization at comfortable fit. Up to 8192 context.
12 GB	RTX 3060, RTX 3080 12GB, RTX 4070	hermes3:8b (Q8_0)	Near full-precision quality. Large context window viable up to 32K.
16 GB	RTX 4080, RTX 4060 Ti 16GB	hermes3:8b (Q8_0)	Full quality 8B plus generous KV cache headroom. Extended context sessions comfortable.
24 GB	RTX 3090, RTX 3090 Ti, RTX 4090	hermes3:70b (Q4_K_M)	The flagship Hermes experience. 70B at Q4_K_M needs ~43 GB for weights, so it does not fit on a single card: use layer offloading to split across GPU + system RAM, or pair two cards.
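
The tiers above can be turned into a quick lookup. The sketch below is only an illustration: `recommend_model` is a hypothetical helper name, and the MiB cut-offs are assumptions set a little below the nominal sizes because cards usually report slightly less than their advertised capacity.

```shell
#!/bin/sh
# recommend_model VRAM_MIB
# Maps a VRAM total in MiB to the tier recommendations listed above.
# The thresholds are illustrative assumptions, not values defined by Ollama.
recommend_model() {
    vram_mib=$1
    if   [ "$vram_mib" -ge 23000 ]; then echo "hermes3:70b (Q4_K_M), partially offloaded"
    elif [ "$vram_mib" -ge 11500 ]; then echo "hermes3:8b (Q8_0)"
    elif [ "$vram_mib" -ge 7500 ];  then echo "hermes3:8b (Q5_K_M)"
    elif [ "$vram_mib" -ge 5500 ];  then echo "hermes3:8b (Q4_K_M)"
    else echo "below the 6 GB tier"
    fi
}

# Example with a literal value for a 12 GB card
recommend_model 12288   # prints: hermes3:8b (Q8_0)
```

Once the driver is installed, the real total can be fed in with `recommend_model "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)"`.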

If a model does not fully fit in VRAM, Ollama automatically offloads the remainder to system RAM with no warning in the terminal. You can be running what looks like GPU inference while a significant portion is actually on CPU. The only way to confirm full GPU utilisation is ollama ps, covered in the testing section. Inference speeds below 10 tok/s on a modern consumer GPU are a reliable indicator of partial offloading.

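The table below maps each model tag to its approximate VRAM requirement. The KV cache column is a rough allowance: the real figure scales with num_ctx and with the KV cache precision, so treat the totals as a lower bound.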

Model tag	VRAM (weights)	+ 8K KV cache	Total VRAM needed
`hermes3:8b` (Q4_K_M)	~4.9 GB	~0.5 GB	~5.4 GB
`hermes3:8b-llama3.1-q5_K_M`	~5.7 GB	~0.5 GB	~6.2 GB
`hermes3:8b-llama3.1-q8_0`	~8.5 GB	~0.5 GB	~9.0 GB
`hermes3:70b` (Q4_K_M)	~43 GB	~2.0 GB	~45 GB
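
The arithmetic behind the table is just weights plus KV cache, compared against what the card has free. A minimal sketch follows, assuming all sizes are given in GB; `vram_total` and `fits_in_vram` are hypothetical helper names, and the optional reserve argument stands in for whatever a display or other process is already using.

```shell
#!/bin/sh
# vram_total WEIGHTS_GB KV_GB
# Prints the sum with one decimal place.
vram_total() {
    awk -v w="$1" -v k="$2" 'BEGIN { printf "%.1f\n", w + k }'
}

# fits_in_vram WEIGHTS_GB KV_GB CARD_GB [RESERVE_GB]
# Exit status 0 when weights + KV cache + reserve fit on the card, 1 otherwise.
fits_in_vram() {
    awk -v w="$1" -v k="$2" -v c="$3" -v r="${4:-0}" \
        'BEGIN { exit !((w + k + r) <= c) }'
}

# Example: the Q4_K_M row on a 6 GB card
vram_total 4.9 0.5      # prints: 5.4
if fits_in_vram 4.9 0.5 6; then
    echo "fits with nothing else using the card"
fi
```

Adding a reserve of 0.7 GB for a desktop session (`fits_in_vram 4.9 0.5 6 0.7`) tips the same row over the limit, which is why the 6 GB tier is described as tight.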

Verify the GPU is visible to the system

Before installing drivers, confirm Linux can see the card on the PCI bus:

lspci | grep -i 'nvidia\|vga'

If no NVIDIA device appears, the GPU may be seated incorrectly or the PCIe slot may be disabled in BIOS. Address this before continuing: driver installation cannot help with a GPU the kernel cannot see.

Then install the tools used in the following sections:

sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common

Installing NVIDIA drivers

Ubuntu’s ubuntu-drivers tool detects your GPU and installs the recommended proprietary driver from Ubuntu’s repository. This is the most reliable installation path on Ubuntu 24.04 because it handles kernel module signing, initramfs updates, and Nouveau blacklisting automatically.

If Secure Boot is enabled in your BIOS, the NVIDIA kernel module must be signed to load. ubuntu-drivers handles this on Ubuntu 24.04 with MOK (Machine Owner Key) enrolment. You will be prompted to set a MOK password during installation and to enrol the key on the next reboot. If you skip the MOK enrolment step, the driver will install but fail to load. If Secure Boot is causing issues you cannot resolve, disabling it in BIOS is the fastest path forward for a dedicated inference server.

Check the recommended driver before installing

ubuntu-drivers devices

== /sys/bus/pci/devices/0000:01:00.0 ==
modalias : pci:v000010DEd00002503sv...
vendor   : NVIDIA Corporation
model    : GA106 [GeForce RTX 3060]
driver   : nvidia-driver-570 - distro non-free recommended
driver   : nvidia-driver-550 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Install the recommended driver

sudo ubuntu-drivers install
sudo reboot

Verify driver loaded correctly after reboot

nvidia-smi

+------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx    Driver Version: 570.xx.xx    CUDA Version: 12.x            |
|--------------------------------------+----------------------+----------------------+
| GPU  Name               Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf         Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|======================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060      Off| 00000000:01:00.0 Off |                  N/A |
| 30%   38C    P8            12W / 170W|      0MiB / 12288MiB |      0%      Default |
+------------------------------------------------------------------------------------+

Key values to confirm: the driver version line is present, CUDA Version is shown, and Memory-Usage shows the card’s VRAM total. If nvidia-smi is not found or returns an error, the driver did not load. Check dmesg | grep -i nvidia for kernel errors.

Nouveau conflict check

The proprietary NVIDIA driver and the open-source Nouveau driver cannot coexist. ubuntu-drivers blacklists Nouveau automatically. If you previously used Nouveau and the NVIDIA driver fails to load after reboot, run cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf to confirm the blacklist file exists, then sudo update-initramfs -u and reboot again.

Installing Ollama

The installation process is identical to the CPU setup. The difference is that Ollama detects the NVIDIA driver at startup and automatically uses the GPU – no additional configuration needed at this stage.

Install the driver first

Install the NVIDIA driver and reboot before running the Ollama installer. The installer checks for an NVIDIA GPU, and if it finds the card without a working driver it may try to install one itself, bypassing the ubuntu-drivers path described above. GPU discovery itself happens each time the Ollama server starts, so if Ollama was installed before the driver, restarting the service once nvidia-smi works is normally enough to pick up the GPU.

curl -fsSL https://ollama.com/install.sh | sh

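When the installer finishes, its output states whether an NVIDIA GPU was detected. If it reports CPU-only mode instead, revisit the “Installing NVIDIA drivers” section.

Confirm the binary is installed and on your PATH: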

ollama --version

Verifying GPU detection

Check that the Ollama service is running and verify it is seeing the GPU before pulling any models.

sudo systemctl status ollama

● ollama.service - Ollama Service 
    Loaded: loaded (/etc/systemd/system/ollama.service; enabled) 
    Active: active (running) since ... 
  Main PID: 12345 (ollama)

Force GPU detection logging to confirm Ollama sees the card:

# Temporarily start Ollama with debug output to check GPU detection
sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram' | head -20

Stop the debug instance and restart the service normally:

# Ctrl+C to stop the debug instance, then
sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version

No GPU lines in debug output? If OLLAMA_DEBUG=1 shows no GPU-related lines, Ollama is not detecting the driver. Common causes: the NVIDIA driver is installed but Nouveau is still loaded (check lsmod | grep nouveau), the ollama user is not in the video or render group, or the driver loaded with errors (check dmesg | grep -i nvidia).

Pulling Hermes 3

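Pull the model matching your VRAM tier from the table in the “VRAM tiers and model selection” section. The commands below cover each tier explicitly. Tag names can change between library updates, so if a pull fails, check the tags listed on the hermes3 page of the Ollama library.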

6 GB VRAM – Q4_K_M (default tag)

ollama pull hermes3:8b

8 GB VRAM – Q5_K_M

ollama pull hermes3:8b-llama3.1-q5_K_M

12 GB+ VRAM – Q8_0

ollama pull hermes3:8b-llama3.1-q8_0

24 GB VRAM – 70B

ollama pull hermes3:70b

70B on 24 GB VRAM

The 70B model at Q4_K_M requires approximately 43 GB for weights alone. A single 24 GB card cannot hold it entirely in VRAM. Ollama will automatically offload the excess layers to system RAM. This still runs significantly faster than CPU-only inference but is not true full-GPU operation. See the “Layer offloading for VRAM-constrained cards” section for how to control and monitor this split.

Confirm the model is registered locally:

ollama list

Configuring Ollama

GPU-specific configuration layers on top of the standard Ollama environment variables. Set everything in a systemd override file.

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"

# Bind to localhost only (default but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Enable flash attention: lowers attention memory use and speeds up long-context inference
Environment="OLLAMA_FLASH_ATTENTION=1"

# One model in VRAM at a time (increase if you have headroom)
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
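
One way to confirm the override took effect is to ask systemd which environment it passes to the service. The sketch below assumes `systemctl show` prints a single `Environment=NAME=value NAME=value ...` line; `service_env_has` is a hypothetical helper name, not part of Ollama or systemd.

```shell
#!/bin/sh
# service_env_has NAME
# Reads the output of `systemctl show ollama --property=Environment` on stdin
# and succeeds if NAME is set there.
service_env_has() {
    tr ' ' '\n' | sed 's/^Environment=//' | grep -q "^$1="
}

# Example with a captured line; on a live system, pipe in
#   systemctl show ollama --property=Environment
sample='Environment=OLLAMA_KEEP_ALIVE=10m OLLAMA_HOST=127.0.0.1:11434 OLLAMA_FLASH_ATTENTION=1'
if echo "$sample" | service_env_has OLLAMA_FLASH_ATTENTION; then
    echo "flash attention is set for the service"
fi
```

If a variable is missing, the usual causes are a typo in override.conf or a skipped `sudo systemctl daemon-reload`.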

GPU-specific environment variables

Variable	Default	Purpose
`OLLAMA_FLASH_ATTENTION`	`0`	Enables Flash Attention on supported GPUs. On its own it mainly trims attention working memory; roughly halving KV cache VRAM requires pairing it with a quantized cache (`OLLAMA_KV_CACHE_TYPE=q8_0`), which depends on flash attention. Enable on any Turing (RTX 20xx) or newer card.
`num_gpu` (Modelfile parameter or API option, not an environment variable)	All layers that fit	Number of model layers to load on GPU. Used for partial offloading. Covered in “Layer offloading for VRAM-constrained cards”.
`OLLAMA_GPU_OVERHEAD`	`0`	VRAM, in bytes, that Ollama leaves unused on each GPU. Set it to reserve a buffer if the card is also driving a display.
`CUDA_VISIBLE_DEVICES`	All GPUs	Restrict Ollama to specific GPU indices. Useful on multi-GPU systems.
`OLLAMA_MAX_LOADED_MODELS`	3 per GPU	Maximum models held in memory simultaneously, provided they fit. The override above pins it to 1.

GPU also driving a display? If your NVIDIA card is connected to a monitor, the display compositor consumes some VRAM. On a 6 GB card this can be 200-400 MB, which matters when fitting tight quantizations. Set OLLAMA_GPU_OVERHEAD to a few hundred MB expressed in bytes (for example 536870912 for 512 MiB) to leave headroom and prevent the model from crowding out the display driver.

Layer offloading for VRAM-constrained cards

When a model is too large to fit entirely in VRAM, Ollama automatically splits transformer layers between GPU and CPU. GPU layers run fast; CPU layers are limited by system memory bandwidth and run far slower. The split is silent: Ollama does not warn you when it happens. The only signal in normal use is performance, with token generation noticeably below the card’s capability.

Manual layer control gives you explicit authority over the split rather than relying on Ollama’s automatic calculation, which can sometimes be too conservative or too aggressive depending on how much VRAM the display driver is consuming.

Checking automatic layer assignment

After loading a model, inspect how many layers landed on GPU vs CPU:

# Load the model first
ollama run hermes3:8b "hello"

# Then check the split
ollama ps

Any CPU percentage in the PROCESSOR column means partial offloading is active. To see how many layers the 8B model has in total, print the model metadata and look for the block count:

ollama show --verbose hermes3:8b | grep -i 'block_count\|parameters'

Setting manual layer count via Modelfile

Hermes 3 8B has 32 transformer layers. To put a specific number on GPU and the remainder on CPU:

mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3-gpu <<'EOF'
FROM hermes3:8b

# Load 28 of 32 layers on GPU, remainder on CPU
# Adjust based on VRAM headroom reported by nvidia-smi
PARAMETER num_gpu 28

PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        4096
PARAMETER repeat_penalty 1.1
EOF

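# Named hermes3-offload so it is not overwritten by the hermes3-gpu model created later in this guide
ollama create hermes3-offload -f ~/ollama-models/Modelfile.hermes3-gpu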

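Setting per request or per session

num_gpu is a model parameter rather than a server-wide environment variable, so it cannot be set globally in the systemd override. To try a value without building a Modelfile, pass it in the options of an API request, or type /set parameter num_gpu 28 inside an interactive ollama run session:

# Replace 28 with the number appropriate for your VRAM
# Set to 999 to force all layers to GPU (Ollama caps at actual layer count)
curl -fsS http://127.0.0.1:11434/api/generate -d '{
  "model": "hermes3:8b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 28 }
}'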

Finding the right number

Start with num_gpu set to 999 to attempt full GPU loading. If the model loads successfully, you are done. If Ollama reports insufficient VRAM, drop to the model’s layer count and reduce by increments of 4 until the model loads. Monitor with nvidia-smi to see how much VRAM each configuration consumes.
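
The step-down search can be scripted. The sketch below is a rough illustration rather than a tested tuning tool: `find_num_gpu` and `load_with` are hypothetical names, the loader passes `num_gpu` as a request option to the local API, and it assumes a failed load surfaces as an HTTP error that makes `curl -f` exit non-zero.

```shell
#!/bin/sh
# find_num_gpu TOTAL STEP LOADER...
# Calls LOADER with TOTAL, TOTAL-STEP, ... and prints the first value for
# which LOADER succeeds. Returns 1 if none does.
find_num_gpu() {
    total=$1; step=$2; shift 2
    n=$total
    while [ "$n" -gt 0 ]; do
        if "$@" "$n"; then
            echo "$n"
            return 0
        fi
        n=$((n - step))
    done
    return 1
}

# Hypothetical loader: requests one short completion with the given num_gpu.
# Assumes Ollama is listening on 127.0.0.1:11434 and hermes3:8b is pulled.
load_with() {
    curl -fsS http://127.0.0.1:11434/api/generate \
        -d "{\"model\":\"hermes3:8b\",\"prompt\":\"hi\",\"stream\":false,\"options\":{\"num_gpu\":$1}}" \
        >/dev/null
}

# Usage on a live system: find_num_gpu 32 4 load_with
```

A load that succeeds is not proof of a good split, so check `ollama ps` and `nvidia-smi` for the value the search settles on.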

Partial offload performance penalty

Every layer boundary between GPU and CPU introduces a data transfer over PCIe. Performance does not degrade linearly – having even 4 of 32 layers on CPU can cut throughput by 30-50% compared to full GPU operation, because every forward pass must synchronise across the boundary. If partial offloading is unavoidable, putting as many layers as possible on GPU is more important than the exact split.

Creating a custom Modelfile

With GPU acceleration, larger context windows are viable without the RAM penalty of CPU inference. The Modelfile below is tuned for a GPU setup.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Adjust the FROM line to match the tag you pulled
FROM hermes3:8b-llama3.1-q8_0

SYSTEM """
You are a precise and knowledgeable technical assistant. Answer questions
directly and completely. When writing code, prefer clarity over brevity and
include comments explaining non-obvious decisions. When uncertain, say so.
"""

# Inference parameters
PARAMETER temperature    0.7
PARAMETER top_p          0.9
# Larger context window viable on GPU - adjust down for 6 GB VRAM cards
PARAMETER num_ctx        16384
PARAMETER repeat_penalty 1.1
EOF

ollama create hermes3-gpu -f ~/ollama-models/Modelfile.hermes3

Context window and VRAM

On GPU, the KV cache for the context window lives in VRAM, not system RAM. A num_ctx of 16384 consumes roughly 1 to 2 GB of extra VRAM on an 8B model, depending on KV cache precision. On a 6 GB card running Q4_K_M, this leaves very little headroom. Keep num_ctx at 4096 for 6 GB cards and scale up with available VRAM.
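
KV cache size can be estimated from the model architecture: two tensors (keys and values) per layer, each holding kv_heads × head_dim values per token. The sketch below assumes the Llama 3.1 8B layout that Hermes 3 8B is built on (32 layers, 8 KV heads, head dimension 128); these defaults, and the helper name `kv_cache_mib`, are assumptions for illustration. It ignores runtime overheads such as parallel request slots.

```shell
#!/bin/sh
# kv_cache_mib CTX BYTES_PER_ELEM [LAYERS] [KV_HEADS] [HEAD_DIM]
# Prints the estimated KV cache size in MiB.
# BYTES_PER_ELEM: 2 for an f16 cache, 1 as an approximation of an 8-bit cache.
kv_cache_mib() {
    ctx=$1; bytes=$2
    layers=${3:-32}; kv_heads=${4:-8}; head_dim=${5:-128}
    # Leading 2 = one K tensor and one V tensor per layer
    echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1048576 ))
}

kv_cache_mib 4096 2     # prints: 512
kv_cache_mib 16384 2    # prints: 2048
kv_cache_mib 16384 1    # prints: 1024
```

Under these assumptions each token costs 128 KiB with an f16 cache, so doubling num_ctx doubles the KV cache, and an 8-bit cache roughly halves it.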

Testing the installation

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Example response:

active
{"version":"0.x.x"}

Test 2 -> Single-shot inference with timing

On GPU, the first token should arrive in under 5 seconds once the model is loaded, and generation should be noticeably faster than on CPU. The --verbose flag prints timing statistics after the response, including the eval rate in tokens per second.

ollama run --verbose hermes3:8b "Explain what CUDA is in two sentences."

Expected result:

A response in under 10 seconds in total, including model load time, and an eval rate of 30-80 tok/s depending on card and quantization.

Test 3 -> Confirm GPU is being used

This is the most important test. Run it immediately after an inference so the model is still loaded:

ollama ps

Example response:

NAME          ID              SIZE      PROCESSOR    UNTIL
hermes3:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now

100% GPU confirms the model is fully resident in VRAM with no CPU offloading. Any other percentage means partial offloading is active – see the section “Layer offloading for VRAM-constrained cards”.
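
For unattended checks, the same test can be applied to the `ollama ps` output in a script. A minimal sketch, assuming the output keeps a single header line and shows `100% GPU` in the PROCESSOR column for fully resident models; `fully_on_gpu` is a hypothetical helper name.

```shell
#!/bin/sh
# fully_on_gpu
# Reads `ollama ps` output on stdin. Succeeds only if at least one model is
# loaded and every loaded model reports "100% GPU".
fully_on_gpu() {
    awk 'NR > 1 && NF { n++; if ($0 !~ /100% GPU/) bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Usage on a live system:
#   ollama ps | fully_on_gpu && echo "no CPU offloading" || echo "check VRAM fit"
```

The function also fails when no model is loaded, so run it right after an inference request while the model is still resident.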

Test 4 -> VRAM usage during inference

Run this in a second terminal while inference is active to see live VRAM consumption:

watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

Example response:

NVIDIA GeForce RTX 3060, 6142 MiB, 5904 MiB, 98 %

High GPU utilisation (80-99%) during active inference confirms the GPU is doing the work. Utilisation drops to near zero once generation finishes – this is normal; the GPU is only busy while tokens are being produced.

Test 5 -> Multi-turn context retention

ollama run hermes3:8b

Example conversation:

>>> My GPU has 12 GB of VRAM. 
Great, that gives you comfortable headroom for most 8B models and some 13B models at lower quantization.

>>> How much VRAM do I have? 
You have 12 GB of VRAM.

Exit with /bye or Ctrl+D.

Test 6 -> Custom model variant

ollama run hermes3-gpu "Write a Python function to batch process a list of file paths."

Expect clean, commented Python code reflecting the system prompt, generated at GPU speed.

All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and custom variant runs cleanly. The installation is complete.

Troubleshooting

Symptom	Likely cause	Fix
`nvidia-smi` not found after reboot	Driver did not load	Check `dmesg | grep -i nvidia`. Check whether Nouveau is still loaded with `lsmod | grep nouveau`; if it is, confirm the blacklist file exists at `/etc/modprobe.d/blacklist-nvidia-nouveau.conf`, run `sudo update-initramfs -u`, and reboot.
Ollama installed but shows CPU-only in `ollama ps`	Ollama was started before the driver was working	Confirm `nvidia-smi` works, then `sudo systemctl restart ollama`. If that does not help, re-run the installer: `curl -fsSL https://ollama.com/install.sh | sh`
GPU shown in debug but model still runs on CPU	`ollama` user not in `render` group	`sudo usermod -aG render ollama && sudo systemctl restart ollama`
Model loads but `ollama ps` shows partial CPU offload	Model does not fully fit in available VRAM	Use a lower quantization, reduce `num_ctx`, or set `num_gpu` manually. See “Layer offloading for VRAM-constrained cards”.
Token speed 5–15 tok/s despite GPU being detected	Silent partial layer offloading	Run `ollama ps` – if PROCESSOR shows any CPU%, you are offloading. Address VRAM fit.
VRAM usage spikes and crashes mid-inference	Context window too large for VRAM budget	Reduce `num_ctx` in Modelfile. Enable flash attention: `OLLAMA_FLASH_ATTENTION=1`.
Display artifacts or system freeze during inference	GPU also driving display; VRAM overcommitted	Set `OLLAMA_GPU_OVERHEAD` (in bytes, e.g. `536870912`) to reserve display driver headroom.
Ollama does not detect GPU after kernel update	DKMS module not rebuilt for new kernel	`sudo apt install --reinstall nvidia-dkms-570` (replace version with installed driver)

Diagnostic commands

# Check GPU driver and VRAM state
nvidia-smi

# Live VRAM monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

# Check if Nouveau is still loaded (should be empty)
lsmod | grep nouveau

# Check kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20

# Check ollama service logs
sudo journalctl -u ollama -n 50 --no-pager

# Force GPU detection debug output
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'

# Check group membership of the ollama service user
id ollama

# Check DKMS module status
dkms status

Editor’s note

If you are limited by hardware and your computing power comes down to the CPU, this article may be of assistance.

Ivan Dabić
