Installing and Configuring Ollama with Hermes 3 on Ubuntu 24.04 (NVIDIA GPU)

Prerequisites

  • Ubuntu 24.04 LTS, amd64
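  • NVIDIA GPU with compute capability 5.0 or higher, which Ollama requires for CUDA: Maxwell architecture or newer (GTX 750 Ti, GTX 900 series onwards). Kepler cards (most of the GTX 600/700 series) are compute capability 3.x and fall below that floor.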
  • At least 16 GB system RAM (separate from VRAM) – the OS, Ollama process, and KV cache overflow need headroom
  • Disk space: 10 GB minimum; 50 GB if pulling the 70B model
  • sudo access and internet connectivity

Table of Contents

  1. Overview and architecture
  2. VRAM tiers and model selection
  3. Prerequisites
  4. Installing NVIDIA drivers
  5. Installing Ollama
  6. Verifying GPU detection
  7. Pulling Hermes 3
  8. Configuring Ollama
  9. Layer offloading for VRAM-constrained cards
  10. Creating a custom Modelfile
  11. Testing the installation
  12. Troubleshooting

Overview and architecture

This guide covers running Hermes 3 locally on Ubuntu 24.04 with NVIDIA GPU acceleration via Ollama. Compared to the CPU-only setup, the difference is not incremental. Inference speeds of 40-80 tokens per second on a mid-range consumer GPU versus 5-15 on CPU represent a fundamentally different user experience.

Component | Role | Location
NVIDIA driver | Kernel module exposing the GPU to userspace. Required before anything else. Ollama will not detect the GPU without it. | /usr/lib/modules/
CUDA runtime | Ollama bundles its own CUDA runtime libraries. No separate CUDA toolkit installation is needed. | Bundled inside Ollama binary
Ollama binary | Downloads and serves models. Detects the NVIDIA driver at startup and offloads inference to the GPU automatically. | /usr/local/bin/ollama
Hermes 3 weights | GGUF model file. Loaded into VRAM on first inference request. Size and quantization vary by VRAM tier. | /usr/share/ollama/.ollama/models/

No CUDA toolkit needed. A common mistake is installing the full CUDA toolkit (several GB) before Ollama. This is unnecessary. Ollama ships with the CUDA runtime it needs. The only dependency is the NVIDIA kernel driver, which is covered in section 04.

VRAM tiers and model selection

On GPU, the bottleneck shifts from system RAM to VRAM. The model weights must fit in VRAM for full GPU acceleration. If they do not, Ollama silently splits layers between GPU and CPU (functional but significantly slower than pure GPU inference).

Hermes 3 model recommendations by VRAM tier:

VRAM | Example cards | Recommended model | Notes
6 GB | RTX 2060 6GB, GTX 1660 Ti | hermes3:8b (Q4_K_M) | Fits with ~1 GB headroom for KV cache. Keep num_ctx at 4096 or below.
8 GB | RTX 3070, RTX 4060, RTX 4060 Ti 8GB | hermes3:8b (Q5_K_M) | Higher quality quantization at comfortable fit. Up to 8192 context.
12 GB | RTX 3060 12GB, RTX 3080 12GB, RTX 4070 | hermes3:8b (Q8_0) | Near full-precision quality. Large context window viable up to 32K.
16 GB | RTX 4080, RTX 4060 Ti 16GB | hermes3:8b (Q8_0) | Full quality 8B plus generous KV cache headroom. Extended context sessions comfortable.
24 GB | RTX 3090, RTX 3090 Ti, RTX 4090 | hermes3:70b (Q4_K_M) | The flagship Hermes experience, but 70B at Q4_K_M needs ~43 GB, more than a single 24 GB card holds. Use layer offloading to split across GPU and system RAM, or pair two cards.

If a model does not fully fit in VRAM, Ollama automatically offloads the remainder to system RAM with no warning in the terminal. You can be running what looks like GPU inference while a significant portion is actually on CPU. The only way to confirm full GPU utilisation is ollama ps, covered in the testing section. Inference speeds below 10 tok/s on a modern consumer GPU are a reliable indicator of partial offloading.

The table below maps quantizations to approximate VRAM requirements for Hermes 3:

Model and quantization | VRAM (weights) | + 8K KV cache | Total VRAM needed
Hermes 3 8B, Q4_K_M (hermes3:8b) | ~4.9 GB | ~0.5 GB | ~5.4 GB
Hermes 3 8B, Q5_K_M | ~5.7 GB | ~0.5 GB | ~6.2 GB
Hermes 3 8B, Q8_0 | ~8.5 GB | ~0.5 GB | ~9.0 GB
Hermes 3 70B, Q4_K_M (hermes3:70b) | ~43 GB | ~2.0 GB | ~45 GB
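
The arithmetic behind these figures can be checked with a small helper. This is a rough sketch, not Ollama's own allocation logic: vram_fit is a hypothetical function name, the inputs are the approximate numbers quoted in this section, and real usage also includes compute buffers that vary with driver and context size.

```bash
# Rough VRAM fit check (hypothetical helper, not part of Ollama).
# Usage: vram_fit WEIGHTS_GB KV_CACHE_GB VRAM_GB
# Prints "fits" when weights + KV cache are within VRAM, otherwise "offload".
vram_fit() {
    awk -v w="$1" -v kv="$2" -v vram="$3" 'BEGIN {
        total = w + kv
        if (total <= vram) print "fits"; else print "offload"
    }'
}

# Approximate figures from this section.
vram_fit 4.9 0.5 6     # 8B Q4_K_M on a 6 GB card
vram_fit 8.5 0.5 8     # 8B Q8_0 on an 8 GB card
vram_fit 43 2.0 24     # 70B Q4_K_M on a 24 GB card
```

A result of "offload" means Ollama would place some layers on the CPU, so expect lower token rates than the card can otherwise deliver.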

Verify the GPU is visible to the system

Before installing drivers, confirm Linux can see the card on the PCI bus:

lspci | grep -i 'nvidia\|vga'

If no NVIDIA device appears, the GPU may be seated incorrectly or the PCIe slot may be disabled in BIOS. (Secure Boot does not hide the card from lspci; it only matters later, when the driver module loads.) Address this before continuing: driver installation cannot help with a GPU the kernel cannot see.

Then install the tools used in the rest of this guide:

sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common
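
The PCI check can also be scripted, which is handy when provisioning several machines. The sketch below is illustrative only: has_nvidia_device is a hypothetical helper name, and it does nothing more than look for the string "nvidia" in lspci output.

```bash
# Reads `lspci` output on stdin and reports whether an NVIDIA device is listed.
# has_nvidia_device is a hypothetical helper name used for illustration.
has_nvidia_device() {
    if grep -qi 'nvidia'; then
        echo "NVIDIA device found on the PCI bus"
    else
        echo "No NVIDIA device found: check seating and BIOS settings"
        return 1
    fi
}

# Only run the real check when lspci is available on this machine.
if command -v lspci >/dev/null 2>&1; then
    lspci | has_nvidia_device || true
fi
```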

Installing NVIDIA drivers

Ubuntu’s ubuntu-drivers tool detects your GPU and installs the recommended proprietary driver from Ubuntu’s repository. This is the most reliable installation path on Ubuntu 24.04 because it handles kernel module signing, initramfs updates, and Nouveau blacklisting automatically.

If Secure Boot is enabled in your BIOS, the NVIDIA kernel module must be signed to load. ubuntu-drivers handles this on Ubuntu 24.04 with MOK (Machine Owner Key) enrollment. You will be prompted to set a MOK password during installation and to enrol the key on the next reboot. If you skip the MOK enrolment step, the driver will install but fail to load. If Secure Boot is causing issues you cannot resolve, disabling it in BIOS is the fastest path forward for a dedicated inference server.

ubuntu-drivers devices

Example output:

== /sys/bus/pci/devices/0000:01:00.0 ==
modalias : pci:v000010DEd00002503sv...
vendor   : NVIDIA Corporation
model    : GA106 [GeForce RTX 3060]
driver   : nvidia-driver-570 - distro non-free recommended
driver   : nvidia-driver-550 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Install the recommended driver and reboot:

sudo ubuntu-drivers install
sudo reboot

Verify driver loaded correctly after reboot

nvidia-smi

Example output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx          Driver Version: 570.xx.xx          CUDA Version: 12.x     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   38C    P8             12W /  170W |       0MiB /  12288MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

Key values to confirm: the driver version line is present, CUDA Version is shown, and Memory-Usage shows the card’s VRAM total. If nvidia-smi is not found or returns an error, the driver did not load; check sudo dmesg | grep -i nvidia for kernel errors.
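
For scripted checks, nvidia-smi can emit the same values as CSV through its --query-gpu option. The sketch below assumes the name, driver_version, and memory.total query fields; summarise_gpu is a hypothetical helper name.

```bash
# Summarise one CSV line produced by:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader,nounits
# summarise_gpu is a hypothetical helper; it reads that line on stdin.
summarise_gpu() {
    awk -F', *' '{ printf "%s: driver %s, %.1f GiB VRAM\n", $1, $2, $3 / 1024 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version,memory.total \
        --format=csv,noheader,nounits | summarise_gpu
else
    echo "nvidia-smi not found: the driver is not installed or not on PATH"
fi
```

With nounits, memory.total is reported in MiB, so the helper divides by 1024 to print GiB.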

Installing Ollama

The installation process is identical to the CPU setup. The difference is that Ollama detects the NVIDIA driver at startup and automatically uses the GPU – no additional configuration needed at this stage.

curl -fsSL https://ollama.com/install.sh | sh

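The installer's output should include a line reporting that it found an NVIDIA GPU, which confirms Ollama detected the driver. If no such line appears, revisit section 04. Then confirm the binary is on your PATH: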

ollama --version

Verifying GPU detection

Check that the Ollama service is running and verify it is seeing the GPU before pulling any models.

sudo systemctl status ollama
● ollama.service - Ollama Service 
    Loaded: loaded (/etc/systemd/system/ollama.service; enabled) 
    Active: active (running) since ... 
  Main PID: 12345 (ollama)

Force GPU detection logging to confirm Ollama sees the card:

# Temporarily start Ollama with debug output to check GPU detection
sudo systemctl stop ollama
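# --line-buffered makes grep print matches immediately instead of holding them in a pipe buffer
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep --line-buffered -i 'cuda\|gpu\|vram' | head -20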

Stop the debug instance and restart the service normally:

# Ctrl+C to stop the debug instance, then
sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version

No GPU lines in debug output? If OLLAMA_DEBUG=1 shows no GPU-related lines, Ollama is not detecting the driver. Common causes: the NVIDIA driver is installed but Nouveau is still loaded (check lsmod | grep nouveau), the ollama user is not in the video or render group, or the driver loaded with errors (check dmesg | grep -i nvidia).
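
The group and Nouveau checks mentioned above can be combined into one short script. This is a sketch under the assumption that the service runs as a user named ollama (as created by the installer); missing_gpu_groups is a hypothetical helper name.

```bash
# Reads a space-separated group list (as printed by `id -nG USER`) on stdin
# and reports which of the video/render groups are missing.
# missing_gpu_groups is a hypothetical helper name.
missing_gpu_groups() (
    read -r groups_line || true
    missing=""
    for g in video render; do
        case " $groups_line " in
            *" $g "*) ;;                      # group present
            *) missing="$missing $g" ;;       # group absent
        esac
    done
    echo "missing:${missing:- none}"
)

# Check the service user, if it exists on this machine.
if id ollama >/dev/null 2>&1; then
    id -nG ollama | missing_gpu_groups
fi

# Nouveau and the proprietary driver cannot both own the card.
if command -v lsmod >/dev/null 2>&1 && lsmod | grep -q '^nouveau'; then
    echo "nouveau is loaded: the NVIDIA driver cannot bind to the card"
fi
```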

Pulling Hermes 3

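Pull the model matching your VRAM tier from the table in section 02. The commands below cover each tier explicitly. Quantization tag names are set by the Ollama library and can change; if a pull fails because the manifest is not found, check the Tags list on the hermes3 library page and use the tag shown there (quantized tags may carry a base-model infix such as llama3.1).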

6 GB VRAM – Q4_K_M (default tag)

ollama pull hermes3:8b

8 GB VRAM – Q5_K_M

ollama pull hermes3:8b-instruct-q5_K_M

12 GB+ VRAM – Q8_0

ollama pull hermes3:8b-instruct-q8_0

24 GB VRAM – 70B

ollama pull hermes3:70b

Confirm the model is registered locally:

ollama list

Configuring Ollama

GPU-specific configuration layers on top of the standard Ollama environment variables. Set everything in a systemd override file.

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"

# Bind to localhost only (default but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Enable flash attention: reduces VRAM used at long context and speeds up long-context inference
Environment="OLLAMA_FLASH_ATTENTION=1"

# One model in VRAM at a time (increase if you have headroom)
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
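
After the restart it is worth confirming that systemd actually applied the override. One way, sketched below, is to read the unit's Environment property with systemctl show; unit_env_value is a hypothetical helper name, and the parsing assumes values without spaces, as in the override above.

```bash
# Extract one variable from the line printed by:
#   systemctl show ollama --property=Environment
# unit_env_value is a hypothetical helper; it reads that line on stdin
# and exits non-zero when the variable is not present.
unit_env_value() (
    key="$1"
    sed 's/^Environment=//' | tr ' ' '\n' |
        awk -F= -v k="$key" '
            $1 == k { print substr($0, length(k) + 2); found = 1 }
            END { exit !found }'
)

if command -v systemctl >/dev/null 2>&1; then
    systemctl show ollama --property=Environment 2>/dev/null |
        unit_env_value OLLAMA_FLASH_ATTENTION ||
        echo "OLLAMA_FLASH_ATTENTION is not set on the running unit"
fi
```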

GPU-specific environment variables

Variable | Default | Purpose
OLLAMA_FLASH_ATTENTION | 0 | Enables Flash Attention, which reduces memory use as the context grows and speeds up long-context inference on supported GPUs. Enable on any Turing (RTX 20xx) or newer card.
OLLAMA_NUM_GPU | All layers | Intended to set the number of model layers loaded on GPU. Upstream Ollama does not list this variable in ollama serve --help; the num_gpu model parameter is the safer control. Covered in section 09.
OLLAMA_GPU_OVERHEAD | 0 | VRAM, in bytes, that Ollama leaves unused on each GPU. Set it to reserve a buffer if the card is also driving a display. (OLLAMA_GPU_MEMORY_FRACTION is not listed by ollama serve --help.)
CUDA_VISIBLE_DEVICES | All GPUs | Restrict Ollama to specific GPU indices. Useful on multi-GPU systems.
OLLAMA_MAX_LOADED_MODELS | 3 per GPU | Models held in memory simultaneously (the override above pins it to 1). Increase only if multiple models fit.

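GPU also driving a display? If your NVIDIA card is connected to a monitor, the display compositor consumes some VRAM. On a 6 GB card this can be 200-400 MB, which matters when fitting tight quantizations. To leave headroom and prevent the model from crowding out the display driver, set OLLAMA_GPU_OVERHEAD to a byte count, for example 536870912 for 512 MiB. OLLAMA_GPU_MEMORY_FRACTION is not among the variables printed by ollama serve --help, so confirm any such setting against your build before relying on it.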
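
One way to express display headroom is as a byte count. The sketch below assumes your Ollama build recognises OLLAMA_GPU_OVERHEAD, a per-GPU reservation in bytes that recent releases list under ollama serve --help; verify that on your version before using it. mib_to_bytes is a hypothetical helper name.

```bash
# Convert a headroom figure in MiB to a byte count suitable for
# OLLAMA_GPU_OVERHEAD (assumed to be supported by your Ollama build).
# mib_to_bytes is a hypothetical helper name.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print a systemd override line that reserves 512 MiB per GPU.
printf 'Environment="OLLAMA_GPU_OVERHEAD=%s"\n' "$(mib_to_bytes 512)"
```

The printed line can be added to the [Service] section of the override file, followed by a daemon-reload and a service restart.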

Layer offloading for VRAM-constrained cards

When a model is too large to fit entirely in VRAM, Ollama automatically splits transformer layers between GPU and CPU. GPU layers run fast; CPU layers are limited by system RAM bandwidth, which is far lower than VRAM bandwidth. The split is silent: Ollama does not warn you when it happens. The only signal is performance, with token generation noticeably below the card’s capability.

Manual layer control gives you explicit authority over the split rather than relying on Ollama’s automatic calculation, which can sometimes be too conservative or too aggressive depending on how much VRAM the display driver is consuming.

Checking automatic layer assignment

After loading a model, inspect how many layers landed on GPU vs CPU:

# Load the model first
ollama run hermes3:8b "hello"

# Then check the split
ollama ps

Any CPU percentage in the PROCESSOR column means partial offloading is active. Check how many layers the 8B model has total:

ollama show hermes3:8b | grep -i 'layers\|param'

Setting manual layer count via Modelfile

Hermes 3 8B has 32 transformer layers. To put a specific number on GPU and the remainder on CPU:

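# Create the working directory if it does not exist yet
mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.hermes3-offload <<'EOF'
FROM hermes3:8b

# Load 28 of 32 layers on GPU, remainder on CPU
# Adjust based on VRAM headroom reported by nvidia-smi
PARAMETER num_gpu 28

PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        4096
PARAMETER repeat_penalty 1.1
EOF

# Named hermes3-offload so it does not collide with the hermes3-gpu variant created in the next section
ollama create hermes3-offload -f ~/ollama-models/Modelfile.hermes3-offload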
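
A starting value for num_gpu can be estimated from the weight size and the free VRAM. The helper below is a back-of-the-envelope sketch, not how Ollama itself computes the split: estimate_gpu_layers is a hypothetical name, and it assumes every layer is the same size, which is only approximately true.

```bash
# Rough estimate of how many layers fit on the GPU (hypothetical helper).
# Usage: estimate_gpu_layers WEIGHTS_GB TOTAL_LAYERS FREE_VRAM_GB RESERVE_GB
# RESERVE_GB is VRAM kept back for the KV cache and compute buffers.
estimate_gpu_layers() {
    awk -v w="$1" -v n="$2" -v free="$3" -v reserve="$4" 'BEGIN {
        per_layer = w / n                       # assume equal-sized layers
        fit = int((free - reserve) / per_layer)
        if (fit < 0) fit = 0                    # nothing fits
        if (fit > n) fit = n                    # cannot exceed the layer count
        print fit
    }'
}

# 8B Q8_0 (~8.5 GB, 32 layers) with 8 GB free and 1 GB held in reserve
estimate_gpu_layers 8.5 32 8 1
```

Treat the result as a first guess: load the model, check ollama ps and nvidia-smi, and adjust up or down from there.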

Setting via environment variable

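Some guides set the layer count globally through the systemd override so that it applies to all models. Upstream Ollama does not list OLLAMA_NUM_GPU among the variables printed by ollama serve --help, so treat the line below as an assumption to verify on your build: restart the service, load a model, and check whether the split reported by ollama ps changes. If it does not, use the num_gpu parameter (in a Modelfile or in API request options) instead.

# Add to /etc/systemd/system/ollama.service.d/override.conf
# Replace 28 with the number appropriate for your VRAM
# Set to 999 to request all layers on GPU (Ollama caps at actual layer count)
Environment="OLLAMA_NUM_GPU=28"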

Creating a custom Modelfile

With GPU acceleration, larger context windows are viable without the RAM penalty of CPU inference. The Modelfile below is tuned for a GPU setup.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Adjust the FROM line to match the tag you pulled
FROM hermes3:8b-instruct-q8_0

SYSTEM """
You are a precise and knowledgeable technical assistant. Answer questions
directly and completely. When writing code, prefer clarity over brevity and
include comments explaining non-obvious decisions. When uncertain, say so.
"""

# Inference parameters
PARAMETER temperature    0.7
PARAMETER top_p          0.9
# Larger context window viable on GPU - adjust down for 6 GB VRAM cards
PARAMETER num_ctx        16384
PARAMETER repeat_penalty 1.1
EOF

ollama create hermes3-gpu -f ~/ollama-models/Modelfile.hermes3

Testing the installation

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Example response:

active
{"version":"0.x.x"}

Test 2 -> Single-shot inference with timing

On GPU, the first token should arrive in under 5 seconds. Generation should be noticeably faster than CPU.

time ollama run hermes3:8b "Explain what CUDA is in two sentences."

Expected response:

Response in under 10 seconds total including model load time. Token generation at 30-80 tok/s depending on card and quantization.
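
For an exact figure rather than a wall-clock estimate, ollama run accepts a --verbose flag that prints timing statistics, and a non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below turns those two fields into tokens per second; tokens_per_second is a hypothetical helper name and the sample numbers are made up for illustration.

```bash
# Compute generation speed from the eval_count and eval_duration fields
# of an Ollama /api/generate response. eval_duration is in nanoseconds.
# tokens_per_second is a hypothetical helper name.
tokens_per_second() {
    awk -v c="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", c / (ns / 1e9) }'
}

# The fields can be read from a non-streaming request such as:
#   curl -fsS http://127.0.0.1:11434/api/generate \
#     -d '{"model":"hermes3:8b","prompt":"Explain what CUDA is in two sentences.","stream":false}'

# Illustrative values: 400 tokens generated in 8 seconds
tokens_per_second 400 8000000000
```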

Test 3 -> Confirm GPU is being used

This is the most important test. Run it immediately after an inference so the model is still loaded:

ollama ps

Example response:

NAME          ID              SIZE      PROCESSOR    UNTIL
hermes3:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now

100% GPU confirms the model is fully resident in VRAM with no CPU offloading. Any other value means partial offloading is active; see the section “Layer offloading for VRAM-constrained cards”.
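
Because this check is easy to forget, it can be wrapped in a small script that flags anything other than a full GPU load. The sketch below only pattern-matches the PROCESSOR text in ollama ps output; check_offload is a hypothetical helper name.

```bash
# Reads `ollama ps` output on stdin and prints one verdict per loaded model.
# check_offload is a hypothetical helper name.
check_offload() {
    awk 'NR > 1 && NF {
        if ($0 ~ /100% GPU/)      print $1 ": fully on GPU"
        else if ($0 ~ /100% CPU/) print $1 ": running on CPU only"
        else                      print $1 ": split between CPU and GPU"
    }'
}

# Run against the live server when the ollama CLI is installed.
if command -v ollama >/dev/null 2>&1; then
    ollama ps | check_offload
fi
```

The header row is skipped with NR > 1, and the model name is taken from the first column.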

Test 4 -> VRAM usage during inference

Run this in a second terminal while inference is active to see live VRAM consumption:

watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

Example response:

NVIDIA GeForce RTX 3060, 6142 MiB, 6146 MiB, 98 %

High GPU utilisation (80-99%) during active inference confirms the GPU is doing the work. Utilisation drops back to near zero when no tokens are being generated. This is normal: the GPU works in bursts during the forward pass.

Test 5 -> Multi-turn context retention

ollama run hermes3:8b

Example conversation:

>>> My GPU has 12 GB of VRAM. 
Great, that gives you comfortable headroom for most 8B models and some 13B models at lower quantization.

>>> How much VRAM do I have? 
You have 12 GB of VRAM.

Exit with /bye or Ctrl+D.

Test 6 -> Custom model variant

ollama run hermes3-gpu "Write a Python function to batch process a list of file paths."

Expect clean, commented Python code reflecting the system prompt, generated at GPU speed.

All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and custom variant runs cleanly. The installation is complete.

Troubleshooting

Symptom: nvidia-smi not found after reboot
Likely cause: Driver did not load
Fix: Check sudo dmesg | grep -i nvidia. If Nouveau is still loaded (lsmod | grep nouveau), confirm the blacklist file exists at /etc/modprobe.d/blacklist-nvidia-nouveau.conf, run sudo update-initramfs -u, and reboot.

Symptom: Ollama installed but shows CPU-only in ollama ps
Likely cause: Ollama installed before driver, or driver not detected at install time
Fix: Reinstall Ollama after confirming nvidia-smi works: curl -fsSL https://ollama.com/install.sh | sh

Symptom: GPU shown in debug but model still runs on CPU
Likely cause: ollama user not in render group
Fix: sudo usermod -aG render ollama && sudo systemctl restart ollama

Symptom: Model loads but ollama ps shows partial CPU offload
Likely cause: Model does not fully fit in available VRAM
Fix: Use a lower quantization, reduce num_ctx, or set num_gpu manually. See section 09.

Symptom: Token speed 5–15 tok/s despite GPU being detected
Likely cause: Partial layer offloading is silent
Fix: Run ollama ps. If PROCESSOR shows any CPU%, you are offloading. Address VRAM fit.

Symptom: VRAM usage spikes and crashes mid-inference
Likely cause: Context window too large for VRAM budget
Fix: Reduce num_ctx in Modelfile. Enable flash attention: OLLAMA_FLASH_ATTENTION=1.

Symptom: Display artifacts or system freeze during inference
Likely cause: GPU also driving display; VRAM overcommitted
Fix: Reserve display driver headroom, for example with OLLAMA_GPU_OVERHEAD (a byte count), or move to a smaller quantization.

Symptom: Ollama does not detect GPU after kernel update
Likely cause: DKMS module not rebuilt for new kernel
Fix: sudo apt install --reinstall nvidia-dkms-570 (replace version with installed driver)

Diagnostic commands

# Check GPU driver and VRAM state
nvidia-smi

# Live VRAM monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

# Check if Nouveau is still loaded (should be empty)
lsmod | grep nouveau

# Check kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20

# Check ollama service logs
sudo journalctl -u ollama -n 50 --no-pager

# Force GPU detection debug output (stop the service first: sudo systemctl stop ollama)
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'

# Check group membership for the ollama service user (look for video and render)
id -nG ollama

# Check DKMS module status
dkms status

Ivan Dabić
