Stack
Prerequisites
- Ubuntu 24.04 LTS, amd64
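- NVIDIA GPU with compute capability 5.0 or higher (Maxwell architecture or newer, roughly GTX 750 Ti and GTX 900 series onwards). Kepler cards (most of the GTX 600/700 series) report compute capability 3.x and fall below this floor.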
- At least 16 GB system RAM (separate from VRAM) – the OS, Ollama process, and KV cache overflow need headroom
- Disk space: 10 GB minimum; 50 GB if pulling the 70B model
- sudo access and internet connectivity
Table of Contents
- Overview and architecture
- VRAM tiers and model selection
- Prerequisites
- Installing NVIDIA drivers
- Installing Ollama
- Verifying GPU detection
- Pulling Hermes 3
- Configuring Ollama
- Layer offloading for VRAM-constrained cards
- Creating a custom Modelfile
- Testing the installation
- Troubleshooting
Overview and architecture
This guide covers running Hermes 3 locally on Ubuntu 24.04 with NVIDIA GPU acceleration via Ollama. Compared to the CPU-only setup, the difference is not incremental. Inference speeds of 40-80 tokens per second on a mid-range consumer GPU versus 5-15 on CPU represent a fundamentally different user experience.
| Component | Role | Location |
|---|---|---|
| NVIDIA driver | Kernel module exposing the GPU to userspace. Required before anything else. Ollama will not detect the GPU without it. | /usr/lib/modules/ |
| CUDA runtime | Ollama bundles its own CUDA runtime libraries. No separate CUDA toolkit installation is needed. | Bundled inside Ollama binary |
| Ollama binary | Downloads and serves models. Detects the NVIDIA driver at startup and offloads inference to the GPU automatically. | /usr/local/bin/ollama |
| Hermes 3 weights | GGUF model file. Loaded into VRAM on first inference request. Size and quantization vary by VRAM tier. | /usr/share/ollama/.ollama/models/ |
No CUDA toolkit needed. A common mistake is installing the full CUDA toolkit (several GB) before Ollama. This is unnecessary. Ollama ships with the CUDA runtime it needs. The only dependency is the NVIDIA kernel driver, which is covered under “Installing NVIDIA drivers”.
VRAM tiers and model selection
On GPU, the bottleneck shifts from system RAM to VRAM. The model weights must fit in VRAM for full GPU acceleration. If they do not, Ollama silently splits layers between GPU and CPU (functional but significantly slower than pure GPU inference).
Hermes 3 model recommendations by VRAM tier:
| VRAM | Example cards | Recommended model | Notes |
|---|---|---|---|
| 6 GB | RTX 2060, GTX 1660 Ti | hermes3:8b (Q4_K_M) | Fits with ~1 GB headroom for KV cache. Keep num_ctx at 4096 or below. |
| 8 GB | RTX 3070, RTX 4060 Ti | hermes3:8b (Q5_K_M) | Higher-quality quantization with a comfortable fit. Up to 8192 context. |
| 12 GB | RTX 3080 12GB, RTX 4070 | hermes3:8b (Q8_0) | Near full-precision quality. Context windows up to roughly 16K are viable. |
| 16 GB | RTX 4080, RTX 4060 Ti 16GB | hermes3:8b (Q8_0) | Full-quality 8B plus generous KV cache headroom. Extended context sessions are comfortable. |
| 24 GB | RTX 3090, RTX 4090 | hermes3:70b (Q4_K_M) | The flagship Hermes experience, but 70B at Q4_K_M needs ~43 GB for weights alone. Use layer offloading to split across GPU and system RAM, or pair two cards. |
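The tier selection above can be sketched as a small shell helper. This is an illustration only, not part of Ollama: the function name recommend_model is hypothetical, and the MiB thresholds are assumptions set slightly below each nominal tier because nvidia-smi reports a 6 GB card as roughly 6144 MiB.

```shell
#!/bin/sh
# Map total VRAM (in MiB) to the tier recommendations listed in this guide.
# Thresholds are assumptions, rounded down from the nominal card sizes.
recommend_model() {
    vram_mib=$1
    if [ "$vram_mib" -ge 24000 ]; then
        echo "hermes3:70b (Q4_K_M, partial offload)"
    elif [ "$vram_mib" -ge 12000 ]; then
        echo "hermes3:8b (Q8_0)"
    elif [ "$vram_mib" -ge 8000 ]; then
        echo "hermes3:8b (Q5_K_M)"
    elif [ "$vram_mib" -ge 6000 ]; then
        echo "hermes3:8b (Q4_K_M)"
    else
        echo "below the 6 GB tier: expect partial CPU offload"
    fi
}

# Example with a fixed value (a 12 GB card).
recommend_model 12288
```

Once the driver is installed, the input can come from the card itself, for example: recommend_model "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)".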
If a model does not fully fit in VRAM, Ollama automatically offloads the remainder to system RAM with no warning in the terminal. You can be running what looks like GPU inference while a significant portion is actually on CPU. The only way to confirm full GPU utilisation is ollama ps, covered in the testing section. Inference speeds below 10 tok/s on a modern consumer GPU are a reliable indicator of partial offloading.
The table below maps each quantization to its approximate VRAM requirement, assuming Ollama's default f16 KV cache:
| Model (quantization) | VRAM (weights) | + 8K KV cache (f16) | Total VRAM needed |
|---|---|---|---|
| hermes3:8b (Q4_K_M) | ~4.9 GB | ~1.0 GB | ~5.9 GB |
| hermes3:8b (Q5_K_M) | ~5.7 GB | ~1.0 GB | ~6.7 GB |
| hermes3:8b (Q8_0) | ~8.5 GB | ~1.0 GB | ~9.5 GB |
| hermes3:70b (Q4_K_M) | ~43 GB | ~2.5 GB | ~45.5 GB |
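The KV cache figure can be estimated from the model's shape. The sketch below assumes an unquantized f16 cache (2 bytes per element) and the Llama 3.1 8B architecture that Hermes 3 8B is built on: 32 layers, 8 KV heads, and a head dimension of 128. The function name kv_cache_mib is hypothetical, and the estimate ignores the compute buffers Ollama allocates on top, so treat it as a lower bound.

```shell
#!/bin/sh
# Estimate KV cache size in MiB:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; elem_bytes=${5:-2}
    echo $(( 2 * layers * kv_heads * head_dim * ctx * elem_bytes / 1048576 ))
}

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
kv_cache_mib 32 8 128 4096    # prints 512
kv_cache_mib 32 8 128 8192    # prints 1024
kv_cache_mib 32 8 128 16384   # prints 2048
```

Under these assumptions the cache grows linearly with num_ctx at 128 KiB per token, which is why doubling the context window doubles the VRAM it consumes.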
Verify the GPU is visible to the system
Before installing drivers, confirm Linux can see the card on the PCI bus:
lspci | grep -i 'nvidia\|vga'
If no NVIDIA device appears, the GPU may be seated incorrectly or the PCIe slot may be disabled in BIOS. Address this before continuing: driver installation cannot help with a GPU the kernel cannot see.
Then install the packages the later steps rely on:
sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common
Installing NVIDIA drivers
Ubuntu’s ubuntu-drivers tool detects your GPU and installs the recommended proprietary driver from Ubuntu’s repository. This is the most reliable installation path on Ubuntu 24.04: it handles kernel module signing, initramfs updates, and Nouveau blacklisting automatically.
If Secure Boot is enabled in your BIOS, the NVIDIA kernel module must be signed to load. ubuntu-drivers handles this on Ubuntu 24.04 with MOK (Machine Owner Key) enrolment. You will be prompted to set a MOK password during installation and to enrol the key on the next reboot. If you skip the MOK enrolment step, the driver will install but fail to load. If Secure Boot is causing issues you cannot resolve, disabling it in BIOS is the fastest path forward for a dedicated inference server.
Check the recommended driver before installing
ubuntu-drivers devices
Example output:
== /sys/bus/pci/devices/0000:01:00.0 ==
modalias : pci:v000010DEd00002503sv...
vendor   : NVIDIA Corporation
model    : GA106 [GeForce RTX 3060]
driver   : nvidia-driver-570 - distro non-free recommended
driver   : nvidia-driver-550 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
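If you want to capture the recommended package name in a script, the line tagged "recommended" can be picked out with awk. This is a sketch that assumes the output layout shown above; the function name recommended_driver is hypothetical.

```shell
#!/bin/sh
# Print the package name from the line tagged "recommended" in
# `ubuntu-drivers devices` output read from stdin.
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Sample input mirroring the output format shown above.
printf '%s\n' \
  'model    : GA106 [GeForce RTX 3060]' \
  'driver   : nvidia-driver-570 - distro non-free recommended' \
  'driver   : nvidia-driver-550 - distro non-free' |
  recommended_driver
```

On a real system the input would come from the tool itself: ubuntu-drivers devices 2>/dev/null | recommended_driver.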
Install the recommended driver
sudo ubuntu-drivers install
sudo reboot
Verify driver loaded correctly after reboot
nvidia-smi
Example output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx              Driver Version: 570.xx.xx      CUDA Version: 12.x     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:01:00.0 Off |                  N/A |
| 30%   38C    P8             12W /  170W |       0MiB /  12288MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
Key values to confirm: the driver version line is present, CUDA Version is shown, and Memory-Usage shows the card’s VRAM total. If nvidia-smi is not found or returns an error, the driver did not load; check dmesg | grep -i nvidia for kernel errors.
Nouveau conflict check
The proprietary NVIDIA driver and the open-source Nouveau driver cannot coexist. ubuntu-drivers blacklists Nouveau automatically. If you previously used Nouveau and the NVIDIA driver fails to load after reboot, run cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf to confirm the blacklist file exists, then sudo update-initramfs -u and reboot again.
Installing Ollama
The installation process is identical to the CPU setup. The difference is that Ollama detects the NVIDIA driver at startup and automatically uses the GPU – no additional configuration needed at this stage.
Driver-first is mandatory
Install the NVIDIA driver and reboot before running the Ollama installer. Ollama detects the GPU during installation and configures itself accordingly. Installing Ollama before the driver means it sets itself up for CPU-only operation and may not pick up the GPU correctly even after the driver is added later.
curl -fsSL https://ollama.com/install.sh | sh
The installer output should include a line confirming that an NVIDIA GPU was detected. If it does not appear, revisit the driver installation steps above.
ollama --version
Verifying GPU detection
Check that the Ollama service is running and verify it is seeing the GPU before pulling any models.
sudo systemctl status ollama
Example output:
● ollama.service - Ollama Service
  Loaded: loaded (/etc/systemd/system/ollama.service; enabled)
  Active: active (running) since ...
  Main PID: 12345 (ollama)
Force GPU detection logging to confirm Ollama sees the card:
# Temporarily start Ollama with debug output to check GPU detection
sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram' | head -20
Stop the debug instance and restart the service normally:
# Ctrl+C to stop the debug instance, then
sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version
No GPU lines in debug output? If OLLAMA_DEBUG=1 shows no GPU-related lines, Ollama is not detecting the driver. Common causes: the NVIDIA driver is installed but Nouveau is still loaded (check lsmod | grep nouveau), the ollama user is not in the video or render group, or the driver loaded with errors (check dmesg | grep -i nvidia).
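The group-membership cause can be checked without reading logs. The helper below is a minimal sketch: has_gpu_group is a hypothetical name, and it only inspects a group list passed on stdin, such as the one printed by id -nG.

```shell
#!/bin/sh
# Read a space-separated group list (as printed by `id -nG USER`) from stdin
# and succeed if it includes "render" or "video".
has_gpu_group() {
    tr ' ' '\n' | grep -qxE 'render|video'
}

# Sample group lists for illustration.
if echo "ollama render video" | has_gpu_group; then
    echo "ok: GPU device group present"
fi
if ! echo "ollama" | has_gpu_group; then
    echo "missing: add the user with  sudo usermod -aG render ollama"
fi
```

On a real system, run id -nG ollama | has_gpu_group and restart the service after changing group membership.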
Pulling Hermes 3
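Pull the model matching your VRAM tier from the table under “VRAM tiers and model selection”. The commands below cover each tier explicitly. Confirm the quantization-specific tag names on the hermes3 page of the Ollama model library before pulling, since the names listed there are authoritative.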
6 GB VRAM – Q4_K_M (default tag)
ollama pull hermes3:8b
8 GB VRAM – Q5_K_M
ollama pull hermes3:8b-instruct-q5_K_M
12 GB+ VRAM – Q8_0
ollama pull hermes3:8b-instruct-q8_0
24 GB VRAM – 70B
ollama pull hermes3:70b
70B on 24 GB VRAM
The 70B model at Q4_K_M requires approximately 43 GB for weights alone. A single 24 GB card cannot hold it entirely in VRAM. Ollama will automatically offload the excess layers to system RAM. This still runs significantly faster than CPU-only inference but is not true full-GPU operation. See “Layer offloading for VRAM-constrained cards” for how to control and monitor this split.
Confirm the model is registered locally:
ollama list
Configuring Ollama
GPU-specific configuration layers on top of the standard Ollama environment variables. Set everything in a systemd override file.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Bind to localhost only (default but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"
# Enable flash attention: lowers VRAM use and speeds up long-context inference
Environment="OLLAMA_FLASH_ATTENTION=1"
# One model in VRAM at a time (increase if you have headroom)
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
GPU-specific environment variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Enables Flash Attention, which lowers memory use at long context lengths on supported GPUs. Enable on any Turing (RTX 20xx) or newer card. |
| OLLAMA_NUM_GPU | All layers | Number of model layers to load on GPU. Used for partial offloading. Covered under “Layer offloading for VRAM-constrained cards”. |
| OLLAMA_GPU_MEMORY_FRACTION | 1.0 | Fraction of VRAM Ollama is allowed to use. Set to 0.9 to reserve a buffer if the card is also driving a display. Confirm that ollama serve --help lists this variable for your version; OLLAMA_GPU_OVERHEAD, which reserves a fixed number of bytes per GPU, is an alternative way to hold back VRAM. |
| CUDA_VISIBLE_DEVICES | All GPUs | Restrict Ollama to specific GPU indices. Useful on multi-GPU systems. |
| OLLAMA_MAX_LOADED_MODELS | 3 per GPU (recent releases) | Models held in VRAM simultaneously. Set to 1 on tight VRAM budgets; increase only if multiple models fit. |
GPU also driving a display? If your NVIDIA card is connected to a monitor, the display compositor consumes some VRAM. On a 6 GB card this can be 200-400 MB, which matters when fitting tight quantizations. Set OLLAMA_GPU_MEMORY_FRACTION=0.85 to leave headroom and prevent the model from crowding out the display driver.
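To confirm that the systemd override was actually picked up, the service's effective environment can be read back with systemctl show. The helper below is a sketch: service_env_value is a hypothetical name, and the sample line assumes the override file from this section with values that contain no spaces.

```shell
#!/bin/sh
# Extract one variable's value from the line printed by
#   systemctl show ollama --property=Environment
# which looks like: Environment=VAR1=value1 VAR2=value2 ...
service_env_value() {
    var=$1
    sed 's/^Environment=//' | tr ' ' '\n' |
        awk -F= -v name="$var" '$1 == name { sub(/^[^=]*=/, ""); print; exit }'
}

# Sample line matching the override file in this section.
sample='Environment=OLLAMA_KEEP_ALIVE=10m OLLAMA_HOST=127.0.0.1:11434 OLLAMA_FLASH_ATTENTION=1'
echo "$sample" | service_env_value OLLAMA_FLASH_ATTENTION
echo "$sample" | service_env_value OLLAMA_HOST
```

Against the live service: systemctl show ollama --property=Environment | service_env_value OLLAMA_FLASH_ATTENTION. An empty result means the variable is not set for the service.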
Layer offloading for VRAM-constrained cards
When a model is too large to fit entirely in VRAM, Ollama automatically splits transformer layers between GPU and CPU. GPU layers run fast; CPU layers are limited by system memory bandwidth and run far slower. The split happens silently, with no warning from Ollama. The only signal is performance: token generation noticeably below the card’s capability.
Manual layer control gives you explicit authority over the split rather than relying on Ollama’s automatic calculation, which can sometimes be too conservative or too aggressive depending on how much VRAM the display driver is consuming.
Checking automatic layer assignment
After loading a model, inspect how many layers landed on GPU vs CPU:
# Load the model first
ollama run hermes3:8b "hello"
# Then check the split
ollama ps
Any CPU percentage in the PROCESSOR column means partial offloading is active. Check how many layers the 8B model has total:
# --verbose prints the GGUF metadata, where the layer count appears as block_count
ollama show --verbose hermes3:8b | grep -i 'block_count\|param'
Setting manual layer count via Modelfile
Hermes 3 8B has 32 transformer layers. To put a specific number on GPU and the remainder on CPU:
# Create the working directory first so the redirect below does not fail
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3-gpu <<'EOF'
FROM hermes3:8b
# Load 28 of 32 layers on GPU, remainder on CPU
# Adjust based on VRAM headroom reported by nvidia-smi
PARAMETER num_gpu 28
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1
EOF
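# Named hermes3-offload so it is not overwritten by the hermes3-gpu variant created later in this guide
ollama create hermes3-offload -f ~/ollama-models/Modelfile.hermes3-gpu
Setting via environment variable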
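Alternatively, set the layer count globally via the systemd override. This applies to all models. Support for this variable depends on the Ollama version: check that ollama serve --help lists it, and fall back to the Modelfile parameter above if it does not.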
# Add to /etc/systemd/system/ollama.service.d/override.conf
# Replace 28 with the number appropriate for your VRAM
# Set to 999 to force all layers to GPU (Ollama caps at actual layer count)
Environment="OLLAMA_NUM_GPU=28"
Finding the right number
Start with OLLAMA_NUM_GPU=999 to attempt full GPU loading. If the model loads successfully, you are done. If Ollama reports insufficient VRAM, reduce by increments of 4 until the model loads. Monitor with nvidia-smi to see how much VRAM each configuration consumes.
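Rather than stepping down blindly, a rough starting value can be estimated from the weight size and the VRAM left after a reserve for the KV cache, compute buffers, and display. The sketch below assumes every layer is the same size, which is only approximately true; layers_that_fit is a hypothetical helper, and the 80-layer count for the 70B model is an assumption based on the Llama 3.1 70B architecture.

```shell
#!/bin/sh
# Rough starting value for num_gpu, assuming uniformly sized layers.
#   usable = VRAM - reserve (KV cache, compute buffers, display)
#   layers = floor(usable * total_layers / weights), capped at total_layers
layers_that_fit() {
    weights_mib=$1; total_layers=$2; vram_mib=$3; reserve_mib=$4
    usable=$(( vram_mib - reserve_mib ))
    [ "$usable" -lt 0 ] && usable=0
    n=$(( usable * total_layers / weights_mib ))
    [ "$n" -gt "$total_layers" ] && n=$total_layers
    echo "$n"
}

# Hypothetical 4 GB card, 8B Q4_K_M weights (~4900 MiB), 800 MiB reserved.
layers_that_fit 4900 32 4096 800      # prints 21
# 70B Q4_K_M (~43000 MiB, 80 layers assumed) on a 24 GB card, 3000 MiB reserved.
layers_that_fit 43000 80 24576 3000   # prints 40
```

Use the result as the first value to try, then confirm the actual split with ollama ps and adjust from there.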
Partial offload performance penalty
Every layer boundary between GPU and CPU introduces a data transfer over PCIe. Performance does not degrade linearly – having even 4 of 32 layers on CPU can cut throughput by 30-50% compared to full GPU operation, because every forward pass must synchronise across the boundary. If partial offloading is unavoidable, putting as many layers as possible on GPU is more important than the exact split.
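To compare different splits without rebuilding a model each time, num_gpu can also be passed per request in the options field of Ollama's /api/generate endpoint. The helper below only builds the request body; generate_body is a hypothetical name, and the sketch does no JSON escaping, so the prompt must not contain quotes or backslashes.

```shell
#!/bin/sh
# Build a JSON body for /api/generate that sets num_gpu for a single request.
# No JSON escaping is performed: keep quotes and backslashes out of the prompt.
generate_body() {
    model=$1; prompt=$2; num_gpu=$3
    printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_gpu":%d}}\n' \
        "$model" "$prompt" "$num_gpu"
}

generate_body hermes3:8b "hello" 28
# To send it to a running server:
#   generate_body hermes3:8b "hello" 28 |
#     curl -fsS http://127.0.0.1:11434/api/generate -d @-
```

Changing num_gpu between requests forces the model to reload, so allow for load time when comparing throughput across values.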
Creating a custom Modelfile
With GPU acceleration, larger context windows are viable without the RAM penalty of CPU inference. The Modelfile below is tuned for a GPU setup.
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.hermes3 <<'EOF'
# Adjust the FROM line to match the tag you pulled
FROM hermes3:8b-instruct-q8_0
SYSTEM """
You are a precise and knowledgeable technical assistant. Answer questions
directly and completely. When writing code, prefer clarity over brevity and
include comments explaining non-obvious decisions. When uncertain, say so.
"""
# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# Larger context window viable on GPU - adjust down for 6 GB VRAM cards
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.1
EOF
ollama create hermes3-gpu -f ~/ollama-models/Modelfile.hermes3
Context window and VRAM
On GPU, the KV cache for the context window lives in VRAM, not system RAM. A num_ctx of 16384 consumes approximately 2 GB of extra VRAM on an 8B model with the default f16 KV cache. On a 6 GB card running Q4_K_M, that does not fit. Keep num_ctx at 4096 for 6 GB cards and scale up with available VRAM.
Testing the installation
Test 1 -> Service and API health
sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version
Example response:
active
{"version":"0.x.x"}
Test 2 -> Single-shot inference with timing
On GPU, the first token should arrive in under 5 seconds. Generation should be noticeably faster than CPU.
time ollama run hermes3:8b "Explain what CUDA is in two sentences."
Expected response:
Response in under 10 seconds total including model load time. Token generation at 30-80 tok/s depending on card and quantization.
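For a number rather than an impression, a non-streaming /api/generate response reports eval_count (generated tokens) and eval_duration (nanoseconds), from which tokens per second follows directly. The helper below is a sketch: tokens_per_sec is a hypothetical name, the sample values are made up, and the grep patterns assume compact JSON with no space after the colon.

```shell
#!/bin/sh
# Compute generation speed from a non-streaming /api/generate response on stdin.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
tokens_per_sec() {
    json=$(cat)
    count=$(printf '%s' "$json" | grep -o '"eval_count":[0-9]*' | cut -d: -f2)
    nanos=$(printf '%s' "$json" | grep -o '"eval_duration":[0-9]*' | cut -d: -f2)
    [ -n "$count" ] && [ -n "$nanos" ] || { echo "fields not found" >&2; return 1; }
    awk -v c="$count" -v n="$nanos" 'BEGIN { printf "%.1f\n", c / (n / 1e9) }'
}

# Sample response fragment (values are made up for illustration).
echo '{"prompt_eval_count":12,"prompt_eval_duration":90000000,"eval_count":120,"eval_duration":2000000000}' |
    tokens_per_sec
```

Against a running server, pipe the output of curl -fsS http://127.0.0.1:11434/api/generate -d '{"model":"hermes3:8b","prompt":"Explain what CUDA is in two sentences.","stream":false}' into tokens_per_sec. Running ollama run with --verbose prints a comparable eval rate without any scripting.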
Test 3 -> Confirm GPU is being used
This is the most important test. Run it immediately after an inference so the model is still loaded:
ollama ps
Example response:
NAME          ID              SIZE      PROCESSOR    UNTIL
hermes3:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now
100% GPU confirms the model is fully resident in VRAM with no CPU offloading. Any other percentage means partial offloading is active; see “Layer offloading for VRAM-constrained cards”.
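This check can be scripted so that partial offloading is caught automatically. The sketch below parses ollama ps output from stdin; gpu_share is a hypothetical name, and the PROCESSOR formats it handles ("100% GPU", "100% CPU", and a split written as "NN%/MM% CPU/GPU") are assumptions about the output layout that should be verified against your Ollama version. The 70B row in the sample is invented for illustration.

```shell
#!/bin/sh
# Read `ollama ps` output from stdin and print "<model> <GPU percent>" per row.
# Assumed PROCESSOR formats: "100% GPU", "100% CPU", or "NN%/MM% CPU/GPU".
gpu_share() {
    awk 'NR > 1 {
        pct = "unknown"
        for (i = 1; i < NF; i++) {
            if ($(i+1) == "GPU")          { pct = $i; sub(/%/, "", pct) }
            else if ($(i+1) == "CPU")     { pct = 0 }
            else if ($(i+1) == "CPU/GPU") { split($i, p, "/"); pct = p[2]; sub(/%/, "", pct) }
        }
        print $1, pct
    }'
}

# Sample rows; the second one is made up to show a partial split.
printf '%s\n' \
  'NAME          ID              SIZE      PROCESSOR          UNTIL' \
  'hermes3:8b    a1b2c3d4e5f6    5.8 GB    100% GPU           9 minutes from now' \
  'hermes3:70b   0f9e8d7c6b5a    45 GB     48%/52% CPU/GPU    4 minutes from now' |
  gpu_share
```

On a real system, run ollama ps | gpu_share right after an inference; any value below 100 means some layers are on CPU.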
Test 4 -> VRAM usage during inference
Run this in a second terminal while inference is active to see live VRAM consumption:
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'
Example response:
NVIDIA GeForce RTX 3060, 6142 MiB, 6146 MiB, 98 %
High GPU utilisation (80-99%) during active inference confirms the GPU is doing the work. Utilisation drops to near zero between tokens being generated – this is normal; the GPU works in bursts during the forward pass.
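The same query can feed a quick headroom calculation, which is useful when deciding whether a larger num_ctx will fit. This is a sketch: vram_headroom is a hypothetical helper that expects "used, total" in MiB per line, the format produced by nvidia-smi with the csv,noheader,nounits options.

```shell
#!/bin/sh
# Read "used, total" in MiB (one line per GPU) and print free MiB and percent used.
# Matches: nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_headroom() {
    awk -F', *' '{ printf "free=%d MiB used=%d%%\n", $2 - $1, int($1 * 100 / $2) }'
}

# Sample reading for a 12 GB card with a model loaded.
echo "6142, 12288" | vram_headroom
```

On a real system: nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | vram_headroom.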
Test 5 -> Multi-turn context retention
ollama run hermes3:8b
Example conversation:
>>> My GPU has 12 GB of VRAM.
Great, that gives you comfortable headroom for most 8B models and some 13B models at lower quantization.
>>> How much VRAM do I have?
You have 12 GB of VRAM.
Exit with /bye or Ctrl+D.
Test 6 -> Custom model variant
ollama run hermes3-gpu "Write a Python function to batch process a list of file paths."
Expect clean, commented Python code reflecting the system prompt, generated at GPU speed.
All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and custom variant runs cleanly. The installation is complete.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| nvidia-smi not found after reboot | Driver did not load | Check dmesg \| grep -i nvidia. If Nouveau is still loaded (lsmod \| grep nouveau prints output), confirm the blacklist file exists at /etc/modprobe.d/blacklist-nvidia-nouveau.conf, run sudo update-initramfs -u, and reboot. |
| Ollama installed but shows CPU-only in ollama ps | Ollama installed before driver, or driver not detected at install time | Reinstall Ollama after confirming nvidia-smi works: curl -fsSL https://ollama.com/install.sh \| sh |
| GPU shown in debug but model still runs on CPU | ollama user not in render group | sudo usermod -aG render ollama && sudo systemctl restart ollama |
| Model loads but ollama ps shows partial CPU offload | Model does not fully fit in available VRAM | Use a lower quantization, reduce num_ctx, or set num_gpu manually. See “Layer offloading for VRAM-constrained cards”. |
| Token speed 5–15 tok/s despite GPU being detected | Partial layer offloading is silent | Run ollama ps – if PROCESSOR shows any CPU%, you are offloading. Address VRAM fit. |
| VRAM usage spikes and crashes mid-inference | Context window too large for VRAM budget | Reduce num_ctx in Modelfile. Enable flash attention: OLLAMA_FLASH_ATTENTION=1. |
| Display artifacts or system freeze during inference | GPU also driving display; VRAM overcommitted | Set OLLAMA_GPU_MEMORY_FRACTION=0.85 to reserve display driver headroom. |
| Ollama does not detect GPU after kernel update | DKMS module not rebuilt for new kernel | sudo apt install --reinstall nvidia-dkms-570 (replace version with installed driver) |
Diagnostic commands
# Check GPU driver and VRAM state
nvidia-smi
# Live VRAM monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'
# Check if Nouveau is still loaded (should be empty)
lsmod | grep nouveau
# Check kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20
# Check ollama service logs
sudo journalctl -u ollama -n 50 --no-pager
# Force GPU detection debug output (stop the service first: sudo systemctl stop ollama)
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'
# Check group membership for current user
groups
# Check DKMS module status
dkms status
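After a kernel update, the DKMS check can be narrowed to the running kernel. The sketch below applies only if your driver was installed as a DKMS package; nvidia_dkms_ok is a hypothetical helper, and the sample line assumes the "module/version, kernel, arch: installed" layout with placeholder version numbers.

```shell
#!/bin/sh
# Succeed if `dkms status` output (stdin) shows an nvidia module installed
# for the given kernel release.
nvidia_dkms_ok() {
    kernel=$1
    grep -i 'nvidia' | grep -F "$kernel" | grep -q 'installed'
}

# Sample line; version numbers are placeholders.
sample='nvidia/570.xx.xx, 6.8.0-51-generic, x86_64: installed'
if echo "$sample" | nvidia_dkms_ok "6.8.0-51-generic"; then
    echo "module built for this kernel"
fi
if ! echo "$sample" | nvidia_dkms_ok "6.8.0-60-generic"; then
    echo "module missing for 6.8.0-60-generic: reinstall the nvidia-dkms package"
fi
# On a real system:  dkms status | nvidia_dkms_ok "$(uname -r)"
```

An empty dkms status does not necessarily indicate a problem: it can simply mean the driver was installed from prebuilt kernel modules rather than through DKMS.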
Editor’s note: If your hardware limits you to CPU-only inference, the CPU-only version of this guide may be of assistance.