Stack
Prerequisites
- Ubuntu 24.04 LTS, amd64
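- NVIDIA GPU with compute capability 5.0 or higher, which is required for CUDA acceleration. In practice this means Maxwell architecture or newer (GTX 750 Ti and the GTX 900 series onwards); Kepler cards such as the GTX 600 series are compute capability 3.x and do not qualify.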
- At least 16 GB system RAM. The OS, Ollama process, and any KV cache overflow from a VRAM-constrained model all draw from system RAM.
- Disk space: 10 GB for 8B models; 50 GB if pulling 70B.
- sudo access and internet connectivity.
Table of Contents
- 01 Overview and architecture
- 02 The Llama 3.x family on GPU
- 03 VRAM tiers and model selection
- 04 Prerequisites
- 05 Installing NVIDIA drivers
- 06 Installing Ollama
- 07 Verifying GPU detection
- 08 Pulling Llama 3.1
- 09 Configuring Ollama
- 10 Layer offloading for VRAM-constrained cards
- 11 Creating a custom Modelfile
- 12 Testing the installation
- 13 Troubleshooting
Overview and architecture
This guide covers running Meta’s Llama 3.1 locally on Ubuntu 24.04 with NVIDIA GPU acceleration through Ollama. GPU inference is not a marginal improvement over CPU-only operation. A mid-range consumer card running Llama 3.1 8B at Q8_0 (8-bit) quantization produces 40 to 80 tokens per second, compared to 5 to 15 on CPU. That difference decides whether the model is practical for interactive use.
| Component | Role | Location |
|---|---|---|
| NVIDIA driver | Kernel module that exposes the GPU to userspace. Must be installed and loaded before Ollama is installed. Without it, Ollama falls back to CPU silently. | /usr/lib/modules/ |
| CUDA runtime | Ollama bundles its own CUDA runtime. No separate CUDA toolkit installation is required or recommended. | Bundled inside the Ollama binary |
| Ollama binary | Downloads, manages, and serves models. Detects the NVIDIA driver at startup and offloads inference layers to VRAM automatically. | /usr/local/bin/ollama |
| Llama 3.1 weights | GGUF model file loaded into VRAM on first inference. Quantization and size vary by VRAM tier. | /usr/share/ollama/.ollama/models/ |
No CUDA toolkit needed. Installing the CUDA toolkit before Ollama is a common mistake. Ollama ships the CUDA runtime it needs internally. The only external dependency is the NVIDIA kernel driver. Installing unnecessary CUDA packages adds complexity and version conflict risk with no benefit.
The Llama 3.x family on GPU
The Llama 3.x generation spans three sub-releases with different size ranges. On GPU, the calculus shifts: VRAM replaces system RAM as the constraint, and larger models that would be impractical on CPU become viable.
| Model tag | Parameters | VRAM (Q4_K_M) | GPU viable? |
|---|---|---|---|
| llama3.2:1b | 1B | ~1.3 GB | Yes. Fits any modern GPU. Use for edge cases or speed-critical pipelines. |
| llama3.2:3b | 3B | ~2.0 GB | Yes. Excellent on 6 GB cards when you want maximum speed over maximum quality. |
| llama3.1:8b | 8B | ~4.9 GB | Yes. Primary recommendation for 6 GB+ cards. Best quality-to-VRAM ratio in the family. |
| llama3.3:70b | 70B | ~43 GB | Partial. Requires 24 GB VRAM with layer offloading, or two 24 GB cards for full GPU operation. |
Why no Llama 3.2 8B or 3.3 8B? Meta did not release an 8B model in the Llama 3.2 generation, which focused on small (1B, 3B) and vision-capable variants. Llama 3.3 skipped to 70B. The 8B slot belongs exclusively to Llama 3.1, which remains actively maintained and is the correct choice for this size class.
VRAM tiers and model selection
Model weights must fit in VRAM for full GPU acceleration. When they do not, Ollama silently splits transformer layers between GPU and CPU. The split is functional but significantly slower than pure GPU operation due to PCIe data transfer overhead on every forward pass.
| VRAM | Example cards | Recommended model | Notes |
|---|---|---|---|
| 6 GB | RTX 2060, GTX 1660 Ti | llama3.1:8b (Q4_K_M) | Fits with roughly 1 GB headroom. Keep num_ctx at 4096 or lower to control KV cache size. |
| 8 GB | RTX 3070, RTX 4060, RTX 4060 Ti 8GB | llama3.1:8b (Q5_K_M) | Higher quality quantization with comfortable headroom. Context up to 8192 tokens viable. |
| 12 GB | RTX 3060 12GB, RTX 3080 12GB, RTX 4070 | llama3.1:8b (Q8_0) | Near full-precision quality. Large context windows up to 32K viable with flash attention enabled. |
| 16 GB | RTX 4080, RTX 4060 Ti 16GB | llama3.1:8b (Q8_0) | Full quality 8B with generous KV cache headroom. Extended context sessions are comfortable. |
| 24 GB | RTX 3090, RTX 4090 | llama3.3:70b (Q4_K_M) | 70B with layer offloading to system RAM. Significantly faster than CPU-only 70B despite the split. Or run 8B at Q8_0 with a very large context window. |
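The tiers above can be expressed as a small lookup. The sketch below is illustrative only: the function name recommend_tag and the MiB thresholds are assumptions, set slightly below each nominal tier because cards report a little less than their advertised size. None of these values come from Ollama itself.

```shell
# Map a VRAM total in MiB to the model tag suggested for that tier.
# Thresholds are assumptions placed just under each nominal tier size.
recommend_tag() {
  if   [ "$1" -ge 24000 ]; then echo "llama3.3:70b"
  elif [ "$1" -ge 11500 ]; then echo "llama3.1:8b-instruct-q8_0"
  elif [ "$1" -ge 7600 ];  then echo "llama3.1:8b-instruct-q5_K_M"
  elif [ "$1" -ge 5600 ];  then echo "llama3.1:8b"
  else echo "llama3.2:3b"
  fi
}

# Hard-coded example value; on a real system the number could come from:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
recommend_tag 12288
```

With 12288 MiB the function selects the Q8_0 tag, matching the 12 GB tier. Treat the result as a starting point and confirm the actual fit with ollama ps after loading.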
VRAM requirements for Llama 3.1 8B across quantizations:
| Model tag | VRAM (weights) | + 8K KV cache | Total VRAM needed |
|---|---|---|---|
| llama3.1:8b (Q4_K_M) | ~4.9 GB | ~0.5 GB | ~5.4 GB |
| llama3.1:8b-instruct-q5_K_M | ~5.7 GB | ~0.5 GB | ~6.2 GB |
| llama3.1:8b-instruct-q8_0 | ~8.5 GB | ~0.5 GB | ~9.0 GB |
| llama3.3:70b (Q4_K_M) | ~43 GB | ~2.0 GB | ~45 GB |
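The totals above can be checked against a specific card with simple arithmetic. The following is a minimal sketch, assuming a hypothetical helper named vram_fit and a reserve figure (here 0.4 GB for a desktop compositor) that you would adjust for your own system.

```shell
# Decide whether weights plus KV cache fit in VRAM after reserving headroom.
# Arguments, all in GB: weights, kv_cache, total_vram, reserve
# The reserve value is an assumption, not something Ollama reports.
vram_fit() {
  awk -v w="$1" -v kv="$2" -v total="$3" -v reserve="$4" 'BEGIN {
    need  = w + kv
    avail = total - reserve
    if (need <= avail) printf "fits (%.1f GB needed, %.1f GB available)\n", need, avail
    else               printf "offload (%.1f GB needed, %.1f GB available)\n", need, avail
  }'
}

vram_fit 4.9 0.5 6 0.4    # Q4_K_M on a 6 GB card
vram_fit 8.5 0.5 8 0.4    # Q8_0 on an 8 GB card
```

The first call reports a fit and the second reports offloading, which is the case where Ollama would split layers between GPU and CPU.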
Silent partial offloading
When a model does not fully fit in VRAM, Ollama offloads the remainder to system RAM with no terminal warning. You can appear to be running GPU inference while a significant portion of the workload runs on CPU. Token speeds below 10 tok/s on a modern consumer card are the clearest signal. The only definitive check is ollama ps, covered in the testing section.
Confirm the GPU is visible on the PCI bus
lspci | grep -i 'nvidia\|vga'
If the card does not appear, the slot may be disabled in BIOS or the card is not seated correctly. Driver installation cannot recover from a GPU the kernel cannot see.
sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common
Installing NVIDIA drivers
Ubuntu’s ubuntu-drivers tool selects and installs the recommended proprietary driver from Ubuntu’s repositories. It handles kernel module signing, Nouveau blacklisting, and initramfs updates automatically, making it the most reliable path on Ubuntu 24.04.
Secure Boot
If Secure Boot is enabled in BIOS, the NVIDIA kernel module must be signed with a Machine Owner Key (MOK) to load. The installer will prompt you to set a MOK password. You must then enrol that key at the blue MOK manager screen on the next reboot. Skipping the enrolment step results in the driver being installed but failing to load. On a dedicated inference server where Secure Boot is not a requirement, disabling it in BIOS is the fastest resolution.
Check what the tool recommends before installing
ubuntu-drivers devices
Install the recommended driver and reboot
sudo ubuntu-drivers install
sudo reboot
Verify the driver loaded after reboot
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx    Driver Version: 570.xx.xx    CUDA Version: 12.x     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
| 0  NVIDIA GeForce RTX 3060 Off| 00000000:01:00.0 Off |                  N/A |
| 30%   38C    P8    12W / 170W |      0MiB / 12288MiB |      0%      Default |
+-----------------------------------------------------------------------------+
Confirm the driver version line is present, CUDA Version is populated, and the card’s VRAM total appears under Memory-Usage. If nvidia-smi returns command not found or an error, the driver did not load. Check dmesg | grep -i nvidia for kernel-level errors.
Nouveau conflict
The proprietary NVIDIA driver and the open-source Nouveau driver cannot coexist. The installer blacklists Nouveau automatically. If the NVIDIA driver fails to load after reboot, verify the blacklist file exists at /etc/modprobe.d/blacklist-nvidia-nouveau.conf, then run sudo update-initramfs -u and reboot again.
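A plain grep for nouveau also matches lines where it appears only in the "Used by" column of another module. The sketch below matches the module name column exactly. The helper name module_loaded and the sample lsmod text are illustrative assumptions.

```shell
# Report whether a given kernel module appears in lsmod-style output on stdin.
# Exit status 0 means the module name was found in the first column.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Sample lsmod output for illustration only (sizes are made up).
sample='Module                  Size  Used by
nouveau              2285568  1
drm_kms_helper        270336  1 nouveau'

if printf '%s\n' "$sample" | module_loaded nouveau; then
  echo "nouveau is loaded: it conflicts with the proprietary driver"
else
  echo "nouveau is not loaded"
fi
```

On a live system the equivalent check would be lsmod | module_loaded nouveau.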
Installing Ollama
The Ollama installation process is identical to the CPU setup. The key difference is that Ollama detects the NVIDIA driver during installation and configures GPU acceleration automatically. This is why the driver must be installed and confirmed working before this step.
Order matters
Installing Ollama before the NVIDIA driver means Ollama configures itself for CPU-only operation. Even if you install the driver afterwards, Ollama may not pick up GPU acceleration correctly without reinstalling. Always confirm nvidia-smi works before running the Ollama installer.
curl -fsSL https://ollama.com/install.sh | sh
The GPU detection line in the installer output confirms Ollama found the driver. If it is absent, revisit section 05.
ollama --version
Verifying GPU detection
Confirm the service is running, then use debug output to verify Ollama can see and enumerate the GPU before pulling any model.
sudo systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled)
Active: active (running) since ...
Temporarily stop the service and run with debug logging to inspect GPU enumeration:
sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram' | head -20
Press Ctrl+C to stop the debug instance, then restart the service:
sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version
No GPU lines in debug output? If the debug grep returns nothing, Ollama is not detecting the driver. Common causes: Nouveau is still loaded (lsmod | grep nouveau), the ollama system user is not in the render group, or the kernel module loaded with errors (dmesg | grep -i nvidia). Each of these is addressed in the troubleshooting section.
Pulling Llama 3.1
Pull the model matching your VRAM tier from section 03. Use the exact tag for non-default quantizations; omitting the quantization suffix always fetches Q4_K_M.
6 GB VRAM – Q4_K_M (default)
ollama pull llama3.1:8b
8 GB VRAM – Q5_K_M
ollama pull llama3.1:8b-instruct-q5_K_M
12 GB+ VRAM – Q8_0
ollama pull llama3.1:8b-instruct-q8_0
24 GB VRAM – 70B
ollama pull llama3.3:70b
70B on a single 24 GB card
Llama 3.3 70B at Q4_K_M requires approximately 43 GB of VRAM for weights alone. A single 24 GB card cannot hold it entirely. Ollama will offload the excess layers to system RAM automatically. Inference is still substantially faster than CPU-only operation, but it is not full GPU inference. Section 10 covers how to control and monitor the layer split.
Confirm the model is registered:
ollama list
Configuring Ollama
GPU-specific settings are applied through environment variables in a systemd override file. Do not edit the main service unit directly; override files survive package upgrades.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Restrict API to localhost (default, but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"
# Flash attention reduces VRAM KV cache usage by 40-60% on Turing and newer
Environment="OLLAMA_FLASH_ATTENTION=1"
# One model in VRAM at a time
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
GPU-relevant environment variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Enables Flash Attention, reducing VRAM usage for the KV cache by 40 to 60 percent on Turing (RTX 20xx) and newer. Enable on any supported card. |
| OLLAMA_NUM_GPU | All layers | Number of transformer layers to load on GPU. Controls the GPU/CPU split for partial offloading. See section 10. |
| OLLAMA_GPU_MEMORY_FRACTION | 1.0 | Fraction of available VRAM Ollama may use. Set to 0.85 on cards also driving a display to reserve headroom for the compositor. |
| CUDA_VISIBLE_DEVICES | All GPUs | Comma-separated GPU indices Ollama is allowed to use. On single-GPU systems this is unnecessary. |
| OLLAMA_MAX_LOADED_MODELS | 1 | Models held in VRAM simultaneously. Increasing this only makes sense if multiple models fit. |
Card connected to a display? A GPU also running a desktop environment or display output reserves 200 to 400 MB of VRAM for the compositor. On a 6 GB card this is significant. Set OLLAMA_GPU_MEMORY_FRACTION=0.85 to prevent the model load from crowding out the display driver, which causes system instability.
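After editing the override it helps to confirm which value the service will actually receive, since a variable set more than once takes its last assignment. The sketch below parses Environment= lines from unit text on stdin. The helper name unit_env_value and the sample text are assumptions for illustration.

```shell
# Print the effective value of one variable from systemd Environment= lines on stdin.
# systemd uses the last assignment when a variable is set more than once, so this does too.
unit_env_value() {
  sed -n "s/^Environment=\"$1=\(.*\)\"\$/\1/p" | tail -n 1
}

# Sample unit text for illustration; on a real system pipe in: systemctl cat ollama
sample='[Service]
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_FLASH_ATTENTION=1"'

printf '%s\n' "$sample" | unit_env_value OLLAMA_KEEP_ALIVE
```

The helper only understands the one-variable-per-line quoted form used in this guide's override file; other valid systemd spellings would need a more tolerant parser.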
Layer offloading for VRAM-constrained cards
When a model exceeds available VRAM, Ollama distributes transformer layers across GPU and CPU. GPU layers run at VRAM bandwidth speed. CPU layers run at system RAM speed with added PCIe transfer overhead at every layer boundary. Ollama’s automatic split is silent and sometimes miscalculated, particularly when the display driver is consuming VRAM that Ollama cannot account for.
Manual control via OLLAMA_NUM_GPU gives you an explicit and reproducible configuration.
Check the automatic split first
# Load the model with a quick prompt
ollama run llama3.1:8b "hello"
# Inspect the GPU/CPU split
ollama ps
Any CPU percentage means partial offloading is active. Determine how many layers the model has:
ollama show llama3.1:8b | grep -i 'layer\|param'
Manual control via Modelfile
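# Create the working directory first; the redirect below fails if it does not exist
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.llama31-gpu <<'EOF'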
FROM llama3.1:8b
# Load 28 of 32 layers on GPU, remainder handled by CPU
# Adjust based on how much VRAM remains after model load
PARAMETER num_gpu 28
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER repeat_penalty 1.1
EOF
ollama create llama31-gpu -f ~/ollama-models/Modelfile.llama31-gpu
Manual control via environment variable
To apply a layer limit globally across all models loaded by the service:
# Add to /etc/systemd/system/ollama.service.d/override.conf
# Set to 999 to attempt full GPU loading (Ollama caps at actual layer count)
# Reduce in increments of 4 if VRAM is insufficient
Environment="OLLAMA_NUM_GPU=999"
After editing the override file, always reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Partial offload performance penalty
Performance does not degrade linearly with the number of CPU layers. Even 4 CPU layers out of 32 can reduce throughput by 30 to 50 percent because every forward pass requires a synchronisation point at the GPU/CPU boundary. If full GPU fit is not achievable, maximise the number of layers on GPU rather than targeting an even split.
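A rough starting value for the GPU layer count can be estimated from the weight size and the VRAM actually free. This is a sketch under a stated simplification: it assumes every layer is the same size, which is only approximately true, and the helper name gpu_layers is hypothetical.

```shell
# Estimate how many transformer layers fit on the GPU.
# Arguments: weights_mib n_layers free_vram_mib kv_cache_mib
# Assumes uniform layer size; treat the result as a first guess, not a guarantee.
gpu_layers() {
  awk -v w="$1" -v n="$2" -v free="$3" -v kv="$4" 'BEGIN {
    usable = free - kv
    if (usable <= 0) { print 0; exit }
    layers = int(usable * n / w)
    if (layers > n) layers = n
    print layers
  }'
}

# Illustrative numbers: ~4900 MiB of weights, 32 layers, 4000 MiB free, 512 MiB KV cache
gpu_layers 4900 32 4000 512
```

The example prints 22. Use the estimate as the first num_gpu value to try, then adjust based on what ollama ps and nvidia-smi report under load.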
Creating a custom Modelfile
GPU inference opens up larger context windows without the system RAM penalty of CPU operation. The Modelfile below is tuned for a GPU setup with the context ceiling calibrated to a 12 GB card.
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Adjust the FROM tag to match what you pulled in section 08
FROM llama3.1:8b-instruct-q8_0
SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When writing code, include comments explaining
non-obvious decisions. When uncertain, say so explicitly.
"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# 16K context is viable on 12 GB+ VRAM with flash attention enabled
# Reduce to 4096 for 6 GB cards
PARAMETER num_ctx 16384
PARAMETER repeat_penalty 1.1
EOF
ollama create llama31-gpu -f ~/ollama-models/Modelfile.llama31
Context window and VRAM on GPU
On GPU, the KV cache for the context window lives in VRAM. A num_ctx of 16384 consumes approximately 1 GB of VRAM on an 8B model. With flash attention enabled this is reduced by 40 to 60 percent, making large context windows much more practical. On 6 GB cards, keep num_ctx at 4096 and enable OLLAMA_FLASH_ATTENTION=1 in the systemd override.
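The relationship between num_ctx and VRAM can be worked out directly: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes the commonly published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). The bytes-per-element argument is also an assumption, because it depends on the cache precision in use.

```shell
# KV cache size in MiB =
#   2 (K and V) x layers x kv_heads x head_dim x context x bytes per element / 2^20
# Arguments: n_layers n_kv_heads head_dim num_ctx bytes_per_element
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%d\n", 2 * l * h * d * c * b / 1048576 }'
}

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
kv_cache_mib 32 8 128 4096  2   # 16-bit cache entries
kv_cache_mib 32 8 128 16384 2   # 16-bit cache entries
kv_cache_mib 32 8 128 16384 1   # 8-bit cache entries
```

Under these assumptions the three calls give 512, 2048, and 1024 MiB. The roughly 1 GB quoted for a 16K context therefore corresponds to about one byte per cached element; a 16-bit cache would need about twice that.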
Testing the installation
Test 1 -> Service and API health
sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version
Example response:
active
{"version":"0.x.x"}
Test 2 -> Single-shot inference with timing
On GPU the first token should arrive in under 5 seconds. Generation should be clearly faster than CPU-only inference.
time ollama run llama3.1:8b "Explain what a transformer attention mechanism does in two sentences."
Expected result: the response completes in under 10 seconds including model load time, with token generation at 30 to 80 tok/s depending on card and quantization.
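The wall-clock time from the command above includes model load, so it does not isolate generation speed. Ollama's /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds), and dividing one by the other gives tokens per second. The helper name below is hypothetical and the numbers are illustrative, not a measurement.

```shell
# Tokens per second from the eval_count and eval_duration (nanoseconds) fields
# reported in an Ollama /api/generate response.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) { print "0.0"; exit }
    printf "%.1f\n", n * 1e9 / ns
  }'
}

# Illustrative values only: 256 tokens generated in 4 seconds.
tokens_per_second 256 4000000000
```

For a quick interactive check, ollama run with the --verbose flag prints its own timing summary after each response.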
Test 3 -> Confirm full GPU utilisation
This is the critical test. Run it immediately after inference while the model is still loaded:
ollama ps
Example response:
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now
100% GPU confirms the model is fully resident in VRAM. Any CPU percentage indicates partial offloading – address it using section 10 before continuing.
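This check can be scripted so that partial offloading does not go unnoticed. The sketch below reads ollama ps output on stdin and flags any model whose row does not contain "100% GPU". The helper name check_offload is hypothetical, and the second sample row (its ID and split percentages) is made up for illustration.

```shell
# Read `ollama ps` output on stdin and report any model that is not fully on the GPU.
# Assumes a fully resident model shows "100% GPU" in the PROCESSOR column.
check_offload() {
  awk 'NR > 1 && NF > 0 {
    if ($0 ~ /100% GPU/) printf "%s: fully on GPU\n", $1
    else                 printf "%s: NOT fully on GPU\n", $1
  }'
}

# Sample output for illustration only.
sample='NAME            ID              SIZE     PROCESSOR          UNTIL
llama3.1:8b     a1b2c3d4e5f6    5.8 GB   100% GPU           9 minutes from now
llama3.3:70b    0123456789ab    45 GB    47%/53% CPU/GPU    4 minutes from now'

printf '%s\n' "$sample" | check_offload
```

On a live system, run ollama ps | check_offload while the model is still loaded.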
Test 4 -> Live VRAM monitoring during inference
Open a second terminal and run this while inference is active:
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'
Example response:
NVIDIA GeForce RTX 3060, 6142 MiB, 146 MiB, 97 %
High GPU utilisation (80 to 99 percent) during active generation confirms the GPU is doing the work. Utilisation drops near zero between tokens – the GPU works in bursts during each forward pass, not continuously.
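The CSV line from the query above can be reduced to a single VRAM usage percentage, which is easier to log or alert on. This is a minimal sketch: the helper name vram_used_percent is hypothetical, and it approximates the total as used plus free, which can differ slightly from the card's reported total.

```shell
# Convert one CSV line from the nvidia-smi query into a VRAM usage percentage.
# Expects fields: name, memory.used, memory.free, utilization.gpu
vram_used_percent() {
  awk -F', ' '{
    used = $2 + 0; free = $3 + 0      # "+ 0" drops the " MiB" suffix
    if (used + free > 0) printf "%d\n", used * 100 / (used + free)
  }'
}

echo 'NVIDIA GeForce RTX 3060, 6142 MiB, 146 MiB, 97 %' | vram_used_percent
```

For the example line this prints 97, meaning almost no VRAM is left for a larger context window.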
Test 5 -> Multi-turn context retention
ollama run llama3.1:8b
Example conversation:
>>> My server has an RTX 3060 with 12 GB of VRAM.
That gives you solid headroom for 8B models at Q8_0 and comfortable context windows. What are you planning to run on it?
>>> How much VRAM does my GPU have?
Your RTX 3060 has 12 GB of VRAM.
The model recalling the VRAM figure from earlier in the session confirms multi-turn context is working. Exit with /bye or Ctrl+D.
Test 6 -> Custom model variant
ollama run llama31-gpu "Write a bash script that watches a directory and logs any new files created."
Expect clean, commented bash code generated at GPU speed, reflecting the system prompt instruction to explain non-obvious decisions.
All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and the custom variant produces well-commented output. The installation is complete.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| nvidia-smi not found after reboot | Driver did not load | Check the kernel log for NVIDIA errors using the dmesg command listed under Diagnostic commands. Verify the Nouveau blacklist file exists: cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf. Then run sudo update-initramfs -u and reboot. |
| Ollama running but ollama ps shows CPU only | Ollama installed before the driver | Confirm nvidia-smi works, then reinstall Ollama by re-running the install script from the Installing Ollama section. |
| GPU visible in debug output but model runs on CPU | ollama system user missing from render group | sudo usermod -aG render ollama && sudo systemctl restart ollama |
| ollama ps shows partial CPU offload | Model does not fully fit in available VRAM | Use a lower quantization, reduce num_ctx, or set OLLAMA_NUM_GPU explicitly. See section 10. |
| Token speed 5 to 15 tok/s despite GPU detected | Silent partial offloading | Check ollama ps. Any CPU percentage in the PROCESSOR column means offloading is active. |
| VRAM exhaustion or crash mid-inference | Context window too large for available VRAM | Reduce num_ctx in Modelfile. Enable flash attention: OLLAMA_FLASH_ATTENTION=1. |
| Display corruption or system freeze during inference | GPU also driving display; VRAM overcommitted | Set OLLAMA_GPU_MEMORY_FRACTION=0.85 in the systemd override. |
| GPU not detected after kernel upgrade | DKMS module not rebuilt for new kernel | sudo apt install --reinstall nvidia-dkms-570 (replace with installed driver version). Then sudo reboot. |
| Model pull fails or stalls | Network interruption | Re-run ollama pull with the same tag. Downloads resume from the last completed chunk. |
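Because re-running the pull is the fix for an interrupted download, the retry can be automated. The sketch below is a generic retry wrapper. The function name retry, the attempt limit, and the delay are all assumptions to tune for your connection.

```shell
# Re-run a command until it succeeds or the attempt limit is reached.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example (not executed here): up to 5 attempts, 10 seconds apart
#   retry 5 10 ollama pull llama3.1:8b
```

A fixed delay keeps the sketch short; for flaky links a growing delay between attempts may be kinder to the network.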
Diagnostic commands
# Full GPU status
nvidia-smi
# Live VRAM and utilisation monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'
# Check Nouveau is not loaded (output should be empty)
lsmod | grep nouveau
# Kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20
# Ollama service logs
sudo journalctl -u ollama -n 50 --no-pager
# Force GPU detection debug output
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'
# Confirm group membership for current user
groups
# Check DKMS module build status
dkms status
# Check current Ollama environment settings
sudo systemctl cat ollama | grep Environment
Editor’s note
If your hardware limits you to CPU-only inference, the CPU setup guide may be of assistance.