Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04 (NVIDIA GPU)

Prerequisites

  • Ubuntu 24.04 LTS, amd64
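  • NVIDIA GPU, Maxwell architecture or newer (GTX 750 Ti, GTX 900 series onwards). Ollama requires compute capability 5.0 or higher; Kepler cards (most of the GTX 600 and 700 series) are compute capability 3.x and fall below that.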
  • At least 16 GB system RAM. The OS, Ollama process, and any KV cache overflow from a VRAM-constrained model all draw from system RAM.
  • Disk space: 10 GB for 8B models; 50 GB if pulling 70B.
  • sudo access and internet connectivity.

Table of Contents

  1. Overview and architecture
  2. The Llama 3.x family on GPU
  3. VRAM tiers and model selection
  4. Prerequisites
  5. Installing NVIDIA drivers
  6. Installing Ollama
  7. Verifying GPU detection
  8. Pulling Llama 3.1
  9. Configuring Ollama
  10. Layer offloading for VRAM-constrained cards
  11. Creating a custom Modelfile
  12. Testing the installation
  13. Troubleshooting

Overview and architecture

This guide covers running Meta’s Llama 3.1 locally on Ubuntu 24.04 with NVIDIA GPU acceleration through Ollama. GPU inference is not a marginal improvement over CPU-only operation. A mid-range consumer card running Llama 3.1 8B at Q8_0 quantization produces roughly 40 to 80 tokens per second, compared to 5 to 15 on CPU. That difference decides whether the model is usable interactively.

| Component | Role | Location |
|---|---|---|
| NVIDIA driver | Kernel module that exposes the GPU to userspace. Must be installed and loaded before Ollama is installed. Without it, Ollama falls back to CPU silently. | /usr/lib/modules/ |
| CUDA runtime | Ollama bundles its own CUDA runtime. No separate CUDA toolkit installation is required or recommended. | Bundled inside the Ollama binary |
| Ollama binary | Downloads, manages, and serves models. Detects the NVIDIA driver at startup and offloads inference layers to VRAM automatically. | /usr/local/bin/ollama |
| Llama 3.1 weights | GGUF model file loaded into VRAM on first inference. Quantization and size vary by VRAM tier. | /usr/share/ollama/.ollama/models/ |

No CUDA toolkit needed. Installing the CUDA toolkit before Ollama is a common mistake. Ollama ships the CUDA runtime it needs internally. The only external dependency is the NVIDIA kernel driver. Installing unnecessary CUDA packages adds complexity and version conflict risk with no benefit.

The Llama 3.x family on GPU

The Llama 3.x generation spans three sub-releases with different size ranges. On GPU, the calculus shifts: VRAM replaces system RAM as the constraint, and larger models that would be impractical on CPU become viable.

| Model tag | Parameters | VRAM (Q4_K_M) | GPU viable? |
|---|---|---|---|
| llama3.2:1b | 1B | ~1.3 GB | Yes. Fits any modern GPU. Use for edge cases or speed-critical pipelines. |
| llama3.2:3b | 3B | ~2.0 GB | Yes. Excellent on 6 GB cards when you want maximum speed over maximum quality. |
| llama3.1:8b | 8B | ~4.9 GB | Yes. Primary recommendation for 6 GB+ cards. Best quality-to-VRAM ratio in the family. |
| llama3.3:70b | 70B | ~43 GB | Partial. Requires 24 GB VRAM with layer offloading, or two 24 GB cards for full GPU operation. |

Why no Llama 3.2 8B or 3.3 8B? Meta did not release an 8B model in the Llama 3.2 generation, which focused on small (1B, 3B) and vision-capable variants. Llama 3.3 skipped to 70B. The 8B slot belongs exclusively to Llama 3.1, which remains actively maintained and is the correct choice for this size class.

VRAM tiers and model selection

Model weights must fit in VRAM for full GPU acceleration. When they do not, Ollama silently splits transformer layers between GPU and CPU. The split is functional but significantly slower than pure GPU operation due to PCIe data transfer overhead on every forward pass.

6 GB
RTX 2060, GTX 1660 Ti
llama3.1:8b (Q4_K_M)
Fits with well under 1 GB of headroom. Keep num_ctx at 4096 or lower to control KV cache size.

8 GB
RTX 3070, RTX 4060
llama3.1:8b (Q5_K_M)
Higher quality quantization with comfortable headroom. Context up to 8192 tokens viable.

12 GB
RTX 3060 12GB, RTX 3080 12GB, RTX 4070
llama3.1:8b (Q8_0)
Near full-precision quality. Context windows up to 16K are viable; 32K generally needs flash attention plus a quantized KV cache.

16 GB
RTX 4080, RTX 4060 Ti 16GB
llama3.1:8b (Q8_0)
Full quality 8B with generous KV cache headroom. Extended context sessions are comfortable.

24 GB
RTX 3090, RTX 3090 Ti, RTX 4090
llama3.3:70b (Q4_K_M)
70B with layer offloading to system RAM. Significantly faster than CPU-only 70B despite the split. Or run 8B at Q8_0 with a very large context window.
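
The tiers above can be turned into a small lookup. The sketch below is illustrative only: the function name suggest_tag is hypothetical, the MiB thresholds are assumptions set slightly below the nominal sizes (cards report a little less than their marketing capacity), and the fallback to llama3.2:3b for cards under 6 GB is a judgment call based on the family table rather than a tier listed here.

```shell
#!/usr/bin/env bash
# Map total VRAM in MiB to the model tag suggested by the tiers above.
# Thresholds are assumptions, set just under 6/8/12/24 GB.
suggest_tag() {
  local vram_mib=$1
  if   [ "$vram_mib" -ge 24000 ]; then echo "llama3.3:70b"
  elif [ "$vram_mib" -ge 12000 ]; then echo "llama3.1:8b-instruct-q8_0"
  elif [ "$vram_mib" -ge 8000  ]; then echo "llama3.1:8b-instruct-q5_K_M"
  elif [ "$vram_mib" -ge 6000  ]; then echo "llama3.1:8b"
  else echo "llama3.2:3b"
  fi
}

# Query the first GPU only when the driver tools are present.
if command -v nvidia-smi >/dev/null 2>&1; then
  total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n1)
  case $total in
    ''|*[!0-9]*) ;;  # no usable number, skip the suggestion
    *) echo "Detected ${total} MiB VRAM, suggested tag: $(suggest_tag "$total")" ;;
  esac
fi
```

On a 24 GB card the suggestion is the 70B model with offloading; the 8B Q8_0 tag remains the alternative if speed matters more than model size.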

VRAM requirements for Llama 3.1 8B across quantizations:

| Model tag | VRAM (weights) | + 8K KV cache (f16) | Total VRAM needed |
|---|---|---|---|
| llama3.1:8b (Q4_K_M) | ~4.9 GB | ~1.0 GB | ~5.9 GB |
| llama3.1:8b-instruct-q5_K_M | ~5.7 GB | ~1.0 GB | ~6.7 GB |
| llama3.1:8b-instruct-q8_0 | ~8.5 GB | ~1.0 GB | ~9.5 GB |
| llama3.3:70b (Q4_K_M) | ~43 GB | ~2.5 GB | ~45.5 GB |
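
The KV cache column can be estimated from the model shape. The sketch below assumes the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128) and 2 bytes per value for an f16 cache; the function name kv_cache_mib is hypothetical. An 8-bit cache would come out at roughly half these figures, and Ollama's real allocation also includes compute buffers that this arithmetic ignores.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM NUM_CTX BYTES_PER_VALUE
# Two tensors (K and V) per layer, one vector per KV head per token.
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 num_ctx=$4 bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * bytes * num_ctx / 1024 / 1024 ))
}

# Llama 3.1 8B shape (assumed): 32 layers, 8 KV heads, head dim 128
kv_cache_mib 32 8 128 8192 2     # f16 cache at 8K context, in MiB
kv_cache_mib 32 8 128 16384 2    # f16 cache at 16K context, in MiB
```

The cache grows linearly with num_ctx, which is why halving the context is usually the quickest way to recover VRAM on a small card.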

Confirm the GPU is visible on the PCI bus

lspci | grep -i 'nvidia\|vga'

If the card does not appear, the slot may be disabled in BIOS or the card is not seated correctly. Driver installation cannot recover from a GPU the kernel cannot see.

Install the base packages used in later steps

sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common

Installing NVIDIA drivers

Ubuntu’s ubuntu-drivers tool selects and installs the recommended proprietary driver from Ubuntu’s repositories. It handles kernel module signing, Nouveau blacklisting, and initramfs updates automatically, making it the most reliable path on Ubuntu 24.04.

Check what the tool recommends before installing

ubuntu-drivers devices
sudo ubuntu-drivers install
sudo reboot

Verify the driver loaded after reboot

nvidia-smi
+-----------------------------------------------------------------------------+ 
| NVIDIA-SMI 570.xx.xx Driver Version: 570.xx.xx CUDA Version: 12.x           | 
|-------------------------------+----------------------+----------------------+ 
| GPU Name Persistence-M        | Bus-Id Disp.A        | Volatile Uncorr. ECC | 
| Fan Temp Perf Pwr:Usage/Cap   | Memory-Usage         | GPU-Util Compute M.  |
|===============================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A                  | 
| 30% 38C P8 12W / 170W         | 0MiB / 12288MiB      | 0% Default           | 
+-----------------------------------------------------------------------------+

Confirm the driver version line is present, CUDA Version is populated, and the card’s VRAM total appears under Memory-Usage. If nvidia-smi returns command not found or an error, the driver did not load. Check dmesg | grep -i nvidia for kernel-level errors.

Installing Ollama

The Ollama installation process is identical to the CPU setup. The key difference is that Ollama detects the NVIDIA driver during installation and configures GPU acceleration automatically. This is why the driver must be installed and confirmed working before this step.

curl -fsSL https://ollama.com/install.sh | sh

The GPU detection line in the installer output confirms Ollama found the driver. If it is absent, revisit section 05.

ollama --version

Verifying GPU detection

Confirm the service is running, then use debug output to verify Ollama can see and enumerate the GPU before pulling any model.

sudo systemctl status ollama
● ollama.service - Ollama Service 
    Loaded: loaded (/etc/systemd/system/ollama.service; enabled) 
    Active: active (running) since ...

Temporarily stop the service and run with debug logging to inspect GPU enumeration:

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram' | head -20

Press Ctrl+C to stop the debug instance, then restart the service:

sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version

No GPU lines in debug output? If the debug grep returns nothing, Ollama is not detecting the driver. Common causes: Nouveau is still loaded (lsmod | grep nouveau), the ollama system user is not in the render group, or the kernel module loaded with errors (dmesg | grep -i nvidia). Each of these is addressed in the troubleshooting section.
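
Two of the causes listed above can be checked mechanically. This is a minimal sketch, not part of Ollama: the helper names nouveau_loaded and in_groups are hypothetical, and the checks only run when lsmod and the ollama user exist on the machine.

```shell
# Succeeds if lsmod output on stdin lists the nouveau module.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Succeeds if the space-separated group list in $1 contains group $2.
in_groups() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v lsmod >/dev/null 2>&1 && lsmod | nouveau_loaded; then
  echo "nouveau is loaded: the NVIDIA driver cannot bind to the GPU"
fi

if id ollama >/dev/null 2>&1 && ! in_groups "$(id -nG ollama)" render; then
  echo "the ollama user is not in the render group"
fi
```

No output from either check means these two causes are ruled out, which leaves the kernel log as the next place to look.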

Pulling Llama 3.1

Pull the model matching your VRAM tier from section 03. Use the exact tag for non-default quantizations; for llama3.1:8b and llama3.3:70b, omitting the quantization suffix fetches Q4_K_M.

6 GB VRAM – Q4_K_M (default)

ollama pull llama3.1:8b

8 GB VRAM – Q5_K_M

ollama pull llama3.1:8b-instruct-q5_K_M

12 GB+ VRAM – Q8_0

ollama pull llama3.1:8b-instruct-q8_0

24 GB VRAM – 70B

ollama pull llama3.3:70b

Confirm the model is registered:

ollama list

Configuring Ollama

GPU-specific settings are applied through environment variables in a systemd override file. Do not edit the main service unit directly; override files survive package upgrades.

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"

# Restrict API to localhost (default, but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Flash attention lowers attention memory use as context grows (Turing and newer)
# and is required for a quantized KV cache (OLLAMA_KV_CACHE_TYPE)
Environment="OLLAMA_FLASH_ATTENTION=1"

# One model in VRAM at a time
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

GPU-relevant environment variables

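| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_FLASH_ATTENTION | 0 | Enables Flash Attention, which lowers attention memory use as the context grows on Turing (RTX 20xx) and newer. It is also required before the KV cache can be quantized with OLLAMA_KV_CACHE_TYPE; a q8_0 cache uses roughly half the VRAM of the f16 default. Enable on any supported card. |
| num_gpu (model parameter, not an environment variable) | Automatic | Number of transformer layers to load on GPU. Controls the GPU/CPU split for partial offloading. Set it with PARAMETER num_gpu in a Modelfile or in the options of an API request. See section 10. |
| OLLAMA_GPU_OVERHEAD | 0 | VRAM, in bytes, that Ollama leaves unused on each GPU. On cards also driving a display, a value such as 536870912 (512 MiB) reserves headroom for the compositor. |
| CUDA_VISIBLE_DEVICES | All GPUs | Comma-separated GPU indices Ollama is allowed to use. On single-GPU systems this is unnecessary. |
| OLLAMA_MAX_LOADED_MODELS | 3 per GPU (per Ollama's FAQ) | Models held in VRAM simultaneously. Values above 1 only make sense if multiple models fit. |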

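Card connected to a display? A GPU also running a desktop environment or display output reserves 200 to 400 MB of VRAM for the compositor. On a 6 GB card this is significant. Ollama's documented setting for this is OLLAMA_GPU_OVERHEAD, which takes a byte count to leave unused on each GPU rather than a fraction. Setting it to, for example, 536870912 (512 MiB) keeps the model load from crowding out the display driver, which can otherwise cause system instability.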
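
Ollama's OLLAMA_GPU_OVERHEAD setting is expressed in bytes, so a reservation sized in MiB needs converting before it goes into the override file. The helper below is a hypothetical convenience, and the 512 MiB figure is an assumption chosen to sit just above the 200 to 400 MB compositor range mentioned above.

```shell
# Print a systemd Environment line reserving the given number of MiB per GPU.
gpu_overhead_line() {
  local mib=$1
  printf 'Environment="OLLAMA_GPU_OVERHEAD=%d"\n' $(( mib * 1024 * 1024 ))
}

gpu_overhead_line 512
```

Paste the printed line into the [Service] section of the override file, then reload systemd and restart the service.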

Layer offloading for VRAM-constrained cards

When a model exceeds available VRAM, Ollama distributes transformer layers across GPU and CPU. GPU layers run at VRAM bandwidth speed. CPU layers run at system RAM speed with added PCIe transfer overhead at every layer boundary. Ollama’s automatic split is silent and sometimes miscalculated, particularly when the display driver is consuming VRAM that Ollama cannot account for.

Manual control via the num_gpu model parameter gives you an explicit and reproducible configuration.

Check the automatic split first

# Load the model with a quick prompt
ollama run llama3.1:8b "hello"

# Inspect the GPU/CPU split
ollama ps

Any CPU percentage means partial offloading is active. Determine how many layers the model has:

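# The ollama show summary does not list the layer count;
# the API's model_info does, as llama.block_count
curl -fsS http://127.0.0.1:11434/api/show -d '{"model":"llama3.1:8b"}' | grep -o '"llama.block_count":[0-9]*'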

Manual control via Modelfile

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31-gpu <<'EOF'
FROM llama3.1:8b

# Load 28 of 32 layers on GPU, remainder handled by CPU
# Adjust based on how much VRAM remains after model load
PARAMETER num_gpu 28

PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        4096
PARAMETER repeat_penalty 1.1
EOF

# Named llama31-offload so the custom model created in section 11 does not overwrite it
ollama create llama31-offload -f ~/ollama-models/Modelfile.llama31-gpu
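
A starting value for num_gpu can be estimated before any trial and error. The sketch below is a rough approximation: it assumes every layer is the same size (the embedding and output layers are not), the function name layers_that_fit is hypothetical, and the budget is whatever VRAM you decide is left for weights after subtracting the KV cache and a safety margin.

```shell
# layers_that_fit WEIGHTS_MIB TOTAL_LAYERS BUDGET_MIB
# Integer estimate of how many equal-sized layers fit in the VRAM budget.
layers_that_fit() {
  local weights_mib=$1 total_layers=$2 budget_mib=$3
  local n=$(( budget_mib * total_layers / weights_mib ))
  if [ "$n" -gt "$total_layers" ]; then
    n=$total_layers
  fi
  echo "$n"
}

# ~4900 MiB of Q4_K_M weights, 32 layers, 4300 MiB left for weights (illustrative)
layers_that_fit 4900 32 4300
```

Treat the result as a first guess: load the model, check ollama ps and nvidia-smi, then move the value up or down a few layers.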

Manual control per request

Ollama's documented server environment variables do not include a global layer-count setting; num_gpu is a model parameter. To override it without creating a new model, pass it in the options of an API request:

# A value above the model's layer count (for example 999) requests full GPU loading
# Reduce in increments of 4 if VRAM is insufficient
curl -fsS http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 28 }
}'

Whenever you edit the override file from section 9, reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Creating a custom Modelfile

GPU inference opens up larger context windows without the system RAM penalty of CPU operation. The Modelfile below is tuned for a GPU setup with the context ceiling calibrated to a 12 GB card.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Adjust the FROM tag to match what you pulled in section 08
FROM llama3.1:8b-instruct-q8_0

SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When writing code, include comments explaining
non-obvious decisions. When uncertain, say so explicitly.
"""

PARAMETER temperature    0.7
PARAMETER top_p          0.9
# 16K context is viable on 12 GB+ VRAM with flash attention enabled
# Reduce to 4096 for 6 GB cards
PARAMETER num_ctx        16384
PARAMETER repeat_penalty 1.1
EOF

ollama create llama31-gpu -f ~/ollama-models/Modelfile.llama31

Testing the installation

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Example response:

active
{"version":"0.x.x"}

Test 2 -> Single-shot inference with timing

On GPU the first token should arrive in under 5 seconds. Generation should be clearly faster than CPU-only inference.

time ollama run llama3.1:8b "Explain what a transformer attention mechanism does in two sentences."

Expected result:

The response completes in under 10 seconds including model load time, with token generation at 30 to 80 tok/s depending on card and quantization.
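
To put a number on generation speed rather than judging it by eye, the non-streaming /api/generate response reports eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below only does the division; the function name tok_per_sec is hypothetical, and the commented pipeline assumes jq is installed, which this guide does not otherwise require.

```shell
# tok_per_sec EVAL_COUNT EVAL_DURATION_NS
# Tokens per second = tokens generated / generation time in seconds.
tok_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# Example with a live server (jq assumed installed):
#   curl -fsS http://127.0.0.1:11434/api/generate \
#     -d '{"model":"llama3.1:8b","prompt":"hello","stream":false}' |
#     jq -r '"\(.eval_count) \(.eval_duration)"' |
#     { read -r n ns; tok_per_sec "$n" "$ns"; }

tok_per_sec 120 2000000000   # 120 tokens in 2 seconds
```

Running ollama run with the --verbose flag prints a comparable eval rate without any extra tooling.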

Test 3 -> Confirm full GPU utilisation

This is the critical test. Run it immediately after inference while the model is still loaded:

ollama ps

Example response:

NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now

100% GPU confirms the model is fully resident in VRAM. Any CPU percentage indicates partial offloading – address it using section 10 before continuing.
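
The same check can be scripted for use in a health probe. This is a sketch under the assumption that the PROCESSOR column reads exactly 100% GPU for a fully resident model and shows a split such as 22%/78% CPU/GPU otherwise; the function name all_on_gpu is hypothetical, and an empty model list is deliberately treated as a failure.

```shell
# Reads `ollama ps` output on stdin. Succeeds only if at least one model
# is loaded and every loaded model reports "100% GPU".
all_on_gpu() {
  awk 'NR > 1 && NF { seen = 1; if ($0 !~ /100% GPU/) bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

if command -v ollama >/dev/null 2>&1; then
  if ollama ps 2>/dev/null | all_on_gpu; then
    echo "all loaded models are fully on GPU"
  else
    echo "no model loaded, or partial CPU offloading is active"
  fi
fi
```

Run it right after an inference request, while the model is still held in VRAM by OLLAMA_KEEP_ALIVE.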

Test 4 -> Live VRAM monitoring during inference

Open a second terminal and run this while inference is active:

watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

Example response:

NVIDIA GeForce RTX 3060, 6142 MiB, 5902 MiB, 97 %

High GPU utilisation (80 to 99 percent) during active generation confirms the GPU is doing the work. Utilisation drops to near zero once a response finishes, because the GPU is only busy while forward passes are running.
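
The CSV line from the watch command can be reduced to a single occupancy figure. The helper below is a hypothetical sketch: it treats used plus free as the total, which slightly understates the real total because the driver keeps a small reserved region.

```shell
# Reads "name, used MiB, free MiB, util %" lines on stdin and prints
# the share of VRAM currently in use for each GPU.
vram_used_pct() {
  awk -F', *' '{
    used = $2 + 0; free = $3 + 0
    if (used + free > 0)
      printf "%s: %.0f%% of VRAM in use\n", $1, 100 * used / (used + free)
  }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu \
    --format=csv,noheader 2>/dev/null | vram_used_pct
fi
```

A figure that stays close to 100 percent while ollama ps reports a CPU share is a sign the model or its context does not fit.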

Test 5 -> Multi-turn context retention

ollama run llama3.1:8b

Example conversation:

>>> My server has an RTX 3060 with 12 GB of VRAM. 
That gives you solid headroom for 8B models at Q8_0 and comfortable context windows. What are you planning to run on it?

>>> How much VRAM does my GPU have? 
Your RTX 3060 has 12 GB of VRAM.

The model recalling the VRAM figure from earlier in the session confirms multi-turn context is working. Exit with /bye or Ctrl+D.

Test 6 -> Custom model variant

ollama run llama31-gpu "Write a bash script that watches a directory and logs any new files created."

Expect clean, commented bash code generated at GPU speed, reflecting the system prompt instruction to explain non-obvious decisions.

All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and the custom variant produces well-commented output. The installation is complete.

Troubleshooting

Symptom: nvidia-smi not found after reboot
Likely cause: Driver did not load
Fix: Check dmesg | grep -i nvidia for kernel errors. Verify Nouveau is blacklisted: cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf. If the file is missing, run sudo update-initramfs -u and reboot.

Symptom: Ollama running but ollama ps shows CPU only
Likely cause: Ollama installed before the driver
Fix: Confirm nvidia-smi works, then reinstall Ollama: curl -fsSL https://ollama.com/install.sh | sh

Symptom: GPU visible in debug output but model runs on CPU
Likely cause: ollama system user missing from render group
Fix: sudo usermod -aG render ollama && sudo systemctl restart ollama

Symptom: ollama ps shows partial CPU offload
Likely cause: Model does not fully fit in available VRAM
Fix: Use a lower quantization, reduce num_ctx, or set the num_gpu parameter explicitly. See section 10.

Symptom: Token speed 5 to 15 tok/s despite GPU detected
Likely cause: Silent partial offloading
Fix: Check ollama ps. Any CPU percentage in the PROCESSOR column means offloading is active.

Symptom: VRAM exhaustion or crash mid-inference
Likely cause: Context window too large for available VRAM
Fix: Reduce num_ctx in Modelfile. Enable flash attention: OLLAMA_FLASH_ATTENTION=1.

Symptom: Display corruption or system freeze during inference
Likely cause: GPU also driving display; VRAM overcommitted
Fix: Reserve VRAM for the display by setting OLLAMA_GPU_OVERHEAD (a byte count, for example 536870912 for 512 MiB) in the systemd override.

Symptom: GPU not detected after kernel upgrade
Likely cause: DKMS module not rebuilt for new kernel
Fix: sudo apt install --reinstall nvidia-dkms-570 (replace with installed driver version). Then sudo reboot.

Symptom: Model pull fails or stalls
Likely cause: Network interruption
Fix: Re-run ollama pull with the same tag. Downloads resume from the last completed chunk.
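
Because an interrupted pull resumes when it is re-run, a stalled download on a flaky connection can be wrapped in a simple retry loop. The retry helper below is a generic sketch, not an Ollama feature; the attempt count and delay in the usage comment are arbitrary.

```shell
# retry TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs the command until it succeeds or TRIES attempts have been made.
retry() {
  local tries=$1 delay=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$tries" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: up to 5 attempts, 10 seconds apart
#   retry 5 10 ollama pull llama3.1:8b
```

The same helper also works for waiting on the API after a service restart, for example retry 10 1 curl -fsS http://127.0.0.1:11434/api/version.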

Diagnostic commands

# Full GPU status
nvidia-smi

# Live VRAM and utilisation monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

# Check Nouveau is not loaded (output should be empty)
lsmod | grep nouveau

# Kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20

# Ollama service logs
sudo journalctl -u ollama -n 50 --no-pager

# Force GPU detection debug output
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'

# Confirm group membership for current user
groups

# Check DKMS module build status
dkms status

# Check current Ollama environment settings
sudo systemctl cat ollama | grep Environment

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his secret Santa, go with a Star Wars Lego box!
