Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04 (NVIDIA GPU)

Prerequisites

Ubuntu 24.04 LTS, amd64
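NVIDIA GPU, Maxwell architecture or newer (GTX 750 Ti and GTX 900 series onwards). Ollama requires CUDA compute capability 5.0 or higher; Kepler cards (most of the GTX 600 and 700 series, compute capability 3.x) fall below that.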
At least 16 GB system RAM. The OS, Ollama process, and any KV cache overflow from a VRAM-constrained model all draw from system RAM.
Disk space: 10 GB for 8B models; 50 GB if pulling 70B.
sudo access and internet connectivity.

Table of Contents

01 Overview and architecture
02 The Llama 3.x family on GPU
03 VRAM tiers and model selection
04 Prerequisites
05 Installing NVIDIA drivers
06 Installing Ollama
07 Verifying GPU detection
08 Pulling Llama 3.1
09 Configuring Ollama
10 Layer offloading for VRAM-constrained cards
11 Creating a custom Modelfile
12 Testing the installation
13 Troubleshooting

Overview and architecture

This guide covers running Meta’s Llama 3.1 locally on Ubuntu 24.04 with NVIDIA GPU acceleration through Ollama. GPU inference is not a marginal improvement over CPU-only operation: a mid-range consumer card running Llama 3.1 8B at Q8_0 quantization produces 40 to 80 tokens per second, compared to 5 to 15 on CPU. That difference changes what the model is actually useful for.

Component	Role	Location
NVIDIA driver	Kernel module that exposes the GPU to userspace. Must be installed and loaded before Ollama is installed. Without it, Ollama falls back to CPU silently.	`/usr/lib/modules/`
CUDA runtime	Ollama bundles its own CUDA runtime. No separate CUDA toolkit installation is required or recommended.	Bundled inside the Ollama binary
Ollama binary	Downloads, manages, and serves models. Detects the NVIDIA driver at startup and offloads inference layers to VRAM automatically.	`/usr/local/bin/ollama`
Llama 3.1 weights	GGUF model file loaded into VRAM on first inference. Quantization and size vary by VRAM tier.	`/usr/share/ollama/.ollama/models/`

No CUDA toolkit needed. Installing the CUDA toolkit before Ollama is a common mistake. Ollama ships the CUDA runtime it needs internally. The only external dependency is the NVIDIA kernel driver. Installing unnecessary CUDA packages adds complexity and version conflict risk with no benefit.

The Llama 3.x family on GPU

The Llama 3.x generation spans three sub-releases with different size ranges. On GPU, the calculus shifts: VRAM replaces system RAM as the constraint, and larger models that would be impractical on CPU become viable.

Model tag	Parameters	VRAM (Q4_K_M)	GPU viable?
`llama3.2:1b`	1B	~1.3 GB	Yes. Fits any modern GPU. Use for edge cases or speed-critical pipelines.
`llama3.2:3b`	3B	~2.0 GB	Yes. Excellent on 6 GB cards when you want maximum speed over maximum quality.
`llama3.1:8b`	8B	~4.9 GB	Yes. Primary recommendation for 6 GB+ cards. Best quality-to-VRAM ratio in the family.
`llama3.3:70b`	70B	~43 GB	Partial. Requires 24 GB VRAM with layer offloading, or two 24 GB cards for full GPU operation.

Why no Llama 3.2 8B or 3.3 8B? Meta did not release an 8B model in the Llama 3.2 generation, which focused on small (1B, 3B) and vision-capable variants. Llama 3.3 skipped to 70B. The 8B slot belongs exclusively to Llama 3.1, which remains actively maintained and is the correct choice for this size class.

VRAM tiers and model selection

Model weights must fit in VRAM for full GPU acceleration. When they do not, Ollama silently splits transformer layers between GPU and CPU. The split is functional but significantly slower than pure GPU operation due to PCIe data transfer overhead on every forward pass.

VRAM	Example cards	Recommended model	Notes
6 GB	RTX 2060, GTX 1660 Ti	`llama3.1:8b` (Q4_K_M)	Fits with roughly 0.5 GB headroom at a 4096-token context. Keep `num_ctx` at 4096 or lower to control KV cache size.
8 GB	RTX 3070, RTX 4060, RTX 4060 Ti 8GB	`llama3.1:8b` (Q5_K_M)	Higher quality quantization with comfortable headroom. Context up to 8192 tokens viable.
12 GB	RTX 3060 12GB, RTX 3080 12GB, RTX 4070	`llama3.1:8b` (Q8_0)	Near full-precision quality. Context windows up to 16K viable with flash attention enabled.
16 GB	RTX 4080, RTX 4060 Ti 16GB	`llama3.1:8b` (Q8_0)	Full quality 8B with generous KV cache headroom. Extended context sessions are comfortable.
24 GB	RTX 3090, RTX 3090 Ti, RTX 4090	`llama3.3:70b` (Q4_K_M)	70B with layer offloading to system RAM. Significantly faster than CPU-only 70B despite the split. Or run 8B at Q8_0 with a very large context window.
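
The tier selection above can be sketched as a small lookup. This is only an illustration: the function name `recommend_tag` and the MiB thresholds are assumptions (set a little below the nominal sizes, since cards report slightly less than their marketed capacity), and the fallback to `llama3.2:3b` below 6 GB is an extrapolation from the model table, not a tested recommendation.

```shell
# Hypothetical helper: map total VRAM in MiB to the tag suggested by the tiers.
# Thresholds are illustrative and sit slightly below nominal card sizes.
recommend_tag() {
    vram_mib=$1
    if   [ "$vram_mib" -ge 23000 ]; then echo "llama3.3:70b"
    elif [ "$vram_mib" -ge 11000 ]; then echo "llama3.1:8b-instruct-q8_0"
    elif [ "$vram_mib" -ge 7500 ];  then echo "llama3.1:8b-instruct-q5_K_M"
    elif [ "$vram_mib" -ge 5500 ];  then echo "llama3.1:8b"
    else echo "llama3.2:3b"
    fi
}

# Usage once the driver is installed (first GPU only):
#   recommend_tag "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)"
```

Treat the result as a starting point and confirm the actual fit with ollama ps after loading the model.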

VRAM requirements for Llama 3.1 8B across quantizations:

Model tag	VRAM (weights)	+ 8K KV cache (f16)	Total VRAM needed
`llama3.1:8b` (Q4_K_M)	~4.9 GB	~1.0 GB	~5.9 GB
`llama3.1:8b-instruct-q5_K_M`	~5.7 GB	~1.0 GB	~6.7 GB
`llama3.1:8b-instruct-q8_0`	~8.5 GB	~1.0 GB	~9.5 GB
`llama3.3:70b` (Q4_K_M)	~43 GB	~2.5 GB	~45.5 GB
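
The KV cache column can be estimated with simple arithmetic. The sketch below assumes the usual grouped-query attention layout, where the cache holds one key and one value vector per layer, per KV head, per token. The function name `kv_cache_mib` is hypothetical, and the architecture figures for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are stated as assumptions rather than read from the model file.

```shell
# Estimate KV cache size in MiB:
#   2 (K and V) x layers x kv_heads x head_dim x context x bytes_per_element
kv_cache_mib() {
    layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=$5
    echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1048576 ))
}

# Llama 3.1 8B (assumed: 32 layers, 8 KV heads, head dim 128),
# 8K context, 2 bytes per element (16-bit cache)
kv_cache_mib 32 8 128 8192 2    # prints 1024
```

Passing 1 as the last argument models an 8-bit cache and halves the result to 512 MiB. The estimate covers the cache only; compute buffers add to it.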

Silent partial offloading

When a model does not fully fit in VRAM, Ollama offloads the remainder to system RAM with no terminal warning. You can appear to be running GPU inference while a significant portion of the workload runs on CPU. Token speeds below 10 tok/s on a modern consumer card are the clearest signal. The only definitive check is ollama ps, covered in the testing section.

Confirm the GPU is visible on the PCI bus

lspci | grep -i 'nvidia\|vga'

If the card does not appear, the slot may be disabled in BIOS or the card is not seated correctly. Driver installation cannot recover from a GPU the kernel cannot see.

Install the base packages used in this guide

sudo apt update
sudo apt install -y curl zstd ubuntu-drivers-common

Installing NVIDIA drivers

Ubuntu’s ubuntu-drivers tool selects and installs the recommended proprietary driver from Ubuntu’s repositories. It handles kernel module signing, Nouveau blacklisting, and initramfs updates automatically, making it the most reliable path on Ubuntu 24.04.

Secure Boot

If Secure Boot is enabled in BIOS, the NVIDIA kernel module must be signed with a Machine Owner Key (MOK) to load. The installer will prompt you to set a MOK password. You must then enrol that key at the blue MOK manager screen on the next reboot. Skipping the enrolment step results in the driver being installed but failing to load. On a dedicated inference server where Secure Boot is not a requirement, disabling it in BIOS is the fastest resolution.
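
Before installing the driver it can help to know whether the MOK prompt is coming. A minimal sketch, assuming the mokutil package is present (it usually is on Ubuntu systems booted with shim) and that `mokutil --sb-state` prints a line starting with "SecureBoot enabled" or "SecureBoot disabled". The function name `secure_boot_enabled` is hypothetical.

```shell
# Reads the output of `mokutil --sb-state` on stdin.
# Returns 0 if Secure Boot is reported as enabled, 1 otherwise.
secure_boot_enabled() {
    grep -qi '^SecureBoot enabled'
}

# Usage:
#   if mokutil --sb-state | secure_boot_enabled; then
#       echo "Secure Boot is on: expect the MOK enrolment prompt"
#   fi
```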

Check what the tool recommends before installing

ubuntu-drivers devices

Install the recommended driver and reboot

sudo ubuntu-drivers install
sudo reboot

Verify the driver loaded after reboot

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.xx.xx Driver Version: 570.xx.xx CUDA Version: 12.x           |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M        | Bus-Id Disp.A        | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap   | Memory-Usage         | GPU-Util Compute M.  |
|===============================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:01:00.0 Off | N/A                  |
| 30% 38C P8 12W / 170W         | 0MiB / 12288MiB      | 0% Default           |
+-----------------------------------------------------------------------------+

Confirm the driver version line is present, CUDA Version is populated, and the card’s VRAM total appears under Memory-Usage. If nvidia-smi returns command not found or an error, the driver did not load. Check dmesg | grep -i nvidia for kernel-level errors.
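
With the driver loaded, the compute capability requirement from the prerequisites can be checked directly. This is a sketch under one assumption: the `compute_cap` query field is only available in reasonably recent nvidia-smi releases, so older drivers may reject it. The function name `cc_supported` is hypothetical.

```shell
# Returns 0 if the given compute capability (e.g. "8.6") is at least 5.0.
cc_supported() {
    awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 5.0) }'
}

# Usage (first GPU only):
#   cc_supported "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)" \
#       && echo "compute capability OK"
```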

Nouveau conflict

The proprietary NVIDIA driver and the open-source Nouveau driver cannot coexist. The installer blacklists Nouveau automatically. If the NVIDIA driver fails to load after reboot, verify the blacklist file exists at /etc/modprobe.d/blacklist-nvidia-nouveau.conf, then run sudo update-initramfs -u and reboot again.

Installing Ollama

The Ollama installation process is identical to the CPU setup. The key difference is that Ollama detects the NVIDIA driver during installation and configures GPU acceleration automatically. This is why the driver must be installed and confirmed working before this step.

Order matters

Installing Ollama before the NVIDIA driver means Ollama configures itself for CPU-only operation. Even if you install the driver afterwards, Ollama may not pick up GPU acceleration correctly without reinstalling. Always confirm nvidia-smi works before running the Ollama installer.

curl -fsSL https://ollama.com/install.sh | sh

The GPU detection line in the installer output confirms Ollama found the driver. If it is absent, revisit section 05 (Installing NVIDIA drivers). Then confirm the binary is installed and on the PATH:

ollama --version

Verifying GPU detection

Confirm the service is running, then use debug output to verify Ollama can see and enumerate the GPU before pulling any model.

sudo systemctl status ollama

● ollama.service - Ollama Service 
    Loaded: loaded (/etc/systemd/system/ollama.service; enabled) 
    Active: active (running) since ...

Temporarily stop the service and run with debug logging to inspect GPU enumeration:

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram' | head -20

Press Ctrl+C to stop the debug instance, then restart the service:

sudo systemctl start ollama
curl -fsS http://127.0.0.1:11434/api/version

No GPU lines in debug output? If the debug grep returns nothing, Ollama is not detecting the driver. Common causes: Nouveau is still loaded (lsmod | grep nouveau), the ollama system user is not in the render group, or the kernel module loaded with errors (dmesg | grep -i nvidia). Each of these is addressed in the troubleshooting section.

Pulling Llama 3.1

Pull the model matching your VRAM tier from section 03 (VRAM tiers and model selection). Use the exact tag for non-default quantizations; for the tags used here, omitting the quantization suffix fetches Q4_K_M.

6 GB VRAM – Q4_K_M (default)

ollama pull llama3.1:8b

8 GB VRAM – Q5_K_M

ollama pull llama3.1:8b-instruct-q5_K_M

12 GB+ VRAM – Q8_0

ollama pull llama3.1:8b-instruct-q8_0

24 GB VRAM – 70B

ollama pull llama3.3:70b

70B on a single 24 GB card

Llama 3.3 70B at Q4_K_M requires approximately 43 GB of VRAM for weights alone. A single 24 GB card cannot hold it entirely. Ollama will offload the excess layers to system RAM automatically. Inference is still substantially faster than CPU-only operation, but it is not full GPU inference. Section 10 covers how to control and monitor the layer split.

Confirm the model is registered:

ollama list
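
In a provisioning script the same check can be made non-interactive. The sketch below assumes `ollama list` prints a header row followed by one row per model with the tag in the first column. The function name `has_model` is hypothetical.

```shell
# Reads `ollama list` output on stdin; succeeds if the exact tag is present.
has_model() {
    awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Usage:
#   ollama list | has_model llama3.1:8b && echo "model present"
```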

Configuring Ollama

GPU-specific settings are applied through environment variables in a systemd override file. Do not edit the main service unit directly; override files survive package upgrades.

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep model loaded between requests
Environment="OLLAMA_KEEP_ALIVE=10m"

# Restrict API to localhost (default, but explicit)
Environment="OLLAMA_HOST=127.0.0.1:11434"

# Flash attention lowers attention memory use at long contexts (Turing and newer)
Environment="OLLAMA_FLASH_ATTENTION=1"

# One model in VRAM at a time
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
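
After the restart it is worth confirming that systemd actually picked up the override. A minimal sketch, assuming `systemctl show ollama --property=Environment` prints a single `Environment=` line with space-separated `KEY=value` pairs. The function name `unit_env_value` is hypothetical and it only handles values without spaces.

```shell
# Reads `systemctl show ollama --property=Environment` output on stdin
# and prints the value of one variable (simple values without spaces only).
unit_env_value() {
    sed 's/^Environment=//' | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Usage:
#   systemctl show ollama --property=Environment | unit_env_value OLLAMA_FLASH_ATTENTION
```

An empty result means the variable is not set on the running unit, which usually points to a missed daemon-reload or a typo in the override file.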

GPU-relevant environment variables

Variable	Default	Purpose
`OLLAMA_FLASH_ATTENTION`	`0`	Enables Flash Attention, which lowers the memory attention needs as the context grows, on Turing (RTX 20xx) and newer. It is also required for a quantized KV cache (`OLLAMA_KV_CACHE_TYPE=q8_0`), which roughly halves KV cache VRAM. Enable on any supported card.
`num_gpu` (model parameter, not an environment variable)	All layers that fit	Number of transformer layers to load on GPU. Controls the GPU/CPU split for partial offloading. Set it with `PARAMETER num_gpu` in a Modelfile or in the API `options` object. See section 10.
`OLLAMA_GPU_OVERHEAD`	`0`	Bytes of VRAM per GPU that Ollama leaves unused. Set to roughly `536870912` (512 MiB) on cards also driving a display to reserve headroom for the compositor.
`CUDA_VISIBLE_DEVICES`	All GPUs	Comma-separated GPU indices Ollama is allowed to use. On single-GPU systems this is unnecessary.
`OLLAMA_MAX_LOADED_MODELS`	3 per GPU	Models held in memory simultaneously, provided they fit. The override above sets it to `1` so a single model owns the VRAM.

Card connected to a display? A GPU also running a desktop environment or display output reserves 200 to 400 MB of VRAM for the compositor. On a 6 GB card this is significant. Set OLLAMA_GPU_OVERHEAD to the number of bytes to hold back (for example 536870912 for 512 MiB) so the model load does not crowd out the display driver, which causes system instability.

Layer offloading for VRAM-constrained cards

When a model exceeds available VRAM, Ollama distributes transformer layers across GPU and CPU. GPU layers run at VRAM bandwidth speed. CPU layers run at system RAM speed with added PCIe transfer overhead at every layer boundary. Ollama’s automatic split is silent and sometimes miscalculated, particularly when the display driver is consuming VRAM that Ollama cannot account for.

Manual control via the num_gpu parameter gives you an explicit and reproducible configuration.

Check the automatic split first

# Load the model with a quick prompt
ollama run llama3.1:8b "hello"

# Inspect the GPU/CPU split
ollama ps

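Any CPU percentage means partial offloading is active. Determine how many layers the model has; the model metadata reports it as llama.block_count:

curl -fsS http://127.0.0.1:11434/api/show -d '{"model": "llama3.1:8b"}' | grep -o '"llama.block_count":[0-9]*'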
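
Once the layer count is known, a first guess for the number of GPU layers can be computed rather than found by trial and error. The sketch below assumes every layer is the same size, which is only approximately true, and takes a VRAM budget that already excludes the KV cache and any display overhead. The function name `layers_that_fit` is hypothetical.

```shell
# Estimate how many layers fit on the GPU, assuming equally sized layers.
#   $1 total layers, $2 weight size in MiB, $3 VRAM budget for weights in MiB
layers_that_fit() {
    awk -v n="$1" -v w="$2" -v v="$3" 'BEGIN {
        fit = int(v / (w / n))
        if (fit > n) fit = n      # never report more layers than the model has
        print fit
    }'
}

# 32 layers, ~4900 MiB of weights, ~4300 MiB of VRAM left for them
layers_that_fit 32 4900 4300    # prints 28
```

Use the result as the starting value for num_gpu and verify the split with ollama ps.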

Manual control via Modelfile

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31-gpu <<'EOF'
FROM llama3.1:8b

# Load 28 of 32 layers on GPU, remainder handled by CPU
# Adjust based on how much VRAM remains after model load
PARAMETER num_gpu 28

PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        4096
PARAMETER repeat_penalty 1.1
EOF

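# Named llama31-offload so it does not collide with the llama31-gpu model created in section 11
ollama create llama31-offload -f ~/ollama-models/Modelfile.llama31-gpu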

Manual control per request

Ollama's documented server settings do not include a global layer-count variable; num_gpu is a model option. Besides a Modelfile, it can be passed per request through the API options object:

curl -fsS http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "hello",
  "stream": false,
  "options": { "num_gpu": 28 }
}'

Reduce the value in increments of 4 if VRAM is insufficient. Changes to the systemd override file, such as OLLAMA_FLASH_ATTENTION, still require a reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Partial offload performance penalty

Performance does not degrade linearly with the number of CPU layers. Even 4 CPU layers out of 32 can reduce throughput by 30 to 50 percent because every forward pass requires a synchronisation point at the GPU/CPU boundary. If full GPU fit is not achievable, maximise the number of layers on GPU rather than targeting an even split.

Creating a custom Modelfile

GPU inference opens up larger context windows without the system RAM penalty of CPU operation. The Modelfile below is tuned for a GPU setup with the context ceiling calibrated to a 12 GB card.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Adjust the FROM tag to match what you pulled in section 08
FROM llama3.1:8b-instruct-q8_0

SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When writing code, include comments explaining
non-obvious decisions. When uncertain, say so explicitly.
"""

PARAMETER temperature    0.7
PARAMETER top_p          0.9
# 16K context is viable on 12 GB+ VRAM with flash attention enabled
# Reduce to 4096 for 6 GB cards
PARAMETER num_ctx        16384
PARAMETER repeat_penalty 1.1
EOF

ollama create llama31-gpu -f ~/ollama-models/Modelfile.llama31

Context window and VRAM on GPU

On GPU, the KV cache for the context window lives in VRAM. With the default 16-bit cache, a num_ctx of 16384 consumes approximately 2 GB of VRAM on Llama 3.1 8B (128 KiB per token). Flash attention lowers the additional memory attention needs as the context grows, and it is the prerequisite for a quantized cache: OLLAMA_KV_CACHE_TYPE=q8_0 roughly halves the cache to about 1 GB at 16K. On 6 GB cards, keep num_ctx at 4096 and enable OLLAMA_FLASH_ATTENTION=1 in the systemd override.
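
To pick num_ctx from the VRAM left over after the weights are loaded, the cache arithmetic can be inverted. This is a rough sketch: the function name `max_ctx_tokens` is hypothetical, the architecture figures for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are assumptions, and compute buffers are ignored, so leave some margin below the result.

```shell
# Largest context (in tokens) whose KV cache fits in a VRAM budget.
#   $1 budget in MiB, $2 layers, $3 KV heads, $4 head dim, $5 bytes per element
max_ctx_tokens() {
    budget_mib=$1 layers=$2 kv_heads=$3 head_dim=$4 bytes=$5
    echo $(( budget_mib * 1048576 / (2 * layers * kv_heads * head_dim * bytes) ))
}

# 2 GiB budget, 16-bit cache
max_ctx_tokens 2048 32 8 128 2    # prints 16384
```

With an 8-bit cache (last argument 1) the same 16384 tokens fit in a 1024 MiB budget.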

Testing the installation

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

Example response:

active
{"version":"0.x.x"}

Test 2 -> Single-shot inference with timing

On GPU the first token should arrive in under 5 seconds. Generation should be clearly faster than CPU-only inference.

time ollama run llama3.1:8b "Explain what a transformer attention mechanism does in two sentences."

Expected result:

The response completes in under 10 seconds including model load time, with token generation at 30 to 80 tok/s depending on card and quantization.
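
To put a number on the token rate instead of judging it by eye, the timing fields of a non-streaming API response can be used. The sketch below assumes the `/api/generate` response carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds) as a compact JSON object. The function name `eval_rate` is hypothetical, and the grep-based extraction is a convenience, not a real JSON parser.

```shell
# Reads a non-streaming /api/generate JSON response on stdin and prints tokens/s.
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
eval_rate() {
    json=$(cat)
    count=$(printf '%s' "$json" | grep -o '"eval_count":[[:space:]]*[0-9]*' | sed 's/.*://; s/[[:space:]]//g')
    nanos=$(printf '%s' "$json" | grep -o '"eval_duration":[[:space:]]*[0-9]*' | sed 's/.*://; s/[[:space:]]//g')
    awk -v c="$count" -v n="$nanos" 'BEGIN { if (n > 0) printf "%.1f\n", c / (n / 1e9) }'
}

# Usage:
#   curl -fsS http://127.0.0.1:11434/api/generate \
#     -d '{"model":"llama3.1:8b","prompt":"hello","stream":false}' | eval_rate
```

For interactive use, ollama run with the --verbose flag prints comparable timing statistics after each response.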

Test 3 -> Confirm full GPU utilisation

This is the critical test. Run it immediately after inference while the model is still loaded:

ollama ps

Example response:

NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    5.8 GB    100% GPU     9 minutes from now

100% GPU confirms the model is fully resident in VRAM. Any CPU percentage indicates partial offloading – address it using section 10 before continuing.
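
The same check can be automated, for example in a health script. A minimal sketch, assuming `ollama ps` prints a header row and shows "100% GPU" in the PROCESSOR column for fully resident models (split models show a combined CPU/GPU percentage instead). The function name `not_fully_on_gpu` is hypothetical.

```shell
# Reads `ollama ps` output on stdin; prints the name of every loaded model
# whose PROCESSOR column is anything other than "100% GPU".
not_fully_on_gpu() {
    awk 'NR > 1 && NF > 0 && $0 !~ /100% GPU/ { print $1 }'
}

# Usage:
#   ollama ps | not_fully_on_gpu
```

No output means every loaded model is fully resident in VRAM.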

Test 4 -> Live VRAM monitoring during inference

Open a second terminal and run this while inference is active:

watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

Example response:

NVIDIA GeForce RTX 3060, 6142 MiB, 6146 MiB, 97 %

High GPU utilisation (80 to 99 percent) during active generation confirms the GPU is doing the work. Utilisation drops to near zero once generation finishes, while memory.used stays high for as long as the model remains loaded.
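
For logging or alerting, the CSV line from the query above can be reduced to a single VRAM usage figure. A small sketch, assuming the four-column format used in this test (name, memory.used, memory.free, utilization.gpu). The function name `vram_used_pct` is hypothetical, and it measures used against used plus free, which can differ slightly from the card's nominal total.

```shell
# Reads one CSV line per GPU (name, memory.used, memory.free, utilization.gpu)
# and prints the share of VRAM in use as a whole percentage.
vram_used_pct() {
    awk -F', ' 'NF >= 3 { used = $2 + 0; free = $3 + 0
                          if (used + free > 0) printf "%.0f\n", used * 100 / (used + free) }'
}

# Usage:
#   nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu \
#       --format=csv,noheader | vram_used_pct
```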

Test 5 -> Multi-turn context retention

ollama run llama3.1:8b

Example conversation:

>>> My server has an RTX 3060 with 12 GB of VRAM. 
That gives you solid headroom for 8B models at Q8_0 and comfortable context windows. What are you planning to run on it?

>>> How much VRAM does my GPU have? 
Your RTX 3060 has 12 GB of VRAM.

The model recalling the VRAM figure from earlier in the session confirms multi-turn context is working. Exit with /bye or Ctrl+D.

Test 6 -> Custom model variant

ollama run llama31-gpu "Write a bash script that watches a directory and logs any new files created."

Expect clean, commented bash code generated at GPU speed, reflecting the system prompt instruction to explain non-obvious decisions.

All tests passing? Service active, model listed, ollama ps shows 100% GPU, nvidia-smi shows high utilisation during inference, multi-turn context works, and the custom variant produces well-commented output. The installation is complete.

Troubleshooting

Symptom	Likely cause	Fix
`nvidia-smi` not found after reboot	Driver not installed or failed to load	Check `dmesg | grep -i nvidia` for kernel errors. Verify Nouveau is blacklisted: `cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf`. If the file is missing, re-run `sudo ubuntu-drivers install`, then run `sudo update-initramfs -u` and reboot.
Ollama running but `ollama ps` shows CPU only	Ollama installed before the driver	Confirm `nvidia-smi` works, then reinstall Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
GPU visible in debug output but model runs on CPU	`ollama` system user missing from `render` group	`sudo usermod -aG render ollama && sudo systemctl restart ollama`
`ollama ps` shows partial CPU offload	Model does not fully fit in available VRAM	Use a smaller quantization, reduce `num_ctx`, or set `num_gpu` explicitly in a Modelfile. See section 10.
Token speed 5 to 15 tok/s despite GPU detected	Silent partial offloading	Check `ollama ps`. Any CPU percentage in the PROCESSOR column means offloading is active.
VRAM exhaustion or crash mid-inference	Context window too large for available VRAM	Reduce `num_ctx` in the Modelfile. Enable flash attention with `OLLAMA_FLASH_ATTENTION=1` and, if that is not enough, a quantized KV cache with `OLLAMA_KV_CACHE_TYPE=q8_0`.
Display corruption or system freeze during inference	GPU also driving display; VRAM overcommitted	Reserve VRAM for the display with `OLLAMA_GPU_OVERHEAD` (in bytes, for example `536870912` for 512 MiB) in the systemd override.
GPU not detected after kernel upgrade	DKMS module not rebuilt for new kernel	`sudo apt install --reinstall nvidia-dkms-570` (replace with installed driver version). Then `sudo reboot`.
Model pull fails or stalls	Network interruption	Re-run `ollama pull` with the same tag. Downloads resume from the last completed chunk.

Diagnostic commands

# Full GPU status
nvidia-smi

# Live VRAM and utilisation monitoring
watch -n1 'nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu --format=csv,noheader'

# Check Nouveau is not loaded (output should be empty)
lsmod | grep nouveau

# Kernel messages for NVIDIA driver errors
sudo dmesg | grep -i nvidia | tail -20

# Ollama service logs
sudo journalctl -u ollama -n 50 --no-pager

# Force GPU detection debug output (stop the service first: sudo systemctl stop ollama)
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i 'cuda\|gpu\|vram'

# Confirm group membership for the ollama service user (expect render and video)
id ollama

# Check DKMS module build status
dkms status

# Check current Ollama environment settings
sudo systemctl cat ollama | grep Environment
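
The DKMS check lends itself to a scripted test after kernel upgrades. The sketch below assumes the driver was installed as a DKMS package (Ubuntu can also ship pre-built signed modules, in which case dkms status shows nothing for nvidia) and that each status line names the module, the kernel release, and the state "installed". The function name `nvidia_dkms_built_for` is hypothetical.

```shell
# Reads `dkms status` output on stdin; succeeds if an nvidia module is
# reported as installed for the given kernel release.
nvidia_dkms_built_for() {
    grep -i 'nvidia' | grep -F "$1" | grep -q 'installed'
}

# Usage:
#   dkms status | nvidia_dkms_built_for "$(uname -r)" || echo "module missing for running kernel"
```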

Editor’s note

If your hardware limits you to CPU-only inference, this article may be of assistance.

Ivan Dabić
