Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04

Stack

Prerequisites and hardware requirements

Ubuntu 24.04 LTS, amd64 or arm64
16 GB RAM minimum for comfortable CPU inference (8 GB is the hard floor, but expect swapping)
4-core CPU with AVX2 support (Intel Haswell 2013+, AMD Ryzen any generation)
10 GB free disk space (model is ~4.9 GB; allow headroom for OS and temp files)
sudo access and internet connectivity

Overview and architecture
The Llama 3.x model family
Prerequisites and hardware requirements
Installing Ollama on Ubuntu
Verifying the service
Pulling Llama 3.1 8B
Configuring Ollama
Creating a custom Modelfile
Testing the installation
Troubleshooting

Overview and architecture

This guide explains how to run Llama on Ubuntu, specifically, it covers running Meta’s Llama 3.1 8B locally on Ubuntu 24.04 using Ollama as the runtime and management layer. The entire stack runs on CPU with no GPU or proprietary drivers required.

Ollama handles model download, storage, quantization selection, and serving. Llama 3.1 provides the inference capabilities. Understanding what each layer does helps when diagnosing problems:

Component	Role	Installed to
Ollama binary	CLI and server process. Downloads GGUF model files, serves the REST API on `127.0.0.1:11434`, and manages model lifecycle (load, unload, keep-alive). Built on top of llama.cpp.	`/usr/local/bin/ollama`
Ollama systemd service	Keeps the Ollama server process running across reboots. Created automatically by the installer.	`/etc/systemd/system/ollama.service`
Llama 3.1 8B weights	GGUF-format model file at Q4_K_M quantization. ~4.9 GB on disk. Loaded into RAM on first inference request.	`/usr/share/ollama/.ollama/models/`

What is llama.cpp? Ollama uses llama.cpp under the hood as the inference engine. llama.cpp is a C++ implementation of the Llama model architecture optimised for CPU inference, including AVX2/AVX-512 vectorisation and NUMA-aware memory layout. You do not need to interact with llama.cpp directly (Ollama abstracts it entirely), but it is why CPU-only inference is viable at all.

The Llama 3.x model family

Meta has released three generations of Llama 3 models. Not every size exists in every generation, which causes confusion when choosing a tag. The table below covers the variants available in Ollama and their CPU-only viability:

Model tag	Parameters	Disk (Q4_K_M)	Min RAM	CPU-only viable?
`llama3.2:3b`	3B	~2.0 GB	4 GB	Yes – fast, lightweight, edge/dev use
`llama3.1:8b`	8B	~4.9 GB	16 GB	Yes – recommended default
`llama3.3:70b`	70B	~43 GB	64 GB+	No – impractically slow on CPU

There is no Llama 3.2 8B variant and no Llama 3.3 8B. The 8B slot in the Llama 3.x family belongs exclusively to Llama 3.1. This guide uses llama3.1:8b throughout. Models differences:

llama3.1:8b (This guide’s version)
General chat, instruction following, tool use, 128K context. Multilingual. Best balance of quality and CPU performance.
~4.9 GB · 16 GB RAM recommended

llama3.2:3b
Lighter alternative when RAM is constrained or response speed is the priority over quality. Supports 8 languages.
~2.0 GB · 4 GB RAM minimum

llama3.3:70b
Flagship quality model. Practical only on systems with 64 GB+ RAM and even then inference is slow without a GPU.
~43 GB · GPU strongly recommended

Why not llama3.2 or llama3.3 for the 8B slot? At the time of writing this, Meta did not release an 8B model in the 3.2 generation. That generation focused on small (1B, 3B) and vision variants. Llama 3.3 jumped straight to 70B. Llama 3.1 8B therefore remains the current 8B-class Llama model in the Ollama library, and it is actively maintained.

Verify AVX2 support

Ollama’s CPU inference path benefits significantly from AVX2. Confirm it before continuing:

grep -o 'avx2' /proc/cpuinfo | head -1

No output means AVX2 is absent. Ollama will still run but inference will be noticeably slower. If your CPU predates 2013 or is a low-power Atom/Celeron variant, AVX2 may be missing.

Install prerequisites

sudo apt update
sudo apt install -y curl zstd

Installing Ollama

The official installer script handles everything: binary placement, system user creation, and systemd service registration.

Inspect before running To review the installer before executing it: curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then sh install.sh when ready.

curl -fsSL https://ollama.com/install.sh | sh

The installer prints its actions as it runs.

Confirm the binary is reachable:

ollama --version

What the installer creates A system user named ollama (no login shell) owns the service process. When running as a service, model files are stored under /usr/share/ollama/.ollama/models/. The installer adds your current user to the ollama group so you can run CLI commands without sudo. Log out and back in if the group membership does not take effect immediately.

Verifying the service

Check that the systemd unit is active and enabled to start on boot:

sudo systemctl status ollama

● ollama.service - Ollama Service 
      Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled) 
      Active: active (running) since ... 
    Main PID: 12345 (ollama) 
       Tasks: 9 
      Memory: 30.2M 
         CPU: 210ms 
      CGroup: /system.slice/ollama.service 
              └─12345 /usr/local/bin/ollama serve

Verify the REST API is responding:

curl -fsS http://127.0.0.1:11434/api/version

If either check fails, restart and inspect the logs:

sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager

Pulling Llama 3.1 8B

The pull command downloads the model from the Ollama library and caches it locally. The default tag for llama3.1:8b fetches Q4_K_M quantization, which is the right choice for CPU-only inference: it halves the size compared to full precision while keeping quality loss negligible for most tasks.

ollama pull llama3.1:8b

The download is approximately 4.9 GB. Progress is shown inline and resumes automatically if interrupted.

Confirm the model is registered locally:

ollama list

Quantization options

If you need a different quantization than the default Q4_K_M, pull a specific tag. Useful in two scenarios: you have extra RAM and want higher quality (Q8_0), or you are severely RAM-constrained and need a smaller footprint (Q2_K, though quality degrades noticeably).

# Higher quality, higher RAM usage (~8.5 GB on disk)
ollama pull llama3.1:8b-instruct-q8_0

# Default - recommended for most CPU-only setups
ollama pull llama3.1:8b

# Smaller footprint (~2.7 GB), visible quality trade-off
ollama pull llama3.1:8b-instruct-q2_K

Stick with Q4_K_M For CPU-only inference, Q4_K_M is the right default. The quality difference between Q4_K_M and Q8_0 is small enough that most real-world tasks will not reveal it, but Q8_0 uses nearly double the RAM. Only move to Q8_0 if you have 32+ GB RAM and a specific quality-sensitive workload.

Configuring Ollama

Ollama is configured through environment variables. The correct place to set them on a systemd installation is a drop-in override file, not your shell profile. The shell profile only affects interactive sessions, not the background service process.

Key environment variables

Variable	Default	Purpose
`OLLAMA_HOST`	`127.0.0.1:11434`	Bind address for the REST API
`OLLAMA_MODELS`	`/usr/share/ollama/.ollama/models`	Directory where downloaded model files are stored
`OLLAMA_NUM_PARALLEL`	`1`	Concurrent inference requests; keep at 1 for CPU-only to avoid RAM contention
`OLLAMA_MAX_LOADED_MODELS`	`1`	Models held in memory simultaneously
`OLLAMA_KEEP_ALIVE`	`5m`	How long to keep a model loaded after the last request before unloading it
`OLLAMA_NUM_THREADS`	All available cores	CPU threads used for inference. Defaults to all cores; tuning this per your workload can improve throughput

Applying configuration via systemd override

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model warm between requests (reduces reload latency)
Environment="OLLAMA_KEEP_ALIVE=10m"

# Restrict to localhost - remove or change for network access
Environment="OLLAMA_HOST=127.0.0.1:11434"

# One model in memory at a time - appropriate for CPU-only systems
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

Reload and restart to apply:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Tuning CPU thread count

By default Ollama uses all available cores. On systems running other services alongside Ollama, leaving one or two cores free prevents the inference load from starving the OS and other processes:

# Check how many cores are available
nproc

# Add to your override.conf if you want to reserve cores for other workloads
# Replace 6 with nproc minus 2
Environment="OLLAMA_NUM_THREADS=6"

Network exposure The default bind address 127.0.0.1 means Ollama is not reachable from other machines on the network. If you need remote access, set OLLAMA_HOST=0.0.0.0:11434 and enforce access control at the firewall level with ufw. Ollama has no built-in authentication, so make sure to never expose port 11434 directly to the public internet.

Creating a custom Modelfile

A Modelfile lets you bake a persistent system prompt and inference parameters into a named model variant. This is useful when you want consistent behaviour across sessions without repeating yourself in every prompt.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Base model
FROM llama3.1:8b

# System prompt applied to every conversation
SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When providing code, include comments that explain
non-obvious choices. When unsure, say so explicitly rather than guessing.
"""

# Inference parameters
PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        8192
PARAMETER repeat_penalty 1.1
EOF

Build the custom variant:

ollama create llama31-custom -f ~/ollama-models/Modelfile.llama31

ollama list

    NAME ID SIZE MODIFIED llama31-custom:latest a1b2c3d4e5f6 4.9 GB X seconds ago llama3.1:8b b2c3d4e5f6a1 4.9 GB Y minutes ago

Key Modelfile parameters

Parameter	Typical range	Effect
`temperature`	0.0 – 1.0	Higher values increase output variety; lower values make responses more deterministic. 0.7 is a solid default for general use.
`top_p`	0.7 – 0.95	Nucleus sampling threshold. Restricts token selection to the top probability mass. Lower values reduce randomness further.
`num_ctx`	2048 – 131072	Context window in tokens. Llama 3.1 supports up to 128K, but larger values consume proportionally more RAM. 8192 is a practical default for CPU-only setups.
`repeat_penalty`	1.0 – 1.3	Penalises the model for repeating tokens it has already produced. Values above 1.15 can harm coherence on long responses.

Testing the installation

The following tests progress from basic connectivity through to functional validation. Run them in order.

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version

active{"version":"0.x.x"}

Test 2 -> Model is registered

ollama list

Expected:

llama3.1:8b appears in the output with a size around 4.9 GB.

Test 3 -> Single-shot inference

Send a non-trivial prompt that requires actual reasoning, not just pattern completion. This confirms the model loads correctly and produces coherent output:

ollama run llama3.1:8b "Explain the difference between a process and a thread in Linux. Be concise."

On a CPU-only system, expect the first token in 10–30 seconds as the model loads into RAM. Once loaded, tokens stream at roughly 5–15 per second depending on hardware. Any coherent response means the model is functional.

Monitor RAM during the first load Run watch -n1 free -h in a second terminal while inference runs. This confirms the model fits in RAM and the system is not swapping. If you see swap growing, close other applications before continuing.

Test 4 -> Multi-turn context retention

Open an interactive session and verify the model correctly tracks information across turns:

ollama run llama3.1:8b

>>> The server I am configuring runs Ubuntu 24.04 with 32 GB RAM.
Understood. A well-provisioned Ubuntu 24.04 server with 32 GB RAM gives you comfortable headroom for most workloads. What are you configuring?

>>> How much RAM does my server have?
Your server has 32 GB of RAM, as you mentioned.

The model correctly recalling the RAM figure from earlier in the conversation confirms multi-turn context is working. Exit with /bye or Ctrl+D.

Test 5 -> Custom variant

If you built the custom Modelfile in section 08, verify it runs and reflects the system prompt:

ollama run llama31-custom "Write a bash function that checks if a port is open on a remote host."

Expect clean, commented bash code reflecting the system prompt instruction to comment non-obvious choices.

Test 6 -> Process state

After any inference, check the model’s loaded state and resource usage:

ollama ps

NAME ID SIZE PROCESSOR UNTIL llama3.1:8b a1b2c3d4e5f6 6.0 GB 100% CPU X minutes from now

PROCESSOR: 100% CPU confirms no GPU is in use. UNTIL shows when the model will be unloaded from RAM based on the OLLAMA_KEEP_ALIVE value set in section 07. After this timer expires, the next request will reload the model from disk, adding initial latency.

All tests passing? If the service is active, the model is listed, single-shot and interactive prompts both produce coherent output, and ollama ps shows the model loaded on CPU, the installation is complete and correctly configured.

Troubleshooting

Symptom	Likely cause	Fix
Connection refused on port 11434	Service not running	`sudo systemctl start ollama` then check `journalctl -u ollama -n 30`
`ollama` command not found after install	Group membership not applied to current session	Log out and back in, or run `newgrp ollama`. Binary is at `/usr/local/bin/ollama`.
Model pull stalls mid-download	Network interruption	Re-run `ollama pull llama3.1:8b` – downloads resume from the last completed chunk
Inference runs but swap grows steadily	Insufficient free RAM for model + KV cache	Close other processes. Reduce context window in Modelfile: `PARAMETER num_ctx 2048`. Consider `llama3.2:3b` instead.
Token generation under 2 tokens/sec	Swap activity or AVX2 absent	Check `free -h` during inference. Check AVX2: `grep -o 'avx2' /proc/cpuinfo \| head -1`
`ollama create` fails with model not found	Base model not yet pulled	Run `ollama pull llama3.1:8b` before `ollama create`
Configuration changes not taking effect	systemd not reloaded after override edit	`sudo systemctl daemon-reload && sudo systemctl restart ollama`
Model unloads between every request	`OLLAMA_KEEP_ALIVE` too short	Set `Environment="OLLAMA_KEEP_ALIVE=10m"` in the systemd override (section 07)

Diagnostic commands

# Stream service logs in real time
sudo journalctl -u ollama -f

# Check disk space available for model storage
df -h /usr/share/ollama

# Watch RAM usage while inference runs
watch -n1 'free -h'

# Confirm AVX2 instruction set is available
grep -o 'avx2' /proc/cpuinfo | head -1

# Show the installed service unit
cat /etc/systemd/system/ollama.service

# Remove a model to recover disk space
ollama rm llama3.1:8b

Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04

Stack

Prerequisites and hardware requirements

Table of Contents

Overview and architecture

The Llama 3.x model family

Verify AVX2 support

Install prerequisites

Installing Ollama

Verifying the service

Pulling Llama 3.1 8B

Quantization options

Configuring Ollama

Key environment variables

Applying configuration via systemd override

Tuning CPU thread count

Creating a custom Modelfile

Key Modelfile parameters

Testing the installation

Test 1 -> Service and API health

Test 2 -> Model is registered

Test 3 -> Single-shot inference

Test 4 -> Multi-turn context retention

Test 5 -> Custom variant

Test 6 -> Process state

Troubleshooting

Diagnostic commands

Ivan Dabić

Ivan Dabić

Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04

Stack

Prerequisites and hardware requirements

Table of Contents

Overview and architecture

The Llama 3.x model family

Verify AVX2 support

Install prerequisites

Installing Ollama

Verifying the service

Pulling Llama 3.1 8B

Quantization options

Configuring Ollama

Key environment variables

Applying configuration via systemd override

Tuning CPU thread count

Creating a custom Modelfile

Key Modelfile parameters

Testing the installation

Test 1 -> Service and API health

Test 2 -> Model is registered

Test 3 -> Single-shot inference

Test 4 -> Multi-turn context retention

Test 5 -> Custom variant

Test 6 -> Process state

Troubleshooting

Diagnostic commands

Ivan Dabić

Ivan Dabić

Subscribe to our blog

Confirm Your Email Address