Stack
- Ubuntu 24.04 LTS
- Ollama · Llama 3.1 8B
- CPU-only
Prerequisites and hardware requirements
- Ubuntu 24.04 LTS, amd64 or arm64
- 16 GB RAM minimum for comfortable CPU inference (8 GB is the hard floor, but expect swapping)
- 4-core CPU with AVX2 support (Intel Haswell 2013+, AMD Ryzen any generation)
- 10 GB free disk space (model is ~4.9 GB; allow headroom for OS and temp files)
- sudo access and internet connectivity
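The requirements above can be checked before installing anything. The sketch below is one possible way to do it and is not part of Ollama: the function names `classify_ram_kib` and `preflight` are made up for this guide, and the thresholds sit slightly below 16 and 8 GiB on the assumption that the kernel reports a little less than the installed RAM.

```shell
#!/bin/sh
# classify_ram_kib: hypothetical helper. Takes MemTotal in KiB (the unit
# used by /proc/meminfo) and prints a verdict based on this guide's advice.
classify_ram_kib() {
    gib=$(( $1 / 1024 / 1024 ))
    if [ "$gib" -ge 15 ]; then
        echo "ok"            # roughly 16 GB installed
    elif [ "$gib" -ge 7 ]; then
        echo "tight"         # roughly 8 GB installed: expect swapping
    else
        echo "insufficient"
    fi
}

# preflight: hypothetical wrapper that reads live values from the system.
# Assumes GNU df (for --output) and that models will be stored under /.
preflight() {
    mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    free_gb=$(df -BG --output=avail / | tail -1 | tr -dc '0-9')
    echo "RAM:   $(classify_ram_kib "$mem_kib")"
    echo "Cores: $(nproc) (4 or more recommended)"
    echo "Disk:  ${free_gb} GB free on / (10 GB or more recommended)"
}
```

Running `preflight` after sourcing the file prints one line per requirement; nothing is changed on the system.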
Table of Contents
- Overview and architecture
- The Llama 3.x model family
- Prerequisites and hardware requirements
- Installing Ollama on Ubuntu
- Verifying the service
- Pulling Llama 3.1 8B
- Configuring Ollama
- Creating a custom Modelfile
- Testing the installation
- Troubleshooting
Overview and architecture
This guide covers running Meta’s Llama 3.1 8B locally on Ubuntu 24.04, using Ollama as the runtime and management layer. The entire stack runs on CPU, with no GPU or proprietary drivers required.
Ollama handles model download, storage, quantization selection, and serving. Llama 3.1 provides the inference capabilities. Understanding what each layer does helps when diagnosing problems:
| Component | Role | Installed to |
|---|---|---|
| Ollama binary | CLI and server process. Downloads GGUF model files, serves the REST API on 127.0.0.1:11434, and manages model lifecycle (load, unload, keep-alive). Built on top of llama.cpp. | /usr/local/bin/ollama |
| Ollama systemd service | Keeps the Ollama server process running across reboots. Created automatically by the installer. | /etc/systemd/system/ollama.service |
| Llama 3.1 8B weights | GGUF-format model file at Q4_K_M quantization. ~4.9 GB on disk. Loaded into RAM on first inference request. | /usr/share/ollama/.ollama/models/ |
What is llama.cpp? Ollama uses llama.cpp under the hood as the inference engine. llama.cpp is a C++ implementation of the Llama model architecture optimised for CPU inference, including AVX2/AVX-512 vectorisation and NUMA-aware memory layout. You do not need to interact with llama.cpp directly (Ollama abstracts it entirely), but it is why CPU-only inference is viable at all.
The Llama 3.x model family
Meta has released three generations of Llama 3 models. Not every size exists in every generation, which causes confusion when choosing a tag. The table below covers the variants available in Ollama and their CPU-only viability:
| Model tag | Parameters | Disk (Q4_K_M) | Min RAM | CPU-only viable? |
|---|---|---|---|---|
| llama3.2:3b | 3B | ~2.0 GB | 4 GB | Yes – fast, lightweight, edge/dev use |
| llama3.1:8b | 8B | ~4.9 GB | 16 GB | Yes – recommended default |
| llama3.3:70b | 70B | ~43 GB | 64 GB+ | No – impractically slow on CPU |
There is no Llama 3.2 8B variant and no Llama 3.3 8B. The 8B slot in the Llama 3.x family belongs exclusively to Llama 3.1. This guide uses llama3.1:8b throughout. Model differences:
llama3.1:8b (This guide’s version)
General chat, instruction following, tool use, 128K context. Multilingual. Best balance of quality and CPU performance.
~4.9 GB · 16 GB RAM recommended
llama3.2:3b
Lighter alternative when RAM is constrained or response speed is the priority over quality. Supports 8 languages.
~2.0 GB · 4 GB RAM minimum
llama3.3:70b
Flagship quality model. Practical only on systems with 64 GB+ RAM and even then inference is slow without a GPU.
~43 GB · GPU strongly recommended
Why not llama3.2 or llama3.3 for the 8B slot? At the time of writing, Meta had not released an 8B model in the 3.2 generation; that generation focused on small (1B, 3B) and vision variants. Llama 3.3 jumped straight to 70B. Llama 3.1 8B therefore remains the current 8B-class Llama model in the Ollama library, and it is actively maintained.
Verify AVX2 support
Ollama’s CPU inference path benefits significantly from AVX2. Confirm it before continuing:
grep -o 'avx2' /proc/cpuinfo | head -1
No output means AVX2 is absent. Ollama will still run but inference will be noticeably slower. If your CPU predates 2013 or is a low-power Atom/Celeron variant, AVX2 may be missing.
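Because the check above signals "absent" through empty output, it is easy to misread in a script. A minimal sketch of a wrapper that prints an explicit verdict follows; `has_avx2` and `report_avx2` are hypothetical names, and the optional file argument exists only so the logic can be exercised against a saved copy of cpuinfo. On arm64 the flag is never listed, since AVX2 is an x86 extension.

```shell
# has_avx2: hypothetical helper. Succeeds when the given cpuinfo file
# (default: /proc/cpuinfo) lists avx2 as a whole word.
has_avx2() {
    grep -qw 'avx2' "${1:-/proc/cpuinfo}"
}

# report_avx2: prints a one-line verdict instead of relying on empty output.
report_avx2() {
    if has_avx2 "$@"; then
        echo "AVX2 present"
    else
        echo "AVX2 absent: expect slower inference"
    fi
}
```

Calling `report_avx2` with no argument inspects the running machine.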
Install prerequisites
sudo apt update
sudo apt install -y curl zstd
Installing Ollama
The official installer script handles everything: binary placement, system user creation, and systemd service registration.
Inspect before running: To review the installer before executing it, run curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then sh install.sh when ready.
curl -fsSL https://ollama.com/install.sh | sh
The installer prints its actions as it runs.
Confirm the binary is reachable:
ollama --version
What the installer creates: A system user named ollama (no login shell) owns the service process. When running as a service, model files are stored under /usr/share/ollama/.ollama/models/. The installer adds your current user to the ollama group so you can run CLI commands without sudo. Log out and back in if the group membership does not take effect immediately.
Verifying the service
Check that the systemd unit is active and enabled to start on boot:
sudo systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
Active: active (running) since ...
Main PID: 12345 (ollama)
Tasks: 9
Memory: 30.2M
CPU: 210ms
CGroup: /system.slice/ollama.service
└─12345 /usr/local/bin/ollama serve
Verify the REST API is responding:
curl -fsS http://127.0.0.1:11434/api/version
If either check fails, restart and inspect the logs:
sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager
Pulling Llama 3.1 8B
The pull command downloads the model from the Ollama library and caches it locally. The default tag for llama3.1:8b fetches Q4_K_M quantization, which is the right choice for CPU-only inference: at ~4.9 GB it is roughly a third of the size of the 16-bit weights (about 16 GB) while keeping quality loss negligible for most tasks.
ollama pull llama3.1:8b
The download is approximately 4.9 GB. Progress is shown inline and resumes automatically if interrupted.
Confirm the model is registered locally:
ollama list
Quantization options
If you need a different quantization than the default Q4_K_M, pull a specific tag. Useful in two scenarios: you have extra RAM and want higher quality (Q8_0), or you are severely RAM-constrained and need a smaller footprint (Q2_K, though quality degrades noticeably).
# Higher quality, higher RAM usage (~8.5 GB on disk)
ollama pull llama3.1:8b-instruct-q8_0
# Default - recommended for most CPU-only setups
ollama pull llama3.1:8b
# Smaller footprint (~2.7 GB), visible quality trade-off
ollama pull llama3.1:8b-instruct-q2_K
Stick with Q4_K_M: For CPU-only inference, Q4_K_M is the right default. The quality difference between Q4_K_M and Q8_0 is small enough that most real-world tasks will not reveal it, but Q8_0 uses nearly double the RAM. Only move to Q8_0 if you have 32+ GB RAM and a specific quality-sensitive workload.
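The disk sizes quoted for each quantization follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below illustrates that estimate; `estimate_gb` is a hypothetical helper, 8.03 billion is the commonly cited parameter count for Llama 3.1 8B, and the bits-per-weight figures are assumptions (8.5 for Q8_0 follows from its block layout, while roughly 4.9 for Q4_K_M is an approximation because K-quants mix tensor types).

```shell
# estimate_gb: hypothetical helper. Approximate model file size in GB
# (10^9 bytes) from parameter count in billions and bits per weight.
estimate_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# Examples (values are estimates, not measurements):
#   estimate_gb 8.03 4.9    # Q4_K_M  -> 4.9
#   estimate_gb 8.03 8.5    # Q8_0    -> 8.5
#   estimate_gb 8.03 16     # 16-bit  -> 16.1
```

The results line up with the ~4.9 GB and ~8.5 GB figures quoted in this section, which is a useful sanity check when comparing other tags.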
Configuring Ollama
Ollama is configured through environment variables. The correct place to set them on a systemd installation is a drop-in override file, not your shell profile. The shell profile only affects interactive sessions, not the background service process.
Key environment variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address for the REST API |
| OLLAMA_MODELS | /usr/share/ollama/.ollama/models | Directory where downloaded model files are stored |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent inference requests; keep at 1 for CPU-only to avoid RAM contention |
| OLLAMA_MAX_LOADED_MODELS | 1 | Models held in memory simultaneously |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model loaded after the last request before unloading it |
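| num_thread (model parameter, not an environment variable) | Chosen automatically by Ollama | CPU threads used for inference. Ollama does not document an OLLAMA_NUM_THREADS variable; set num_thread in a Modelfile or in the options of an API request instead |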
Applying configuration via systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model warm between requests (reduces reload latency)
Environment="OLLAMA_KEEP_ALIVE=10m"
# Restrict to localhost - remove or change for network access
Environment="OLLAMA_HOST=127.0.0.1:11434"
# One model in memory at a time - appropriate for CPU-only systems
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF
Reload and restart to apply:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Tuning CPU thread count
Ollama chooses a thread count automatically from the cores it detects. On systems running other services alongside Ollama, leaving one or two cores free prevents the inference load from starving the OS and other processes. Ollama does not document an OLLAMA_NUM_THREADS environment variable; the thread count is set per model with the num_thread parameter, either in a Modelfile (see Creating a custom Modelfile) or in the options of an API request:
# Check how many cores are available
nproc
# Add to your Modelfile if you want to reserve cores for other workloads
# Replace 6 with nproc minus 2
PARAMETER num_thread 6
Network exposure: The default bind address 127.0.0.1 means Ollama is not reachable from other machines on the network. If you need remote access, set OLLAMA_HOST=0.0.0.0:11434 and enforce access control at the firewall level with ufw. Ollama has no built-in authentication, so never expose port 11434 directly to the public internet.
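For the remote-access case described above, the firewall rules might look like the following. This is a sketch under assumptions: `ollama_ufw_rules` is a hypothetical helper, 192.168.1.0/24 in the usage note is a placeholder for your own trusted subnet, and the function only prints the ufw commands so they can be reviewed before being run by hand.

```shell
# ollama_ufw_rules: hypothetical helper. Prints (does not run) ufw commands
# that allow one subnet to reach the Ollama port and deny everyone else.
ollama_ufw_rules() {
    subnet=$1
    port=${2:-11434}
    # ufw applies the first matching rule, so the allow must come first.
    printf 'sudo ufw allow from %s to any port %s proto tcp\n' "$subnet" "$port"
    printf 'sudo ufw deny %s/tcp\n' "$port"
}
```

For example, `ollama_ufw_rules 192.168.1.0/24` prints two commands to paste into a terminal. The deny rule is redundant when ufw's default incoming policy is already deny, but it makes the intent explicit.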
Creating a custom Modelfile
A Modelfile lets you bake a persistent system prompt and inference parameters into a named model variant. This is useful when you want consistent behaviour across sessions without repeating yourself in every prompt.
mkdir -p ~/ollama-models
cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Base model
FROM llama3.1:8b
# System prompt applied to every conversation
SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When providing code, include comments that explain
non-obvious choices. When unsure, say so explicitly rather than guessing.
"""
# Inference parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
EOF
Build the custom variant:
ollama create llama31-custom -f ~/ollama-models/Modelfile.llama31
ollama list
NAME                     ID              SIZE      MODIFIED
llama31-custom:latest    a1b2c3d4e5f6    4.9 GB    X seconds ago
llama3.1:8b              b2c3d4e5f6a1    4.9 GB    Y minutes ago
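To confirm that the parameters were actually baked into the new variant, the stored Modelfile can be printed with `ollama show llama31-custom --modelfile` and inspected. The sketch below adds a small filter for that output; `modelfile_param` is a hypothetical helper that reads Modelfile text from a file or from standard input.

```shell
# modelfile_param: hypothetical helper. Prints the value of one PARAMETER
# line from Modelfile text (file argument or stdin); fails if it is not set.
modelfile_param() {
    name=$1
    shift
    awk -v name="$name" '
        toupper($1) == "PARAMETER" && $2 == name { print $3; found = 1 }
        END { if (!found) exit 1 }
    ' "$@"
}
```

A typical use is `ollama show llama31-custom --modelfile | modelfile_param num_ctx`, which should print the context size set in the Modelfile above.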
Key Modelfile parameters
| Parameter | Typical range | Effect |
|---|---|---|
| temperature | 0.0 – 1.0 | Higher values increase output variety; lower values make responses more deterministic. 0.7 is a solid default for general use. |
| top_p | 0.7 – 0.95 | Nucleus sampling threshold. Restricts token selection to the top probability mass. Lower values reduce randomness further. |
| num_ctx | 2048 – 131072 | Context window in tokens. Llama 3.1 supports up to 128K, but larger values consume proportionally more RAM. 8192 is a practical default for CPU-only setups. |
| repeat_penalty | 1.0 – 1.3 | Penalises the model for repeating tokens it has already produced. Values above 1.15 can harm coherence on long responses. |
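The RAM cost of num_ctx can be estimated from the size of the key/value cache, which grows linearly with the context window. The sketch below is an approximation under stated assumptions: it uses the published Llama 3.1 8B shape (32 layers, 8 key/value heads, 128 dimensions per head), assumes 16-bit cache entries, and ignores the compute buffers Ollama allocates on top. `kv_cache_mib` is a hypothetical helper name.

```shell
# kv_cache_mib: hypothetical helper. Estimates KV-cache size in MiB for a
# given num_ctx, assuming Llama 3.1 8B's shape and a 16-bit cache.
kv_cache_mib() {
    layers=32       # transformer layers
    kv_heads=8      # key/value heads (grouped-query attention)
    head_dim=128    # dimensions per head
    bytes=2         # bytes per cache entry (16-bit, an assumption)
    # The leading factor 2 covers the separate key and value tensors.
    per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
    echo $(( $1 * per_token / 1024 / 1024 ))
}

# Examples:
#   kv_cache_mib 2048      # -> 256
#   kv_cache_mib 8192      # -> 1024
#   kv_cache_mib 131072    # -> 16384
```

Under these assumptions the 8192-token default adds about 1 GiB on top of the model weights, while the full 128K window would add about 16 GiB, which is why large contexts are impractical on a 16 GB machine.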
Testing the installation
The following tests progress from basic connectivity through to functional validation. Run them in order.
Test 1 -> Service and API health
sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version
Expected:
active
{"version":"0.x.x"}
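Immediately after a start or restart, the API may need a moment before it answers, so a single curl can fail even on a healthy install. One way to make Test 1 robust is to poll with a bounded retry, sketched below. The helper names (`retry`, `extract_version`, `wait_for_ollama`) and the choice of 10 attempts at 2-second intervals are assumptions for illustration, not Ollama features.

```shell
# retry: hypothetical helper. Runs a command up to ATTEMPTS times, sleeping
# DELAY seconds between tries; succeeds as soon as the command succeeds.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$(( n + 1 ))
        sleep "$delay"
    done
    return 0
}

# extract_version: pulls the version string out of the JSON returned by
# /api/version, for example {"version":"0.x.x"}.
extract_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# wait_for_ollama: polls the API until it answers, then prints the version.
wait_for_ollama() {
    url=${1:-http://127.0.0.1:11434/api/version}
    retry 10 2 curl -fsS -o /dev/null "$url" || return 1
    extract_version "$(curl -fsS "$url")"
}
```

Calling `wait_for_ollama` prints the server version once the API responds and returns non-zero if it never does within the retry budget.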
Test 2 -> Model is registered
ollama list
Expected: llama3.1:8b appears in the output with a size around 4.9 GB.
Test 3 -> Single-shot inference
Send a non-trivial prompt that requires actual reasoning, not just pattern completion. This confirms the model loads correctly and produces coherent output:
ollama run llama3.1:8b "Explain the difference between a process and a thread in Linux. Be concise."
On a CPU-only system, expect the first token in 10–30 seconds as the model loads into RAM. Once loaded, tokens stream at roughly 5–15 per second depending on hardware. Any coherent response means the model is functional.
Monitor RAM during the first load: Run watch -n1 free -h in a second terminal while inference runs. This confirms the model fits in RAM and the system is not swapping. If you see swap growing, close other applications before continuing.
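The tokens-per-second figure mentioned in Test 3 can be measured rather than eyeballed. A non-streaming request to the /api/generate endpoint returns eval_count (tokens generated) and eval_duration (nanoseconds), and the ratio gives the generation speed. The sketch below shows one way to compute it without jq; the helper names and the "Say hello." prompt are assumptions for illustration.

```shell
# json_number: hypothetical helper. Extracts one numeric field from compact
# JSON without requiring jq. $1 = field name, $2 = JSON text.
json_number() {
    printf '%s\n' "$2" |
        sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# tokens_per_sec: eval_count tokens divided by eval_duration nanoseconds.
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# measure_speed: sends one non-streaming request and prints tokens/second.
measure_speed() {
    reply=$(curl -fsS http://127.0.0.1:11434/api/generate \
        -d '{"model":"llama3.1:8b","prompt":"Say hello.","stream":false}') || return 1
    tokens_per_sec "$(json_number eval_count "$reply")" \
                   "$(json_number eval_duration "$reply")"
}
```

Run `measure_speed` twice: the first call includes model load time in its wall-clock duration, but eval_duration covers only generation, so the printed rate should be comparable across both runs.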
Test 4 -> Multi-turn context retention
Open an interactive session and verify the model correctly tracks information across turns:
ollama run llama3.1:8b
>>> The server I am configuring runs Ubuntu 24.04 with 32 GB RAM.
Understood. A well-provisioned Ubuntu 24.04 server with 32 GB RAM gives you comfortable headroom for most workloads. What are you configuring?
>>> How much RAM does my server have?
Your server has 32 GB of RAM, as you mentioned.
The model correctly recalling the RAM figure from earlier in the conversation confirms multi-turn context is working. Exit with /bye or Ctrl+D.
Test 5 -> Custom variant
If you built the custom variant in the Creating a custom Modelfile section, verify it runs and reflects the system prompt:
ollama run llama31-custom "Write a bash function that checks if a port is open on a remote host."
Expect clean, commented bash code reflecting the system prompt instruction to comment non-obvious choices.
Test 6 -> Process state
After any inference, check the model’s loaded state and resource usage:
ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.0 GB    100% CPU     X minutes from now
PROCESSOR: 100% CPU confirms no GPU is in use. UNTIL shows when the model will be unloaded from RAM based on the OLLAMA_KEEP_ALIVE value set in the Configuring Ollama section. After this timer expires, the next request will reload the model from disk, adding initial latency.
All tests passing? If the service is active, the model is listed, single-shot and interactive prompts both produce coherent output, and ollama ps shows the model loaded on CPU, the installation is complete and correctly configured.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Connection refused on port 11434 | Service not running | sudo systemctl start ollama then check journalctl -u ollama -n 30 |
| ollama command not found after install | /usr/local/bin is not on PATH, or the shell has cached an earlier lookup | Open a new shell or run hash -r. Confirm the binary exists at /usr/local/bin/ollama and that /usr/local/bin is in $PATH. |
| Model pull stalls mid-download | Network interruption | Re-run ollama pull llama3.1:8b – downloads resume from the last completed chunk |
| Inference runs but swap grows steadily | Insufficient free RAM for model + KV cache | Close other processes. Reduce context window in Modelfile: PARAMETER num_ctx 2048. Consider llama3.2:3b instead. |
| Token generation under 2 tokens/sec | Swap activity or AVX2 absent | Check free -h during inference. Check AVX2: grep -m1 -o 'avx2' /proc/cpuinfo |
| ollama create fails with model not found | Base model not yet pulled | Run ollama pull llama3.1:8b before ollama create |
| Configuration changes not taking effect | systemd not reloaded after override edit | sudo systemctl daemon-reload && sudo systemctl restart ollama |
| Model unloads between every request | OLLAMA_KEEP_ALIVE too short | Set Environment="OLLAMA_KEEP_ALIVE=10m" in the systemd override (see Configuring Ollama) |
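For the configuration rows in the table above, it helps to confirm what the running unit actually sees. `systemctl show ollama --property=Environment` prints a single line of space-separated KEY=value pairs; the sketch below picks one value out of that line. `unit_env_value` is a hypothetical helper, and it assumes values contain no spaces or shell wildcard characters.

```shell
# unit_env_value: hypothetical helper. Given the line printed by
# `systemctl show ollama --property=Environment`, prints the value of KEY.
# $1 = KEY, $2 = the Environment=... line.
unit_env_value() {
    key=$1
    line=${2#Environment=}
    for pair in $line; do      # intentional word splitting on spaces
        case $pair in
            "$key"=*) printf '%s\n' "${pair#*=}"; return 0 ;;
        esac
    done
    return 1                   # KEY is not set on the unit
}
```

For example, `unit_env_value OLLAMA_KEEP_ALIVE "$(systemctl show ollama --property=Environment)"` should print 10m once the override from the Configuring Ollama section is loaded. A non-zero exit status means the variable never reached the service, which usually points at a missing daemon-reload.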
Diagnostic commands
# Stream service logs in real time
sudo journalctl -u ollama -f
# Check disk space available for model storage
df -h /usr/share/ollama
# Watch RAM usage while inference runs
watch -n1 'free -h'
# Confirm AVX2 instruction set is available
grep -o 'avx2' /proc/cpuinfo | head -1
# Show the installed service unit
cat /etc/systemd/system/ollama.service
# Remove a model to recover disk space
ollama rm llama3.1:8b