Installing and Running Llama 3.1 via Ollama on Ubuntu 24.04

Prerequisites and hardware requirements

  • Ubuntu 24.04 LTS, amd64 or arm64
  • 16 GB RAM minimum for comfortable CPU inference (8 GB is the hard floor, but expect swapping)
  • 4-core CPU with AVX2 support (Intel Haswell 2013+, AMD Ryzen any generation)
  • 10 GB free disk space (model is ~4.9 GB; allow headroom for OS and temp files)
  • sudo access and internet connectivity
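
The RAM, CPU, and disk figures above can be checked in one pass. The following is a minimal sketch, not part of the original guide: the helper names (meets_minimum, report) are made up for illustration, the RAM figure is rounded up because the kernel reports slightly less than the installed amount, and disk space is measured on the root filesystem on the assumption that models will live there.

```shell
#!/bin/sh
# Preflight check against the minimums listed above (8 GB RAM floor, 4 cores, 10 GB disk).
# Helper names are illustrative, not part of Ollama.

# meets_minimum ACTUAL REQUIRED -> exit status 0 if ACTUAL >= REQUIRED
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# report LABEL ACTUAL REQUIRED -> one status line per requirement
report() {
    if meets_minimum "$2" "$3"; then
        echo "ok    $1: $2 (need >= $3)"
    else
        echo "WARN  $1: $2 (need >= $3)"
    fi
}

# Total RAM in GiB, rounded up (MemTotal is reported in kB).
ram_gib=$(awk '/^MemTotal:/ { g = $2 / 1048576; c = int(g); if (g > c) c++; printf "%d", c }' /proc/meminfo)
# Logical CPU count.
cores=$(nproc)
# Free space on the root filesystem in whole GiB (-P keeps each filesystem on one line).
disk_gib=$(df -P -k / | awk 'NR == 2 { printf "%d", $4 / 1048576 }')

report "RAM GiB"       "$ram_gib"  8
report "CPU cores"     "$cores"    4
report "Free disk GiB" "$disk_gib" 10
```

Raise the RAM threshold from 8 to 16 to test against the comfortable figure rather than the hard floor.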

Table of Contents

  1. Overview and architecture
  2. The Llama 3.x model family
  3. Prerequisites and hardware requirements
  4. Installing Ollama on Ubuntu
  5. Verifying the service
  6. Pulling Llama 3.1 8B
  7. Configuring Ollama
  8. Creating a custom Modelfile
  9. Testing the installation
  10. Troubleshooting

Overview and architecture

This guide covers running Meta’s Llama 3.1 8B locally on Ubuntu 24.04, using Ollama as the runtime and management layer. The entire stack runs on CPU, with no GPU or proprietary drivers required.

Ollama handles model download, storage, quantization selection, and serving. Llama 3.1 provides the inference capabilities. Understanding what each layer does helps when diagnosing problems:

| Component | Role | Installed to |
| --- | --- | --- |
| Ollama binary | CLI and server process. Downloads GGUF model files, serves the REST API on 127.0.0.1:11434, and manages model lifecycle (load, unload, keep-alive). Built on top of llama.cpp. | /usr/local/bin/ollama |
| Ollama systemd service | Keeps the Ollama server process running across reboots. Created automatically by the installer. | /etc/systemd/system/ollama.service |
| Llama 3.1 8B weights | GGUF-format model file at Q4_K_M quantization. ~4.9 GB on disk. Loaded into RAM on first inference request. | /usr/share/ollama/.ollama/models/ |

What is llama.cpp? Ollama uses llama.cpp under the hood as the inference engine. llama.cpp is a C++ implementation of the Llama model architecture optimised for CPU inference, including AVX2/AVX-512 vectorisation and NUMA-aware memory layout. You do not need to interact with llama.cpp directly (Ollama abstracts it entirely), but it is why CPU-only inference is viable at all.

The Llama 3.x model family

Meta has followed the original Llama 3 with three point releases: 3.1, 3.2, and 3.3. Not every size exists in every release, which causes confusion when choosing a tag. The table below covers the variants available in Ollama and their CPU-only viability:

| Model tag | Parameters | Disk (Q4_K_M) | Min RAM | CPU-only viable? |
| --- | --- | --- | --- | --- |
| llama3.2:3b | 3B | ~2.0 GB | 4 GB | Yes – fast, lightweight, edge/dev use |
| llama3.1:8b | 8B | ~4.9 GB | 16 GB | Yes – recommended default |
| llama3.3:70b | 70B | ~43 GB | 64 GB+ | No – impractically slow on CPU |

There is no Llama 3.2 8B variant and no Llama 3.3 8B. Among the 3.1, 3.2, and 3.3 releases, the 8B slot belongs exclusively to Llama 3.1. This guide uses llama3.1:8b throughout. Model differences:

llama3.1:8b (This guide’s version)
General chat, instruction following, tool use, 128K context. Multilingual. Best balance of quality and CPU performance.
~4.9 GB · 16 GB RAM recommended

llama3.2:3b
Lighter alternative when RAM is constrained or response speed is the priority over quality. Supports 8 languages.
~2.0 GB · 4 GB RAM minimum

llama3.3:70b
Flagship quality model. Practical only on systems with 64 GB+ RAM and even then inference is slow without a GPU.
~43 GB · GPU strongly recommended

Why not llama3.2 or llama3.3 for the 8B slot? At the time of writing, Meta has not released an 8B model in the 3.2 generation, which focused on small (1B, 3B) and vision variants. Llama 3.3 jumped straight to 70B. Llama 3.1 8B therefore remains the current 8B-class Llama model in the Ollama library, and it is actively maintained.

Verify AVX2 support

Ollama’s CPU inference path benefits significantly from AVX2. Confirm it before continuing:

grep -o 'avx2' /proc/cpuinfo | head -1

No output means AVX2 is absent. Ollama will still run but inference will be noticeably slower. If your CPU predates 2013 or is a low-power Atom/Celeron variant, AVX2 may be missing.
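
To see more than a yes/no for AVX2, the same /proc/cpuinfo lookup can be repeated for related SIMD flags. This is a sketch: cpu_has_flag is an illustrative helper, and the list of flags is a selection chosen here, not something Ollama documents as required. On arm64 none of these will appear, because AVX2 is an x86 extension.

```shell
# cpu_has_flag FLAG [CPUINFO_FILE] -> exit 0 if FLAG appears as a whole word.
# -w stops "avx" from matching inside "avx2" or "avx512f".
cpu_has_flag() {
    grep -qw -- "$1" "${2:-/proc/cpuinfo}"
}

for flag in avx avx2 avx512f fma f16c; do
    if cpu_has_flag "$flag"; then
        echo "$flag: present"
    else
        echo "$flag: absent"
    fi
done
```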

Install prerequisites

sudo apt update
sudo apt install -y curl zstd

Installing Ollama

The official installer script handles everything: binary placement, system user creation, and systemd service registration.

Inspect before running: to review the installer before executing it, run curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh, then sh install.sh when ready.

curl -fsSL https://ollama.com/install.sh | sh

The installer prints its actions as it runs.

Confirm the binary is reachable:

ollama --version

What the installer creates: a system user named ollama (no login shell) owns the service process. When running as a service, model files are stored under /usr/share/ollama/.ollama/models/. The installer also adds your current user to the ollama group; log out and back in if the group membership does not take effect immediately. CLI commands such as ollama list talk to the server over the local API, so they do not need sudo.

Verifying the service

Check that the systemd unit is active and enabled to start on boot:

sudo systemctl status ollama
● ollama.service - Ollama Service 
      Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled) 
      Active: active (running) since ... 
    Main PID: 12345 (ollama) 
       Tasks: 9 
      Memory: 30.2M 
         CPU: 210ms 
      CGroup: /system.slice/ollama.service 
              └─12345 /usr/local/bin/ollama serve

Verify the REST API is responding:

curl -fsS http://127.0.0.1:11434/api/version

If either check fails, restart and inspect the logs:

sudo systemctl restart ollama
sudo journalctl -u ollama -n 50 --no-pager
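
After a restart the API can take a moment to come up, so a one-shot curl may fail even though the service is healthy. A small polling helper avoids that race. This is a sketch: wait_for_ollama is an illustrative name, and the 30-second default timeout is an arbitrary choice.

```shell
# wait_for_ollama [BASE_URL] [TIMEOUT_SECONDS]
# Polls /api/version about once per second.
# Exit status 0 once the API answers, 1 if the timeout runs out first.
wait_for_ollama() {
    wfo_url="${1:-http://127.0.0.1:11434}"
    wfo_left="${2:-30}"
    while [ "$wfo_left" -gt 0 ]; do
        if curl -fsS --max-time 2 "$wfo_url/api/version" >/dev/null 2>&1; then
            return 0
        fi
        wfo_left=$((wfo_left - 1))
        sleep 1
    done
    return 1
}
```

Typical use: sudo systemctl restart ollama && wait_for_ollama && echo "API is up".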

Pulling Llama 3.1 8B

The pull command downloads the model from the Ollama library and caches it locally. The default llama3.1:8b tag fetches the Q4_K_M quantization, which is the right choice for CPU-only inference: at ~4.9 GB it is less than a third the size of the 16-bit weights (roughly 16 GB) while keeping quality loss negligible for most tasks.

ollama pull llama3.1:8b

The download is approximately 4.9 GB. Progress is shown inline and resumes automatically if interrupted.

Confirm the model is registered locally:

ollama list
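
The same check can be scripted against the REST API, which lists local models at /api/tags. The sketch below matches the model name with plain text tools to avoid a jq dependency; tags_has_model is an illustrative helper, and the matching assumes each model appears as a "name" field in the response.

```shell
# tags_has_model JSON MODEL -> exit 0 if MODEL appears as a "name" value.
# Whitespace is stripped first so the match works on compact or pretty-printed JSON.
tags_has_model() {
    printf '%s' "$1" | tr -d '[:space:]' | grep -Fq "\"name\":\"$2\""
}

# Empty if the server is unreachable, which falls through to "not found".
tags_json=$(curl -fsS http://127.0.0.1:11434/api/tags 2>/dev/null)

if tags_has_model "$tags_json" "llama3.1:8b"; then
    echo "llama3.1:8b is available locally"
else
    echo "llama3.1:8b not found (or the API is unreachable)"
fi
```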

Quantization options

If you need a different quantization than the default Q4_K_M, pull a specific tag. This is useful in two scenarios: you have extra RAM and want higher quality (Q8_0), or you are severely RAM-constrained and need a smaller footprint (Q2_K, though quality degrades noticeably).

# Higher quality, higher RAM usage (~8.5 GB on disk)
ollama pull llama3.1:8b-instruct-q8_0

# Default - recommended for most CPU-only setups
ollama pull llama3.1:8b

# Smaller footprint (~2.7 GB), visible quality trade-off
ollama pull llama3.1:8b-instruct-q2_K

Stick with Q4_K_M: for CPU-only inference, Q4_K_M is the right default. The quality difference between Q4_K_M and Q8_0 is small enough that most real-world tasks will not reveal it, but Q8_0 uses nearly double the RAM. Only move to Q8_0 if you have 32+ GB RAM and a specific quality-sensitive workload.
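
The on-disk sizes quoted above follow from simple arithmetic: parameter count times average bits per weight, divided by 8. The sketch below reproduces them. The inputs are assumptions not stated in this guide: about 8.03 billion parameters for Llama 3.1 8B, 8.5 bits per weight for Q8_0 (8-bit values plus a per-block scale), and roughly 4.85 bits per weight for Q4_K_M, which is an approximate average because that format mixes block types.

```shell
# estimate_gb PARAMS_BILLION BITS_PER_WEIGHT -> approximate model size in GB
estimate_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Parameter count and bits-per-weight figures are approximations (see lead-in).
echo "FP16:   $(estimate_gb 8.03 16) GB"
echo "Q8_0:   $(estimate_gb 8.03 8.5) GB"
echo "Q4_K_M: $(estimate_gb 8.03 4.85) GB"
```

Resident memory during inference is higher than these figures because the context cache and runtime buffers sit on top of the weights.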

Configuring Ollama

Ollama is configured through environment variables. The correct place to set them on a systemd installation is a drop-in override file, not your shell profile. The shell profile only affects interactive sessions, not the background service process.

Key environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address for the REST API |
| OLLAMA_MODELS | /usr/share/ollama/.ollama/models | Directory where downloaded model files are stored |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent inference requests; keep at 1 for CPU-only to avoid RAM contention |
| OLLAMA_MAX_LOADED_MODELS | 3 for CPU inference | Models held in memory simultaneously; set to 1 on RAM-constrained CPU-only systems |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep a model loaded after the last request before unloading it |

CPU thread count is not among Ollama's documented environment variables. It is set per model or per request through the num_thread parameter; when unset, Ollama chooses a value from the cores it detects.

Applying configuration via systemd override

sudo mkdir -p /etc/systemd/system/ollama.service.d

sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
# Keep the model warm between requests (reduces reload latency)
Environment="OLLAMA_KEEP_ALIVE=10m"

# Restrict to localhost - remove or change for network access
Environment="OLLAMA_HOST=127.0.0.1:11434"

# One model in memory at a time - appropriate for CPU-only systems
Environment="OLLAMA_MAX_LOADED_MODELS=1"
EOF

Reload and restart to apply:

sudo systemctl daemon-reload
sudo systemctl restart ollama
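
It is worth confirming that systemd actually picked up the override. The sketch below reads the unit's effective environment with systemctl show and extracts a single variable. env_from_unit is an illustrative helper, and the parsing assumes values contain no spaces, which holds for the three variables set above.

```shell
# env_from_unit KEY "ENVIRONMENT_LINE" -> prints the value of KEY, or nothing if unset.
# ENVIRONMENT_LINE is the output of: systemctl show ollama --property=Environment
env_from_unit() {
    printf '%s\n' "$2" | sed 's/^Environment=//' | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Empty if systemd or the unit is not present on this machine.
unit_env=$(systemctl show ollama --property=Environment 2>/dev/null)

for key in OLLAMA_KEEP_ALIVE OLLAMA_HOST OLLAMA_MAX_LOADED_MODELS; do
    echo "$key=$(env_from_unit "$key" "$unit_env")"
done
```

An empty value after the equals sign means the variable is not set on the running unit, which usually points to a missing daemon-reload or a typo in the drop-in path.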

Tuning CPU thread count

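Ollama picks a thread count automatically from the cores it detects. On systems running other services alongside Ollama, leaving one or two cores free prevents the inference load from starving the OS and other processes. Thread count is a model parameter (num_thread) rather than a service environment variable, so set it in a Modelfile (see Creating a custom Modelfile) or in the options of an API request:

# Check how many cores are available
nproc

# Add to your Modelfile if you want to reserve cores for other workloads
# Replace 6 with nproc minus 2
PARAMETER num_thread 6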
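
The "nproc minus 2" rule can be computed instead of worked out by hand, with a floor so that small machines never end up with zero threads. This is a sketch; threads_to_use is an illustrative helper and the reserve of two cores is just the figure suggested above.

```shell
# threads_to_use TOTAL RESERVE -> TOTAL minus RESERVE, never below 1
threads_to_use() {
    ttu=$(( $1 - $2 ))
    if [ "$ttu" -lt 1 ]; then
        ttu=1
    fi
    echo "$ttu"
}

echo "Suggested thread count: $(threads_to_use "$(nproc)" 2)"
```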

Network exposure: the default bind address 127.0.0.1 means Ollama is not reachable from other machines on the network. If you need remote access, set OLLAMA_HOST=0.0.0.0:11434 and enforce access control at the firewall level with ufw. Ollama has no built-in authentication, so never expose port 11434 directly to the public internet.
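
A possible ufw rule set for that scenario is sketched below. The subnet 192.168.1.0/24 is a placeholder, not a value from this guide, so replace it with your own trusted range. The script defaults to a dry run that only prints the commands; set DRY_RUN=0 to apply them.

```shell
# Trusted subnet is a placeholder (assumption): replace with your own LAN range.
TRUSTED_SUBNET="${TRUSTED_SUBNET:-192.168.1.0/24}"
DRY_RUN="${DRY_RUN:-1}"

# run CMD... -> print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# ufw applies the first matching rule, so add the allow before the deny.
run sudo ufw allow from "$TRUSTED_SUBNET" to any port 11434 proto tcp
run sudo ufw deny 11434/tcp
```

Review the result with sudo ufw status numbered before relying on it.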

Creating a custom Modelfile

A Modelfile lets you bake a persistent system prompt and inference parameters into a named model variant. This is useful when you want consistent behaviour across sessions without repeating yourself in every prompt.

mkdir -p ~/ollama-models

cat > ~/ollama-models/Modelfile.llama31 <<'EOF'
# Base model
FROM llama3.1:8b

# System prompt applied to every conversation
SYSTEM """
You are a knowledgeable and precise technical assistant. Answer questions
directly and completely. When providing code, include comments that explain
non-obvious choices. When unsure, say so explicitly rather than guessing.
"""

# Inference parameters
PARAMETER temperature    0.7
PARAMETER top_p          0.9
PARAMETER num_ctx        8192
PARAMETER repeat_penalty 1.1
EOF

Build the custom variant:

ollama create llama31-custom -f ~/ollama-models/Modelfile.llama31
ollama list
NAME                     ID              SIZE      MODIFIED
llama31-custom:latest    a1b2c3d4e5f6    4.9 GB    X seconds ago
llama3.1:8b              b2c3d4e5f6a1    4.9 GB    Y minutes ago

Key Modelfile parameters

| Parameter | Typical range | Effect |
| --- | --- | --- |
| temperature | 0.0 – 1.0 | Higher values increase output variety; lower values make responses more deterministic. 0.7 is a solid default for general use. |
| top_p | 0.7 – 0.95 | Nucleus sampling threshold. Restricts token selection to the top probability mass. Lower values reduce randomness further. |
| num_ctx | 2048 – 131072 | Context window in tokens. Llama 3.1 supports up to 128K, but larger values consume proportionally more RAM. 8192 is a practical default for CPU-only setups. |
| repeat_penalty | 1.0 – 1.3 | Penalises the model for repeating tokens it has already produced. Values above 1.15 can harm coherence on long responses. |
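
The RAM cost of num_ctx can be estimated from the size of the key/value cache. The sketch below assumes the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; none of these figures come from this guide, and the result excludes the weights and runtime buffers.

```shell
# kv_cache_mib NUM_CTX -> approximate KV cache size in MiB for Llama 3.1 8B.
# Architecture constants and the 2-byte (FP16) cache entry are assumptions.
kv_cache_mib() {
    layers=32
    kv_heads=8
    head_dim=128
    bytes=2
    # The leading factor of 2 covers the separate key and value tensors per layer.
    echo $(( 2 * layers * kv_heads * head_dim * bytes * $1 / 1024 / 1024 ))
}

for ctx in 2048 8192 32768 131072; do
    echo "num_ctx=$ctx -> $(kv_cache_mib "$ctx") MiB"
done
```

Under these assumptions each token costs 128 KiB, so 8192 tokens add about 1 GiB on top of the weights, while the full 128K window would need about 16 GiB for the cache alone.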

Testing the installation

The following tests progress from basic connectivity through to functional validation. Run them in order.

Test 1 -> Service and API health

sudo systemctl is-active ollama
curl -fsS http://127.0.0.1:11434/api/version
active
{"version":"0.x.x"}

Test 2 -> Model is registered

ollama list

Expected:

llama3.1:8b appears in the output with a size around 4.9 GB.

Test 3 -> Single-shot inference

Send a non-trivial prompt that requires actual reasoning, not just pattern completion. This confirms the model loads correctly and produces coherent output:

ollama run llama3.1:8b "Explain the difference between a process and a thread in Linux. Be concise."

On a CPU-only system, expect the first token in 10–30 seconds as the model loads into RAM. Once loaded, tokens stream at roughly 5–15 per second depending on hardware. Any coherent response means the model is functional.
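
To put a number on generation speed, the REST API can be queried directly: a non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (nanoseconds). The sketch below derives tokens per second from those two fields. The helper names and the test prompt are illustrative, and the JSON is parsed with grep instead of jq to keep dependencies down.

```shell
# json_number KEY JSON -> prints the integer value of the first "KEY":<digits> pair.
# The leading quote keeps "eval_count" from matching "prompt_eval_count".
json_number() {
    printf '%s' "$2" | grep -o "\"$1\":[0-9]*" | head -n 1 | cut -d: -f2
}

# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS -> tokens per second, one decimal place
tokens_per_sec() {
    awk -v n="$1" -v ns="$2" 'BEGIN { if (ns > 0) printf "%.1f\n", n / (ns / 1e9); else print "n/a" }'
}

# measure_speed [MODEL] -> sends one short request and prints the generation rate.
# Returns 1 if the API call fails.
measure_speed() {
    ms_body=$(printf '{"model":"%s","prompt":"Say hello in one sentence.","stream":false}' "${1:-llama3.1:8b}")
    ms_json=$(curl -fsS http://127.0.0.1:11434/api/generate -d "$ms_body") || return 1
    tokens_per_sec "$(json_number eval_count "$ms_json")" "$(json_number eval_duration "$ms_json")"
}
```

Call measure_speed once to warm the model, then again for a figure that excludes load time. Running ollama run with the --verbose flag prints comparable timing statistics after each response.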

Monitor RAM during the first load: run watch -n1 free -h in a second terminal while inference runs. This confirms the model fits in RAM and the system is not swapping. If you see swap growing, close other applications before continuing.

Test 4 -> Multi-turn context retention

Open an interactive session and verify the model correctly tracks information across turns:

ollama run llama3.1:8b

>>> The server I am configuring runs Ubuntu 24.04 with 32 GB RAM.
Understood. A well-provisioned Ubuntu 24.04 server with 32 GB RAM gives you comfortable headroom for most workloads. What are you configuring?

>>> How much RAM does my server have?
Your server has 32 GB of RAM, as you mentioned.

The model correctly recalling the RAM figure from earlier in the conversation confirms multi-turn context is working. Exit with /bye or Ctrl+D.

Test 5 -> Custom variant

If you built the custom Modelfile in section 8, verify it runs and reflects the system prompt:

ollama run llama31-custom "Write a bash function that checks if a port is open on a remote host."

Expect clean, commented bash code reflecting the system prompt instruction to comment non-obvious choices.

Test 6 -> Process state

After any inference, check the model’s loaded state and resource usage:

ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    a1b2c3d4e5f6    6.0 GB    100% CPU     X minutes from now

PROCESSOR: 100% CPU confirms no GPU is in use. UNTIL shows when the model will be unloaded from RAM, based on the OLLAMA_KEEP_ALIVE value set in section 7. After this timer expires, the next request reloads the model from disk, which adds initial latency.

All tests passing? If the service is active, the model is listed, single-shot and interactive prompts both produce coherent output, and ollama ps shows the model loaded on CPU, the installation is complete and correctly configured.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Connection refused on port 11434 | Service not running | sudo systemctl start ollama then check journalctl -u ollama -n 30 |
| ollama command not found after install | /usr/local/bin is not on PATH in the current shell, or the install did not complete | Confirm the binary exists with ls -l /usr/local/bin/ollama, then open a new shell or add /usr/local/bin to PATH. Re-run the installer if the file is missing. |
| Model pull stalls mid-download | Network interruption | Re-run ollama pull llama3.1:8b – downloads resume from the last completed chunk |
| Inference runs but swap grows steadily | Insufficient free RAM for model + KV cache | Close other processes. Reduce context window in Modelfile: PARAMETER num_ctx 2048. Consider llama3.2:3b instead. |
| Token generation under 2 tokens/sec | Swap activity or AVX2 absent | Check free -h during inference. Check AVX2: grep -o 'avx2' /proc/cpuinfo \| head -1 |
| ollama create fails with model not found | Base model not yet pulled | Run ollama pull llama3.1:8b before ollama create |
| Configuration changes not taking effect | systemd not reloaded after override edit | sudo systemctl daemon-reload && sudo systemctl restart ollama |
| Model unloads between every request | OLLAMA_KEEP_ALIVE too short | Set Environment="OLLAMA_KEEP_ALIVE=10m" in the systemd override (section 7) |
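
For the two memory-related rows above, a quick numeric snapshot is often easier to compare between runs than the free -h display. The sketch below reads /proc/meminfo directly; the helper names are illustrative.

```shell
# swap_used_mib [MEMINFO_FILE] -> swap currently in use, in MiB
swap_used_mib() {
    awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 } END { printf "%d\n", (t - f) / 1024 }' "${1:-/proc/meminfo}"
}

# mem_available_mib [MEMINFO_FILE] -> memory available without swapping, in MiB
mem_available_mib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

echo "Swap in use:   $(swap_used_mib) MiB"
echo "Mem available: $(mem_available_mib) MiB"
```

Take one reading before sending a prompt and another while tokens are streaming. A swap figure that keeps climbing during generation points to the RAM shortfall described in the table.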

Diagnostic commands

# Stream service logs in real time
sudo journalctl -u ollama -f

# Check disk space available for model storage
df -h /usr/share/ollama

# Watch RAM usage while inference runs
watch -n1 'free -h'

# Confirm AVX2 instruction set is available
grep -o 'avx2' /proc/cpuinfo | head -1

# Show the installed service unit
cat /etc/systemd/system/ollama.service

# Remove a model to recover disk space
ollama rm llama3.1:8b

Ivan Dabić

Co-founder and CEO of BlueGrid.io, with a background in cloud infrastructure, distributed systems, monitoring, and security operations. He works closely with engineering teams to build and operate reliable systems while documenting both technical and organizational aspects of modern engineering work.

Ivan is a metalhead and a big fan of the cyberpunk movie genre. If you are his secret Santa, go with a Star Wars Lego box!
