Context Window

Short definition

A context window is the maximum number of tokens a language model can hold in working memory during a single inference session, covering both the input prompt and all generated output. Once the limit is reached, earlier content must be truncated or the request rejected, so the model can no longer reference anything that falls outside the window. The size of the context window therefore bounds how much information a model can reason over at one time.

Extended definition

Every transformer-based language model processes text as a sequence of tokens. The context window defines the upper bound on how many tokens that sequence can contain before the oldest tokens are dropped or the request is rejected outright. This boundary applies to the combined total of your system prompt, user input, retrieved documents, conversation history, and the model’s own generated response.

In practice, the context window is the primary constraint governing what an LLM can do in a single call. A model with a 4096-token context window cannot summarize a 10,000-word document in one pass, because the document alone runs to well over 10,000 tokens. A coding assistant with an 8192-token limit may lose the beginning of a long file before it finishes generating a refactor. Retrieval-augmented generation (RAG) pipelines exist largely to work around context window limits by injecting only the most relevant chunks.

Models like Llama 3.1 8B support up to 128K tokens, which is enough to hold a small codebase or a lengthy legal document. The supported length is a property of the model itself, not of the weight precision. However, on memory-constrained hardware, even with quantized weights such as Q4_K_M, teams routinely set the practical context to between 2048 and 8192 tokens to avoid out-of-memory kills. Choosing the right context size is an ongoing trade-off between capability and available RAM, not a one-time configuration decision.

In Ollama, the context window is controlled by the num_ctx parameter, set either in a Modelfile or passed at runtime. This parameter directly governs how much KV cache the runtime allocates, which determines peak memory consumption for the model process.
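
As a rough sketch of the two places num_ctx can be set, the snippet below renders a Modelfile and builds a request body for Ollama's /api/generate endpoint, where per-request settings are passed under the options key. The helper names, the model tag llama3.1:8b, and the value 8192 are illustrative assumptions, and nothing is sent over the network.

```python
import json


def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that pins the context window for a derived model."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def build_generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a JSON-serializable body for POST /api/generate.

    Runtime parameters such as num_ctx go under the "options" key.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


if __name__ == "__main__":
    # Hypothetical model tag and context size, for illustration only.
    print(render_modelfile("llama3.1:8b", 8192))
    print(json.dumps(build_generate_request("llama3.1:8b", "Hello", 8192), indent=2))
```

The Modelfile route fixes the value for every request to the derived model, while the options route lets an application pick a context size per call.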

Deep technical explanation

How the KV cache scales with context length

Transformer models use a key-value (KV) cache to store intermediate attention states for every token in the current sequence. The memory required for this cache scales linearly with context length, with the number of layers, and with the number of key-value heads per layer (which is smaller than the number of query heads in models that use grouped query attention). A model with 32 layers processing 8192 tokens needs four times the KV cache of the same model processing 2048 tokens, even with identical weights. Doubling num_ctx therefore roughly doubles the cache. Total peak RAM grows by less than a factor of two because the weights stay fixed, although runtime compute buffers that also grow with context can add to the increase.
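
The linear scaling can be made concrete with the usual back-of-the-envelope estimate: two tensors (keys and values) per layer, each holding one vector per key-value head per token. The sketch below assumes Llama 3.1 8B's published shape (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache. The function name is hypothetical, and real runtimes add compute buffers on top of this figure.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int = 32,       # assumed: Llama 3.1 8B
    n_kv_heads: int = 8,      # assumed: grouped query attention, 8 KV heads
    head_dim: int = 128,      # assumed: 4096 hidden size / 32 query heads
    bytes_per_elem: int = 2,  # assumed: 16-bit cache entries
) -> int:
    """Estimate KV cache size in bytes for a context of n_ctx tokens."""
    # Factor of 2 covers the separate key and value tensors.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


if __name__ == "__main__":
    for n_ctx in (2048, 8192, 131072):
        mib = kv_cache_bytes(n_ctx) / 2**20
        print(f"num_ctx={n_ctx}: {mib:.0f} MiB")
```

Under these assumptions the cache costs 128 KiB per token, which works out to about 256 MiB at 2048 tokens and 1 GiB at 8192 tokens.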

Quantization and its effect on practical context limits

Quantization reduces the size of model weights but does not shrink the KV cache, which is stored at its own precision (typically 16-bit unless the cache is quantized separately). A Q4_K_M quantization of Llama 3.1 8B compresses the weights to roughly 4-5 GB, making the model loadable on an 8 GB GPU. However, the KV cache for a 128K-token context at this model size would still require roughly 16 GB at 16-bit precision, about double the card's total memory. This mismatch means that on constrained hardware, teams typically set num_ctx to between 2048 and 8192 to keep total memory within the physical limit.
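
One way to turn this into a sizing rule is to subtract the weights and a fixed overhead from the memory budget and divide what remains by the per-token cache cost. The sketch below is only an estimate: the 5 GiB weight size, 2 GiB overhead, and 128 KiB-per-token cache cost are assumed figures, and the function name is hypothetical.

```python
GIB = 1024**3


def max_num_ctx(
    total_bytes: int,
    weights_bytes: int,
    overhead_bytes: int,
    kv_bytes_per_token: int,
    step: int = 1024,
) -> int:
    """Largest context (rounded down to a multiple of step) that fits in memory."""
    free = total_bytes - weights_bytes - overhead_bytes
    if free <= 0:
        return 0  # the weights and overhead alone do not fit
    tokens = free // kv_bytes_per_token
    return (tokens // step) * step


if __name__ == "__main__":
    # Assumed figures: 8 GiB device, 5 GiB quantized weights,
    # 2 GiB for runtime buffers and other processes, 128 KiB of cache per token.
    print(max_num_ctx(8 * GIB, 5 * GIB, 2 * GIB, 128 * 1024))
```

With these assumed inputs the estimate lands at 8192 tokens, the top of the range quoted above. Measuring actual peak memory under load is still the safer guide.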

Out-of-memory failures and how they present

When the memory needed for the weights and KV cache exceeds available RAM, the operating system's OOM killer terminates the process, on Linux by sending SIGKILL. Because the process has no chance to log an error, this failure mode is often mistaken for a model crash or a runtime bug. In containerized environments, the container is killed and restarted, which can cause silent data loss in stateful workloads. Monitoring tools like Grafana and system-level OOM logs are essential for diagnosing these failures.
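
A supervising script can at least flag exit statuses that are consistent with an OOM kill. The sketch below relies on two POSIX conventions: Python's subprocess module reports death by signal N as a return code of -N, and shells and container runtimes report it as 128 + N, so SIGKILL appears as -9 or 137. The function name is hypothetical, and a match is only a hint, since SIGKILL can also come from an operator or an orchestrator. The kernel log is what confirms an OOM kill.

```python
SIGKILL = 9  # POSIX signal number for SIGKILL on Linux


def looks_like_oom_kill(returncode: int) -> bool:
    """Return True when an exit status is consistent with death by SIGKILL.

    -9  : how Python's subprocess reports a child killed by signal 9
    137 : how shells and container runtimes report it (128 + 9)
    """
    return returncode in (-SIGKILL, 128 + SIGKILL)


if __name__ == "__main__":
    for code in (0, 1, -9, 137):
        print(code, looks_like_oom_kill(code))
```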

Context window in multi-turn and agentic workloads

In multi-turn chat applications, conversation history accumulates with every exchange. Without truncation or summarization logic, the context fills up within a predictable number of turns. Agentic systems face a harder version of this problem: tool call results, retrieved documents, and intermediate reasoning steps all compete for the same token budget. Production systems must implement explicit context management strategies, such as sliding window truncation, summary compression, or selective retrieval, to stay within limits without degrading response quality.
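
Sliding window truncation, the simplest of these strategies, might look like the following sketch. It assumes the first message is the system prompt and must always be kept, and it uses a crude four-characters-per-token heuristic as a stand-in for the model's real tokenizer. Both function names are hypothetical.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def truncate_history(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[str], int] = estimate_tokens,
) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit in max_tokens."""
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    used = count(system["content"])
    kept = []
    # Walk backwards from the newest message and stop at the first one that
    # would overflow the budget, so the kept history stays contiguous.
    for msg in reversed(rest):
        cost = count(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return [system] + kept[::-1]
```

Dropping whole messages from the oldest end keeps the remaining history coherent, at the cost of forgetting early turns entirely. Summary compression trades extra model calls for retaining some of that information.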

Attention complexity and long-context latency

Self-attention in transformers has quadratic compute complexity relative to sequence length in the naive implementation, so longer context windows increase both memory use and inference latency. Grouped query attention (GQA), which Llama 3.1 uses, shrinks the KV cache by sharing key-value heads across groups of query heads, while sparse attention variants reduce the number of token pairs that are scored. Even so, inference at 32K tokens is noticeably slower than at 4K tokens on the same hardware.
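
To give a feel for the quadratic term, the sketch below counts the entries in a full attention score matrix for one head and compares two context lengths. It is an upper-bound illustration for the attention term alone: optimized kernels avoid materializing the matrix, and the rest of the model scales linearly, so measured slowdowns are usually smaller than this ratio. The function names are hypothetical.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in the full n x n attention score matrix for one head."""
    return n_tokens * n_tokens


def relative_attention_cost(long_ctx: int, short_ctx: int) -> float:
    """Ratio of naive attention work between two context lengths."""
    return attention_score_entries(long_ctx) / attention_score_entries(short_ctx)


if __name__ == "__main__":
    # 32K tokens is 8x longer than 4K, so the naive attention term is 8^2 = 64x larger.
    print(relative_attention_cost(32768, 4096))
```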

Practical examples

Code review assistant on a developer laptop

A developer runs a local Ollama instance with Llama 3.1 8B Q4_K_M to review pull requests. The laptop has 16 GB unified memory. Setting num_ctx to 8192 leaves enough headroom for the model weights and OS overhead. The assistant handles files up to roughly 600 lines of code per request without triggering OOM errors.

Document summarization pipeline hitting context limits

A legal tech team builds a pipeline to summarize 50-page contracts. A single-pass approach with a 4096-token context fails silently: the input is truncated and the summary covers only the opening pages. They switch to a chunked map-reduce approach: splitting documents into 3000-token segments, summarizing each independently, then combining the summaries. Response quality improves because every page now reaches the model, and memory use stays bounded because no single call exceeds the context limit.
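
The map-reduce pattern described here can be sketched as follows. The summarize callable is a hypothetical stand-in for whatever model call the pipeline makes, and whitespace-separated words stand in for model tokens, so the chunk size is only approximate.

```python
from typing import Callable


def split_into_chunks(tokens: list[str], chunk_size: int) -> list[list[str]]:
    """Split a token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 3000,
) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    tokens = text.split()  # assumption: words approximate model tokens
    if not tokens:
        return ""
    chunks = split_into_chunks(tokens, chunk_size)
    partials = [summarize(" ".join(chunk)) for chunk in chunks]  # map step
    if len(partials) == 1:
        return partials[0]  # short document: no reduce step needed
    return summarize("\n".join(partials))  # reduce step
```

The reduce step assumes the combined partial summaries themselves fit in the context window. For very long documents the reduce step may need to be applied recursively.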

RAG system with context budget management

A security analytics team builds a RAG system over threat intelligence reports. Each query retrieves multiple document chunks. Without token counting, the combined prompt occasionally exceeds the 8192-token limit. The team adds a pre-flight token counter that trims retrieved chunks to fit within 6000 tokens, reserving space for the system prompt and generated response.
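
A pre-flight check of this kind could be as small as the sketch below. It assumes the retriever returns chunks already ordered by relevance and takes the token counter as a parameter, since the right tokenizer depends on the model. The function name and the skip-and-continue policy are illustrative choices, not the team's actual implementation.

```python
from typing import Callable


def fit_chunks(
    chunks: list[str],
    budget: int,
    count: Callable[[str], int],
) -> list[str]:
    """Keep chunks in ranked order while their total stays within budget.

    A chunk that would overflow the budget is skipped rather than ending the
    loop, so a smaller, lower-ranked chunk can still use the remaining space.
    """
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count(chunk)
        if used + cost > budget:
            continue
        kept.append(chunk)
        used += cost
    return kept
```

For the scenario above, the call would pass a budget of 6000 and a counter backed by the model's tokenizer, leaving the rest of the 8192-token window for the system prompt and the response.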

Agentic pipeline with tool call accumulation

An AI agent running in an autonomous pipeline makes sequential tool calls to query APIs and parse results. After eight tool calls, the accumulated history fills the 8192-token context and the agent starts dropping earlier reasoning steps. The team implements a rolling summary that compresses older turns, keeping the active context under 6000 tokens throughout the run.
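
A rolling summary along these lines might be sketched as below. The summarize and count callables are hypothetical stand-ins for a model call and a tokenizer, and the "[summary]" prefix is an arbitrary marker. The function performs one compaction pass and does not by itself guarantee the result fits the budget, so a caller would re-check after compacting.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[str], str],
    count: Callable[[str], int],
) -> list[str]:
    """Fold older turns into one summary turn once the history exceeds max_tokens."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count(turn) for turn in turns)
    if total <= max_tokens or len(turns) <= keep_recent:
        return list(turns)  # nothing to compress yet
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize("\n".join(older))
    # The most recent turns are kept verbatim so the agent retains exact tool output.
    return ["[summary] " + summary] + list(recent)
```

Running this after every tool call, with max_tokens set to the 6000-token target, keeps recent results intact while older reasoning survives only in compressed form.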

Why it matters

Context window size directly determines what tasks a model can complete in a single inference call, which sets the boundary for what your application can do without chunking or retrieval.
Mismatched context configuration is one of the most common causes of silent failures in LLM applications, where truncated input produces low-quality output without any visible error.
KV cache memory scales with context length, so hardware planning and context window sizing must be done together, not independently.
Agentic and multi-turn systems require explicit context management code because the window fills up predictably as conversations and tool histories grow.
Exceeding available RAM causes an OOM kill that terminates the process, which in production environments means request failures or container restarts that are hard to diagnose without proper monitoring.
Choosing the optimal context size is a runtime engineering decision that balances task requirements, hardware constraints, and inference latency, not a default to accept from a model card.