Short definition
A KV cache is memory allocated during transformer inference to store the attention keys and values computed for every token seen so far in a conversation. It lets the model reuse those keys and values on each generation step instead of recomputing them for all previous tokens. Without it, every new token would require a full forward pass over the entire sequence so far.
Extended definition
Every time a transformer model generates a new token, it runs a self-attention operation that considers all previous tokens in the context window. Without caching, the attention keys and values for every past token would be recomputed from scratch on each step. The KV cache solves this by storing those computed keys and values in memory so they can be reused.
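The difference between the two approaches can be sketched with a single attention head. The example below is a minimal illustration, not a real inference loop: it uses plain Python lists so it runs without dependencies, and the helper names (`project`, `attend`, `decode_without_cache`, `decode_with_cache`) and the tiny random weights are assumptions made for the sketch. Both decoders produce the same outputs, but the cached version performs far fewer key and value projections.

```python
import math
import random

def project(x, W):
    """Multiply vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """Recompute every key and value from scratch at each step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, Wk) for x in tokens[:t]]
        values = [project(x, Wv) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(project(tokens[t - 1], Wq), keys, values))
    return outputs, projections

def decode_with_cache(tokens, Wq, Wk, Wv):
    """Project only the newest token and append the result to the cache."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        k_cache.append(project(x, Wk))
        v_cache.append(project(x, Wv))
        projections += 2
        outputs.append(attend(project(x, Wq), k_cache, v_cache))
    return outputs, projections

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
tokens = rand_matrix(6, 8)  # 6 token embeddings, hidden size 8 (toy values)
Wq, Wk, Wv = rand_matrix(8, 4), rand_matrix(8, 4), rand_matrix(8, 4)

slow_out, slow_proj = decode_without_cache(tokens, Wq, Wk, Wv)
fast_out, fast_proj = decode_with_cache(tokens, Wq, Wk, Wv)
print(slow_proj, fast_proj)  # 42 12
```

For six tokens the uncached loop performs 42 key/value projections and the cached loop 12. The gap widens as the sequence grows, because the uncached count grows with the square of the sequence length.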
The cache grows linearly with context length. For every additional token in the conversation, the system appends new key and value tensors to the existing cache. This makes the KV cache a primary source of RAM pressure, especially on CPU-only systems where memory bandwidth is limited and there is no GPU VRAM to fall back on.
For an 8B parameter model running with a 4096-token context window, the KV cache typically consumes between 512MB and 1GB of RAM, depending on the attention architecture and the numerical precision used. Doubling the context window to 8192 tokens roughly doubles that footprint. In long conversations or high-concurrency deployments, this growth is the most common cause of out-of-memory (OOM) kills.
Engineers integrating large language models into production systems need to account for KV cache size when planning memory budgets, setting context window limits, and designing session management strategies. Ignoring it leads to unpredictable crashes under load.
Deep technical explanation
How attention keys and values are structured
In a transformer model, each attention layer computes three tensors from the input: queries (Q), keys (K), and values (V). The query for the current token is compared against all previous keys to produce attention weights, which are then applied to the values to produce the output. The KV cache stores the K and V tensors for every token across every layer. Queries are not cached because they are always computed fresh for the current token.
The size of the KV cache per token is determined by the number of layers, the number of key-value heads, the head dimension, and the data type precision. For a model using 16-bit floats (float16 or bfloat16), the size in bytes per token is roughly 2 (for K and V) × number of layers × number of key-value heads × head dimension × 2 bytes per element. In standard multi-head attention, every query head has its own key-value head. Multi-query attention (MQA) and grouped-query attention (GQA) architectures reduce the cache by sharing K and V heads across multiple query heads, which is why models like Llama 3 and Mistral use GQA to lower KV cache memory requirements.
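The formula can be written as a small helper. This is a sketch under stated assumptions: the function name `kv_cache_bytes` is made up for this example, and the configuration (32 layers, 8 key-value heads, head dimension 128) is chosen to resemble a GQA-based 8B model rather than taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_element=2):
    """KV cache size: K and V tensors for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens

KIB = 1024
MIB = 1024 ** 2

# Assumed GQA configuration: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_cache_bytes(32, 8, 128, n_tokens=1)
at_4096 = kv_cache_bytes(32, 8, 128, n_tokens=4096)
at_8192 = kv_cache_bytes(32, 8, 128, n_tokens=8192)

# Same assumed shape without GQA: 32 KV heads, one per query head.
mha_4096 = kv_cache_bytes(32, 32, 128, n_tokens=4096)

print(per_token // KIB, "KiB per token")            # 128 KiB per token
print(at_4096 // MIB, "MiB at 4096 tokens")         # 512 MiB at 4096 tokens
print(at_8192 // MIB, "MiB at 8192 tokens")         # 1024 MiB at 8192 tokens
print(mha_4096 // MIB, "MiB at 4096 without GQA")   # 2048 MiB at 4096 without GQA
```

Under these assumptions, sharing each key-value head across four query heads cuts the cache to a quarter of the multi-head figure, and doubling the token count doubles the total.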
Memory growth and OOM failure modes
Because the KV cache grows with every generated token, the memory footprint of a single inference session increases throughout a conversation. In a stateless HTTP API, each request typically carries the full conversation history, which means the server rebuilds the KV cache from scratch on every call unless explicit caching or prefix caching is implemented.
Concurrent sessions multiply the problem. If ten users are each mid-way through a 4096-token conversation with an 8B model, the system needs to hold ten separate KV caches simultaneously. This is why memory planning for LLM serving must account for both per-session cache size and the expected number of concurrent sessions.
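A rough budget check along these lines can be sketched as follows. The function name `max_sessions` and every figure in the example (a 16 GiB host, 5 GiB of quantized weights, 2 GiB reserved for the operating system and other processes) are assumptions for illustration. The per-session sizes reuse the 512MB to 1GB range quoted earlier for an 8B model at 4096 tokens.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def max_sessions(total_bytes, weight_bytes, reserved_bytes, per_session_bytes):
    """Number of full-length sessions whose KV caches fit in the remaining memory."""
    usable = total_bytes - weight_bytes - reserved_bytes
    return max(0, usable // per_session_bytes)

# Assumed host: 16 GiB RAM, 5 GiB of model weights, 2 GiB kept free for everything else.
fits_at_512_mib = max_sessions(16 * GIB, 5 * GIB, 2 * GIB, 512 * MIB)
fits_at_1_gib = max_sessions(16 * GIB, 5 * GIB, 2 * GIB, 1 * GIB)

print(fits_at_512_mib)  # 18
print(fits_at_1_gib)    # 9
```

With these assumed numbers, ten concurrent full-context sessions fit when each cache is 512 MiB but not when each is 1 GiB, which is why the per-session figure has to be pinned down before setting a concurrency limit.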
OOM kills during long conversations follow a predictable pattern: the process runs fine during short exchanges, then crashes once the context exceeds the point where available RAM is exhausted. On Linux, the OOM killer terminates the process without warning. Proper monitoring of per-process memory consumption and setting explicit context length limits in the inference server configuration are the primary defenses.
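One way to monitor per-process memory on Linux is to read the resident set size from `/proc/<pid>/status`. The sketch below is a minimal example, not a complete monitoring setup: the function names and the 80% warning threshold are assumptions, and a production system would more likely export this value to an existing metrics pipeline.

```python
import os

def parse_vm_rss_kib(status_text):
    """Extract the resident set size in KiB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line format: "VmRSS:   123456 kB"
    return None

def read_rss_kib(pid):
    """Linux only: current RSS of a process in KiB, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vm_rss_kib(f.read())
    except OSError:
        return None

def over_budget(rss_kib, limit_kib, warn_fraction=0.8):
    """True once RSS reaches the warning fraction of the memory limit."""
    return rss_kib is not None and rss_kib >= warn_fraction * limit_kib

# Example: check this process against an assumed 12 GiB limit.
current = read_rss_kib(os.getpid())
if over_budget(current, limit_kib=12 * 1024 * 1024):
    print("warning: inference process is close to its memory limit")
```

Polling a check like this and alerting before the limit is reached gives operators a chance to shed sessions or shorten contexts before the kernel intervenes.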
Precision, quantization, and KV cache reduction
Quantizing model weights reduces the weight memory footprint but does not reduce KV cache size unless KV cache quantization is also applied. Some inference frameworks support this: for example, llama.cpp supports 8-bit and 4-bit cache types and vLLM supports an 8-bit FP8 cache. Relative to a 16-bit cache, an 8-bit format cuts the cache size roughly in half and a 4-bit format cuts it by roughly 75%, at the cost of some precision loss. This is a practical option for CPU-only deployments where RAM is the binding constraint.
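The trade-off can be illustrated with a simple symmetric 8-bit scheme applied to one key vector. This is a sketch of the general idea only; it is not the block format used by llama.cpp, vLLM, or any other framework, and the function names and the single per-vector scale are assumptions made for the example.

```python
import random

def quantize_int8(vector):
    """Symmetric 8-bit quantization: one float scale plus one integer code per element."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    codes = [round(x / scale) for x in vector]         # each code lies in [-127, 127]
    return scale, codes

def dequantize(scale, codes):
    """Recover approximate float values from the stored codes."""
    return [code * scale for code in codes]

random.seed(0)
key = [random.gauss(0, 1) for _ in range(128)]  # one key vector, head dimension 128

scale, codes = quantize_int8(key)
restored = dequantize(scale, codes)
max_error = max(abs(a - b) for a, b in zip(key, restored))

bytes_16bit = 2 * len(key)     # two bytes per element
bytes_8bit = 1 * len(key) + 2  # one byte per code plus a 16-bit scale
print(bytes_16bit, bytes_8bit)  # 256 130
```

Each restored element differs from the original by at most half of `scale`, which is the precision given up in exchange for storing one byte per element instead of two.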
Practical examples
Scenario 1: OOM crash in a self-hosted chatbot
A team deployed an 8B model on a 16GB RAM server with num_ctx set to 8192. Single-user sessions worked fine, but once two users had long simultaneous conversations, the process crashed. The root cause was two KV caches each approaching 2GB on top of the memory already held by the model weights. Reducing num_ctx to 4096 and enabling KV cache quantization brought combined KV cache usage under 2GB, resolving the crashes.
Scenario 2: Latency spikes from prefix cache misses
A document Q&A system prepended a 2000-token system prompt to every request. Without prefix caching, the model recomputed the KV cache for those 2000 tokens on every call. Enabling prefix caching in vLLM allowed the server to reuse the cached keys and values for the shared prefix, cutting time-to-first-token by 40% under load.
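The saving in this scenario can be illustrated with a toy prefix cache. This is a simplified sketch and not how vLLM implements prefix caching: the `PrefixCache` class, its methods, and the placeholder key/value tuples are all hypothetical, and the caller is assumed to know the length of the shared prefix.

```python
class PrefixCache:
    """Toy prefix cache: maps a token prefix to its already-computed KV entries."""

    def __init__(self):
        self._store = {}
        self.tokens_computed = 0

    def _compute_kv(self, token):
        # Stand-in for the real per-token key and value projections.
        self.tokens_computed += 1
        return ("k", token), ("v", token)

    def prefill(self, tokens, shared_prefix_len):
        """Return KV entries for all tokens, reusing the cached prefix when present."""
        prefix = tuple(tokens[:shared_prefix_len])
        if prefix not in self._store:
            self._store[prefix] = [self._compute_kv(t) for t in prefix]
        kv = list(self._store[prefix])
        kv.extend(self._compute_kv(t) for t in tokens[shared_prefix_len:])
        return kv

system_prompt = list(range(2000))             # 2000-token shared prefix
question_a = list(range(10_000, 10_050))      # 50 request-specific tokens
question_b = list(range(20_000, 20_050))

cache = PrefixCache()
kv_a = cache.prefill(system_prompt + question_a, shared_prefix_len=2000)
after_first = cache.tokens_computed
kv_b = cache.prefill(system_prompt + question_b, shared_prefix_len=2000)
after_second = cache.tokens_computed

print(after_first, after_second - after_first)  # 2050 50
```

The first request pays for all 2050 tokens, while the second only computes its 50 new tokens. That reduction in prefill work is where the time-to-first-token improvement comes from.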
Scenario 3: Capacity planning for concurrent users
An engineering team needed to support 50 concurrent users on a 70B model with a 4096-token context window. Calculating KV cache size per session at float16 precision showed each session would consume roughly 3.5GB. Supporting 50 sessions would require 175GB of memory just for KV caches, far exceeding available hardware. The team shifted to a GQA-based model variant, reducing per-session KV cache to under 1GB and making the deployment feasible on four 48GB GPU nodes.
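The effect of the switch to GQA can be approximated by scaling the cache size by the ratio of key-value heads to query heads. The scenario does not state the head counts, so the 64 query heads and 8 key-value heads below are assumptions chosen for illustration, as is the helper name `gqa_cache_gb`. The 3.5GB figure is assumed to describe a variant where every query head has its own key-value head.

```python
def gqa_cache_gb(mha_cache_gb, n_query_heads, n_kv_heads):
    """Scale a multi-head-attention cache size by the KV-to-query head ratio."""
    return mha_cache_gb * n_kv_heads / n_query_heads

sessions = 50
per_session_mha = 3.5  # GB per session, from the scenario

# Assumed head counts: 64 query heads sharing 8 key-value heads.
per_session_gqa = gqa_cache_gb(per_session_mha, n_query_heads=64, n_kv_heads=8)

total_mha = per_session_mha * sessions
total_gqa = per_session_gqa * sessions

print(per_session_gqa)       # 0.4375
print(total_mha, total_gqa)  # 175.0 21.875
```

Under these assumed head counts, the per-session cache drops below 1GB and the total for 50 sessions falls from 175GB to under 22GB, leaving most of the GPU memory for model weights.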
Scenario 4: Context window tuning for a coding assistant
A development team running a local coding assistant found that setting num_ctx to 16384 caused the inference process to consume 6GB of RAM before generation began on an 8B model. Profiling confirmed that the KV cache was the primary consumer. Reducing num_ctx to 8192 roughly halved the KV cache allocation and allowed other system processes to run without contention.
Why it matters
- In LLM inference deployments, the KV cache is often the binding RAM constraint; budgeting for model weights alone is not enough.
- Doubling the context window doubles the KV cache memory footprint, which directly affects how many concurrent sessions a server can support.
- OOM failures during long conversations are most often caused by KV cache growth, since the memory used by the model weights stays constant.
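- Prefix caching and KV cache quantization are concrete techniques for relieving these pressures: prefix caching avoids recomputing shared prompt tokens and lowers time-to-first-token, while KV cache quantization shrinks the per-session memory footprint.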
- Engineers must account for KV cache size per session multiplied by expected concurrency when sizing infrastructure for LLM workloads.
- Model architecture choices such as grouped-query attention have a direct and measurable impact on KV cache size, making architecture selection a capacity planning decision.