Tokens Per Second

Short definition

Tokens per second (t/s) is the rate at which a language model produces output tokens during inference. It is the primary metric for perceived inference speed in interactive applications. A low t/s makes generation feel visibly slow; a high t/s makes responses feel instant.

Extended definition

When a user sends a prompt to a language model, the model processes that input and generates a response one token at a time. Tokens per second measures how quickly that generation happens. As rough guides for interactive sessions such as chat interfaces or code completion tools: a sustained rate below approximately 3 t/s creates a noticeable lag that degrades user experience, rates between 5 and 20 t/s feel acceptable, and above 30 t/s generation typically outpaces reading speed.
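As a minimal sketch of how the metric is computed, the function below counts tokens from any iterable stream and divides by elapsed wall time. The name `measure_tokens_per_second` and the injectable `clock` parameter are illustrative assumptions, not part of any particular inference library.

```python
import time
from typing import Callable, Iterable


def measure_tokens_per_second(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Consume a token stream and return tokens received per second.

    The clock is read once before iteration and once after, so the
    result includes any wait before the first token arrives.
    """
    start = clock()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = clock() - start
    if count == 0 or elapsed <= 0:
        return 0.0
    return count / elapsed
```

Because the timer starts before the first token arrives, this figure blends prompt processing with generation. It is a reasonable end-to-end number for short prompts, but it understates streaming speed when the prompt is long.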

Tokens per second is affected by model size, quantization level, hardware capability, and context length. A smaller model in a lower-bit quantization format runs faster but may produce lower-quality output. The same model on a GPU with high memory bandwidth will outperform CPU-only inference by a significant margin. Engineers use t/s as a practical benchmark when selecting hardware or deployment configurations for AI features.

In batch inference scenarios, where many prompts are processed simultaneously rather than one at a time, throughput is measured differently. A system might process fewer tokens per second per request but handle more total requests in parallel. Understanding whether a use case demands low latency per user or high aggregate throughput shapes the deployment strategy.
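The difference between per-request and aggregate throughput can be sketched with simple arithmetic. The example assumes each decode step emits one token for every request in the batch. `BatchRun` and the step times used below are hypothetical values chosen for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    batch_size: int      # requests decoded together in one step
    step_seconds: float  # wall time for one decode step over the whole batch


def per_request_tps(run: BatchRun) -> float:
    """Tokens per second seen by a single request in the batch."""
    return 1.0 / run.step_seconds


def aggregate_tps(run: BatchRun) -> float:
    """Tokens per second produced across all requests in the batch."""
    return run.batch_size / run.step_seconds


if __name__ == "__main__":
    # Hypothetical step times: batching slows each step but emits more tokens.
    for run in (BatchRun(1, 0.04), BatchRun(16, 0.10)):
        print(run.batch_size, per_request_tps(run), aggregate_tps(run))
```

With these assumed numbers, the batched run gives each user a slower stream while the server as a whole produces several times more tokens per second, which is the tradeoff described above.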

Deep technical explanation

Prefill vs. decode phases

Inference in transformer models happens in two distinct phases. During prefill, the model processes the entire input prompt in parallel, computing attention scores across all input tokens at once. This phase is computationally intensive but fast relative to the number of tokens it handles, because the whole prompt goes through in a single forward pass. Prefill time determines how long the user waits for the first token. During decode, the model generates output tokens one at a time, each requiring a full forward pass. Tokens per second as a user-facing metric primarily reflects decode throughput, not prefill speed.
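To separate the two phases when measuring, one approach is to record the arrival time of each streamed token and treat the wait for the first token as prefill and the remaining intervals as decode. The sketch below assumes timestamps in seconds from a single monotonic clock. `split_latency` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def split_latency(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, float]:
    """Return (time_to_first_token, decode_tokens_per_second).

    token_times holds the arrival time of each output token, in order.
    Decode rate is computed over the intervals after the first token,
    so prefill time does not dilute it.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    decode_span = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_span <= 0:
        return ttft, 0.0
    decode_tps = (len(token_times) - 1) / decode_span
    return ttft, decode_tps
```

Over a network the first interval also includes queuing and transport delay, so time to first token is an upper bound on prefill time rather than an exact measurement of it.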

KV cache and memory bandwidth as the bottleneck

Each decode step requires the model to attend over all previous tokens, both the prompt and everything generated so far. The keys and values for those tokens are stored in a key-value (KV) cache that grows with every new token. Decode is usually limited by memory bandwidth rather than raw compute, because each step must read the model weights and the whole KV cache to produce a single token. As context length increases, the KV cache consumes more memory and adds to the data read on each step, so t/s falls as context grows. A CPU or GPU that has excellent floating-point throughput but limited memory bandwidth will see t/s degrade sharply at large context lengths.
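The bandwidth argument can be made concrete with a back-of-the-envelope estimate. The sketch below sizes the KV cache from the model shape and then divides memory bandwidth by the bytes read per decode step. It gives a ceiling, not a prediction: it ignores compute time, CPU/GPU cache effects, and batching. The model shape in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache) and the weight size and bandwidth figures are illustrative assumptions.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,
) -> int:
    """Size of the KV cache for one sequence.

    The leading 2 accounts for storing one key and one value vector
    per layer, per KV head, per token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def decode_tps_ceiling(
    weight_bytes: float, kv_bytes: float, bandwidth_bytes_per_s: float
) -> float:
    """Upper bound on decode t/s if every step reads all weights and the KV cache."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    weights = 4 * 1024**3   # assumed: about 4 GiB of quantized weights
    bandwidth = 64e9        # assumed: 64 GB/s of memory bandwidth
    for ctx in (1, 8192, 32768):
        kv = kv_cache_bytes(32, 8, 128, ctx)
        print(ctx, kv, round(decode_tps_ceiling(weights, kv, bandwidth), 1))
```

Under these assumptions the cache costs 128 KiB per token and reaches 4 GiB at 32K tokens, the same size as the weights, so the ceiling at full context is half of what it is at the start of generation.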

CPU inference and instruction sets

On CPU-only inference using a quantized 8B parameter model in Q4_K_M format, typical throughput is 5 to 15 t/s depending on CPU generation, available thread count, and whether the CPU supports AVX2 or AVX-512 vector instructions. AVX2 lets the CPU operate on 256-bit vectors, so more values are processed per instruction. Without it, inference frameworks fall back to slower code paths that use narrower vector instructions or scalar arithmetic. Thread count matters because matrix operations during decode can be parallelized across physical cores, up to the point where memory bandwidth becomes saturated.
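Before benchmarking, it can help to confirm which vector extensions a machine reports. The sketch below reads the `flags` line from `/proc/cpuinfo`, so it assumes Linux on x86; other platforms expose this information differently and the function simply reports no flags there. `parse_cpu_flags` and `describe_cpu` are hypothetical helper names.

```python
import os
from pathlib import Path


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def describe_cpu(cpuinfo_path: str = "/proc/cpuinfo") -> dict:
    """Summarize vector support and CPU count for the current machine."""
    path = Path(cpuinfo_path)
    flags = parse_cpu_flags(path.read_text()) if path.exists() else set()
    return {
        # os.cpu_count() counts logical CPUs, which may exceed physical cores.
        "logical_cpus": os.cpu_count(),
        "avx2": "avx2" in flags,
        # avx512f is the AVX-512 foundation flag.
        "avx512f": "avx512f" in flags,
    }


if __name__ == "__main__":
    print(describe_cpu())
```

Since the logical CPU count can be double the physical core count on machines with simultaneous multithreading, it is worth treating it as a starting point for thread tuning rather than the value to use.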

Quantization and its effect on throughput

Quantization converts model weights from 16-bit or 32-bit floating-point values to lower-precision integers (4-bit, 5-bit, 8-bit). This shrinks the model’s memory footprint and reduces the amount of data that must be read from memory for each generated token. Q4_K_M is a common format that applies mixed-precision quantization, preserving accuracy on critical weight groups while aggressively compressing others. Lower-bit formats generally increase t/s at the cost of some output quality. Engineers must evaluate the quality-speed tradeoff specific to their application.
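The footprint effect is simple arithmetic, sketched below. The bits-per-weight values are nominal assumptions: real formats such as Q4_K_M also store scaling metadata and keep some tensors at higher precision, so their effective bits per weight is somewhat above the nominal figure and should be taken from the actual file size.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in bytes for a given precision."""
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 8e9  # an 8B parameter model
    for bits in (16, 8, 4):
        gigabytes = weight_bytes(n_params, bits) / 1e9
        print(f"{bits}-bit: {gigabytes:.0f} GB")
```

Because decode has to read the weights on every step, a quarter of the bytes means roughly a quarter of the memory traffic per token, which is where most of the speedup from quantization comes from on bandwidth-limited hardware.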

Common failure modes

Running a model that exceeds available RAM forces the system to use swap space, which collapses t/s to near zero. Thermal throttling on consumer hardware reduces clock speeds under sustained load and can cause t/s to drop by 30 to 50 percent after the first few minutes. Misconfigured thread counts, particularly oversubscribing cores on a machine running other workloads, can yield lower t/s than a smaller but correctly tuned thread count. Monitoring t/s over time rather than only at startup reveals these degradation patterns.
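One way to monitor t/s over time is to compare a rolling average of recent samples against a baseline taken at startup. The sketch below is a minimal version of that idea. `ThroughputMonitor`, its window sizes, and the 30 percent drop threshold are illustrative assumptions that would need tuning for a real deployment.

```python
from collections import deque
from statistics import mean


class ThroughputMonitor:
    """Flags sustained drops in t/s relative to an initial baseline."""

    def __init__(self, baseline_samples: int = 5, window: int = 5,
                 drop_threshold: float = 0.3) -> None:
        self.baseline_samples = baseline_samples
        self.drop_threshold = drop_threshold
        self._baseline = []
        self._recent = deque(maxlen=window)

    def record(self, tps: float) -> bool:
        """Add one t/s sample; return True if recent throughput has degraded."""
        # The first samples only build the baseline.
        if len(self._baseline) < self.baseline_samples:
            self._baseline.append(tps)
            return False
        self._recent.append(tps)
        # Wait for a full window so one slow sample does not trigger an alert.
        if len(self._recent) < self._recent.maxlen:
            return False
        floor = mean(self._baseline) * (1 - self.drop_threshold)
        return mean(self._recent) < floor
```

A sample could be recorded once per completed request or once per fixed time interval. Averaging over a window keeps a single slow response from raising an alert while still catching sustained drops of the kind caused by throttling or swapping.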

Practical examples

Example 1: Code completion tool on developer hardware

A team wanted to run a local code completion model on developer laptops without a GPU. They selected a 7B model quantized to Q4_K_M and benchmarked on a recent AMD CPU with AVX2 support. They achieved 11 t/s, which felt responsive enough for inline suggestions. Developers with older CPUs lacking AVX2 saw 4 t/s, which caused visible lag and required them to use a shared GPU server instead.

Example 2: Chatbot for a customer-facing product

A product team embedded a customer support chatbot powered by a hosted LLM. At peak load, concurrent sessions grew and the shared inference server dropped from 25 t/s to 6 t/s per user. Users began abandoning sessions. The team added a second inference instance and implemented request queuing, restoring per-session t/s above 15 and reducing session abandonment significantly.

Example 3: Long-context summarization pipeline

An analytics team ran a pipeline that summarized long documents using a model with a 32K token context window. They noticed throughput dropped from 18 t/s at the start of generation to under 5 t/s near the end of long documents. Profiling confirmed KV cache memory reads were the bottleneck. Switching to a GPU with higher memory bandwidth resolved the degradation at long contexts.

Why it matters

  • Tokens per second directly determines whether an AI feature feels responsive or frustrating to users in interactive applications.
  • It is the key metric for hardware selection when deploying local or self-hosted inference, affecting both cost and user experience.
  • Throughput degrades as context grows due to KV cache memory pressure, meaning engineers must benchmark at realistic context lengths, not just at generation start.
  • Quantization format choices directly trade off t/s against output quality, making format selection a concrete engineering decision rather than a theoretical one.
  • Failure to monitor t/s over time masks performance degradation from thermal throttling, swap usage, or memory fragmentation in long-running inference servers.
  • Understanding the prefill versus decode split helps teams correctly attribute latency: a slow time-to-first-token points to prefill, while slow streaming points to decode throughput.
