Most people meet LLMs through chat UIs. Underneath, inference is a systems problem: memory pressure, queueing, scheduling, and latency tradeoffs.
This guide goes bottom-up from naive generation to modern production serving.
Level 0: The Core Loop
Inference is simple to define:
- Given tokens, predict the next token.
- Append it.
- Repeat.
```
f(sequence) -> logits over vocabulary
f(token, cache) -> (logits, updated_cache)   # decode step
```

Naive autoregressive generation:
```python
def generate(prompt_ids, model, max_tokens=100):
    sequence = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model(sequence)  # full forward pass over the whole sequence
        next_logits = logits[-1]
        next_token = sample(next_logits)
        if next_token == EOS:
            break
        sequence.append(next_token)
    return sequence
```

The problem is obvious at scale: you keep reprocessing the same prefix.
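To quantify the waste, count how many token positions the model processes end to end. A back-of-the-envelope sketch (the 512-token prompt and 256 generated tokens are illustrative numbers):

```python
def naive_positions_processed(prompt_len, generated):
    # Step i re-runs the full forward pass over prompt_len + i tokens.
    return sum(prompt_len + i for i in range(generated))

def cached_positions_processed(prompt_len, generated):
    # With a KV cache (introduced below): one prefill over the prompt,
    # then one new token per decode step.
    return prompt_len + generated

print(naive_positions_processed(512, 256))   # 163712 positions
print(cached_positions_processed(512, 256))  # 768 positions
```

The naive loop does roughly 200x the work here, and the gap grows quadratically with sequence length.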
Level 1: KV Cache
The first real optimization is the KV cache.

Attention for token t needs:
- Q from the current token.
- Historical K, V from prior tokens.

Cache the historical K, V once; append new entries as decode advances.
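A minimal sketch of that bookkeeping, for a single layer and head (plain lists stand in for tensors; all names are illustrative):

```python
def prefill_cache(k_prompt, v_prompt):
    # Compute K, V for the whole prompt once and keep them.
    return {"K": list(k_prompt), "V": list(v_prompt)}

def decode_step(cache, k_new, v_new):
    # Append the new token's K, V row instead of recomputing the prefix.
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    return cache

d = 64  # head dimension (illustrative)
cache = prefill_cache([[0.0] * d] * 10, [[0.0] * d] * 10)
cache = decode_step(cache, [1.0] * d, [1.0] * d)
print(len(cache["K"]))  # 11 cached positions after one decode step
```

The point is the access pattern: prefill writes the cache in bulk, decode appends one row per step and reads everything before it.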
Prefill vs Decode
```
Prefill phase:
  Input : full prompt [t1..tP]
  Output: first next-token logits + KV cache for positions 1..P

Decode phase:
  Input : one token + existing KV cache
  Output: next-token logits + cache extended by one token
```

```python
def generate(prompt_ids, model, max_tokens=100):
    logits, cache = model.prefill(prompt_ids)
    next_token = sample(logits[-1])
    sequence = list(prompt_ids)
    for _ in range(max_tokens):
        if next_token == EOS:
            break
        sequence.append(next_token)
        # One-token step: reuse the cache instead of reprocessing the prefix.
        logits, cache = model.decode(next_token, cache)
        next_token = sample(logits)
    return sequence
```

KV cache shape (per layer):
```
K: [batch, num_heads, seq_len, head_dim]
V: [batch, num_heads, seq_len, head_dim]
```

Level 2: Batching
Single-request inference underutilizes GPU hardware.
Static batching improves utilization, but has a tail-latency issue:
- Short requests wait behind long requests.
- Batch turnover is gated by slowest item.
Level 3: Continuous Batching
Continuous batching fixes that by replacing finished requests immediately.
```
T0: [A prefill] [B prefill] [C prefill] [D prefill]
T1: [A decode ] [B decode ] [C decode ] [D decode ]
T2: [A EOS    ] [B decode ] [C decode ] [D decode ]
T3: [E prefill] [B decode ] [C decode ] [D decode ]   # A replaced by E
```

Outcome:
- Higher occupancy.
- Better throughput under mixed request lengths.
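The scheduling policy can be sketched as a toy simulator (illustrative: each request is just a remaining-decode-step counter, and a waiting queue backfills freed slots every step):

```python
from collections import deque

def run_continuous_batching(requests, max_batch=4):
    """requests: list of per-request decode lengths. Returns total steps."""
    waiting = deque(requests)
    running = []
    steps = 0
    while waiting or running:
        # Backfill freed slots immediately instead of waiting for the
        # whole batch to drain (the static-batching behavior).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# One long request no longer gates the short ones behind it:
print(run_continuous_batching([8, 2, 2, 2, 2], max_batch=4))  # 8 steps
```

Static batching on the same workload would run the first batch for 8 steps and then the leftover request for 2 more; continuous batching finishes in the long request's own 8 steps.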
Level 4: The Real Bottleneck Is Memory
KV cache dominates serving economics.
It scales with:
- batch size
- context length
- layer/head dimensions
Naive per-request contiguous allocation causes fragmentation and over-reservation.
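These factors multiply, which a sizing sketch makes concrete (the model dimensions below are illustrative, roughly 7B-class with fp16 KV entries):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    # 2 tensors (K and V), each [batch, kv_heads, seq_len, head_dim], per layer.
    return 2 * layers * batch * kv_heads * seq_len * head_dim * dtype_bytes

# Illustrative dims: 32 layers, 32 KV heads, head_dim 128, fp16.
gb = kv_cache_bytes(batch=16, seq_len=4096, layers=32,
                    kv_heads=32, head_dim=128, dtype_bytes=2) / 2**30
print(f"{gb:.0f} GiB")  # 32 GiB of KV cache alone
```

A modest batch at moderate context already rivals the model weights in memory, which is why KV layout, not FLOPs, sets the serving ceiling.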
Level 5: PagedAttention (vLLM)
PagedAttention treats KV cache like virtual memory.
Instead of contiguous blocks per request:
- Split cache into fixed-size blocks (pages).
- Allocate blocks on demand.
- Track logical-to-physical mapping via block tables.
```
Request A logical blocks: [0] [1] [2] -> physical [P3] [P7] [P1]
Request B logical blocks: [0] [1]     -> physical [P3] [P9]      (shared prefix: P3)
```

Benefits:
- lower fragmentation
- tighter memory-to-work ratio
- prefix sharing becomes practical
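The block-table mechanics can be sketched in a few lines (illustrative: a free list of physical pages plus a per-request logical-to-physical table; real allocators also refcount shared blocks):

```python
class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}  # request_id -> list of physical block ids

    def append_block(self, request_id):
        # Allocate one fixed-size page on demand; no contiguity required.
        block = self.free.pop()
        self.tables.setdefault(request_id, []).append(block)
        return block

    def release(self, request_id):
        # Return a finished request's pages to the pool immediately.
        self.free.extend(self.tables.pop(request_id, []))

pool = BlockTable(num_physical_blocks=8)
pool.append_block("A"); pool.append_block("A"); pool.append_block("B")
print(pool.tables)      # {'A': [7, 6], 'B': [5]}
pool.release("A")
print(len(pool.free))   # 7 pages free again
```

Because requests hold pages rather than contiguous ranges, a finished request's memory is reusable by any other request on the next step.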
Level 6: Speculative Decoding
Decode is memory-bound and token-serial. Speculative decoding adds parallel verification.
Pattern:
- A small draft model proposes k tokens.
- The target model verifies all k candidates in one forward pass.
- Accept the longest correct prefix.

```
Draft proposes : "The" "quick" "brown"
Target verifies:  ✓     ✓       ✗
Accept: "The quick", then resample from the rejection point
```

When draft quality is strong, this reduces the number of expensive target passes.
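The accept/reject step can be sketched with greedy verification (a simplification: production implementations use a probabilistic acceptance rule that preserves the target distribution; here the target just checks argmax agreement):

```python
def verify_greedy(draft_tokens, target_argmax):
    """draft_tokens: k proposed tokens.
    target_argmax: the target model's argmax at each of those k positions,
    all computed in a single parallel forward pass."""
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            # First mismatch: take the target's own token and stop.
            accepted.append(target)
            break
        accepted.append(draft)
    return accepted

print(verify_greedy(["The", "quick", "brown"], ["The", "quick", "fox"]))
# ['The', 'quick', 'fox'] -- two draft tokens accepted, one corrected
```

Every accepted token is one decode step the target model did not have to run serially.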
Level 7: Prefill-Decode Disaggregation
Prefill and decode have different hardware profiles:
```
Prefill: compute-heavy, parallel-friendly
Decode : memory-bandwidth-heavy, token-serial
```

Running both on one pool causes interference. Disaggregation separates them:

```
[Prefill cluster] -- KV/state handoff --> [Decode cluster]
```

This is the serving analogue of separating ingestion and query paths in data systems.
Level 8: Prefix Caching
Many workloads repeat long prefixes (system prompts, policy blocks, shared context).
Without prefix caching:
- Every request repays full prefill cost.
With prefix caching:
- Reuse cached prefix KV.
- Prefill only the suffix delta.
This is one of the highest-leverage optimizations in enterprise chat workloads.
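A sketch of the reuse path (illustrative: the cache is keyed by a hash of the prefix tokens, and a counter stands in for prefill cost; real systems match at KV-block granularity, as in radix/prefix caches):

```python
import hashlib

prefix_cache = {}   # hash of prefix tokens -> prefix KV cache
processed = []      # tokens actually run through prefill (for illustration)

def run_prefill(tokens, cache):
    # Stand-in for real prefill: a tuple plays the role of the KV cache.
    processed.append(len(tokens))
    return (cache or ()) + tuple(tokens)

def prefill_with_reuse(prompt_ids, shared_prefix_len):
    prefix = tuple(prompt_ids[:shared_prefix_len])
    key = hashlib.sha256(repr(prefix).encode()).hexdigest()
    if key not in prefix_cache:
        # Miss: pay full prefill for the prefix once, then remember its KV.
        prefix_cache[key] = run_prefill(list(prefix), None)
    # Hit (or after the miss): only the suffix delta pays prefill cost.
    return run_prefill(prompt_ids[shared_prefix_len:], prefix_cache[key])

system = list(range(100))                 # shared system prompt
prefill_with_reuse(system + [900], 100)   # first request: prefix miss
prefill_with_reuse(system + [901], 100)   # second request: prefix hit
print(processed)  # [100, 1, 1] -- second request prefilled only 1 token
```

With a 100-token shared prefix, the second request's prefill shrinks from 101 tokens to 1, which is exactly where the TTFT win comes from.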
Level 9: The Modern Inference Stack
```
  KV cache
+ continuous batching
+ paged KV memory management
+ speculative decoding
+ prefill/decode separation
+ prefix reuse
= production-grade serving
```

Metrics That Actually Matter
- TTFT (time to first token): mostly prefill and queueing.
- ITL (inter-token latency): decode-path quality.
- Throughput (tokens/s): aggregate system capacity.
- Goodput: useful output per unit compute (penalizes wasted work).
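Given per-token timestamps, the latency metrics reduce to simple arithmetic (a sketch; times are in seconds and the values are illustrative):

```python
def latency_metrics(request_arrival, token_times):
    """token_times: wall-clock time each output token was emitted."""
    ttft = token_times[0] - request_arrival
    # Inter-token latency: mean gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    return ttft, itl

ttft, itl = latency_metrics(0.0, [0.35, 0.40, 0.45, 0.50])
print(f"TTFT={ttft:.2f}s, ITL={itl * 1000:.0f}ms")  # TTFT=0.35s, ITL=50ms
```

Note that reporting means hides the tail; in practice you want p95/p99 of both, since that is what users feel.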
Practical Engineering Takeaways
- Early wins come from KV cache and batching.
- Real scale usually fails on memory layout, not math kernels.
- Tail behavior dominates user experience more than median throughput.
- Prefix reuse and queue policy often beat model-level micro-optimizations.
- Treat inference serving as a systems architecture problem, not a single-model benchmark.
Further Reading
- vLLM / PagedAttention paper.
- DistServe (prefill/decode disaggregation).
- SGLang (prefix/radix-style caching approaches).
- Speculative decoding work (Leviathan et al.).