Most people meet LLMs through chat UIs. Underneath, inference is a systems problem: memory pressure, queueing, scheduling, and latency tradeoffs.
This guide goes bottom-up from naive generation to modern production serving.
Level 0: The Core Loop
Inference is simple to define:
- Given tokens, predict the next token.
- Append it.
- Repeat.
```
f(sequence) -> logits over vocabulary
f(token, cache) -> (logits, updated_cache)   # decode step
```

Naive autoregressive generation:
```python
def generate(prompt_ids, model, max_tokens=100):
    sequence = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model(sequence)  # full forward pass over the whole sequence
        next_logits = logits[-1]
        next_token = sample(next_logits)
        if next_token == EOS:
            break
        sequence.append(next_token)
    return sequence
```

The problem is obvious at scale: you keep reprocessing the same prefix.
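To quantify the waste, count how many token positions the model processes end to end. A back-of-the-envelope sketch (the 512-token prompt and 256 generated tokens are illustrative numbers):

```python
def naive_positions_processed(prompt_len, generated):
    # Step i re-runs the full forward pass over prompt_len + i tokens.
    return sum(prompt_len + i for i in range(generated))

def cached_positions_processed(prompt_len, generated):
    # With a KV cache (introduced below): one prefill over the prompt,
    # then one new token per decode step.
    return prompt_len + generated

print(naive_positions_processed(512, 256))   # 163712 positions
print(cached_positions_processed(512, 256))  # 768 positions
```

The naive loop does roughly 200x the work here, and the gap grows quadratically with sequence length.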
Level 1: KV Cache
The first real optimization is the KV cache.

Attention for token t needs:
- Q from the current token.
- Historical K, V from prior tokens.

Cache the historical K, V once; append new entries as decode advances.
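A minimal sketch of that bookkeeping, for a single layer and head (plain lists stand in for tensors; all names are illustrative):

```python
def prefill_cache(k_prompt, v_prompt):
    # Compute K, V for the whole prompt once and keep them.
    return {"K": list(k_prompt), "V": list(v_prompt)}

def decode_step(cache, k_new, v_new):
    # Append the new token's K, V row instead of recomputing the prefix.
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    return cache

d = 64  # head dimension (illustrative)
cache = prefill_cache([[0.0] * d] * 10, [[0.0] * d] * 10)
cache = decode_step(cache, [1.0] * d, [1.0] * d)
print(len(cache["K"]))  # 11 cached positions after one decode step
```

The point is the access pattern: prefill writes the cache in bulk, decode appends one row per step and reads everything before it.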
Prefill vs Decode
```
Prefill phase:
  Input : full prompt [t1..tP]
  Output: first next-token logits + KV cache for positions 1..P

Decode phase:
  Input : one token + existing KV cache
  Output: next-token logits + cache extended by one token
```

```python
def generate(prompt_ids, model, max_tokens=100):
    logits, cache = model.prefill(prompt_ids)
    next_token = sample(logits[-1])
    sequence = list(prompt_ids)
    for _ in range(max_tokens):
        if next_token == EOS:
            break
        sequence.append(next_token)
        # One-token step: reuse the cache instead of reprocessing the prefix.
        logits, cache = model.decode(next_token, cache)
        next_token = sample(logits)
    return sequence
```

KV cache shape (per layer):
```
K: [batch, num_heads, seq_len, head_dim]
V: [batch, num_heads, seq_len, head_dim]
```

Level 2: Batching
Single-request inference underutilizes GPU hardware.
Static batching improves utilization, but has a tail-latency issue:
- Short requests wait behind long requests.
- Batch turnover is gated by slowest item.
Level 3: Continuous Batching
Continuous batching fixes that by replacing finished requests immediately.
```
T0: [A prefill] [B prefill] [C prefill] [D prefill]
T1: [A decode ] [B decode ] [C decode ] [D decode ]
T2: [A EOS    ] [B decode ] [C decode ] [D decode ]
T3: [E prefill] [B decode ] [C decode ] [D decode ]   # A replaced by E
```

Outcome:
- Higher occupancy.
- Better throughput under mixed request lengths.
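The scheduling policy can be sketched as a toy simulator (illustrative: each request is just a remaining-decode-step counter, and a waiting queue backfills freed slots every step):

```python
from collections import deque

def run_continuous_batching(requests, max_batch=4):
    """requests: list of per-request decode lengths. Returns total steps."""
    waiting = deque(requests)
    running = []
    steps = 0
    while waiting or running:
        # Backfill freed slots immediately instead of waiting for the
        # whole batch to drain (the static-batching behavior).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# One long request no longer gates the short ones behind it:
print(run_continuous_batching([8, 2, 2, 2, 2], max_batch=4))  # 8 steps
```

Static batching on the same workload would run the first batch for 8 steps and then the leftover request for 2 more; continuous batching finishes in the long request's own 8 steps.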
Level 4: The Real Bottleneck Is Memory
KV cache dominates serving economics.
It scales with:
- batch size
- context length
- layer/head dimensions
Naive per-request contiguous allocation causes fragmentation and over-reservation.
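These factors multiply, which a sizing sketch makes concrete (the model dimensions below are illustrative, roughly 7B-class with fp16 KV entries):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    # 2 tensors (K and V), each [batch, kv_heads, seq_len, head_dim], per layer.
    return 2 * layers * batch * kv_heads * seq_len * head_dim * dtype_bytes

# Illustrative dims: 32 layers, 32 KV heads, head_dim 128, fp16.
gb = kv_cache_bytes(batch=16, seq_len=4096, layers=32,
                    kv_heads=32, head_dim=128, dtype_bytes=2) / 2**30
print(f"{gb:.0f} GiB")  # 32 GiB of KV cache alone
```

A modest batch at moderate context already rivals the model weights in memory, which is why KV layout, not FLOPs, sets the serving ceiling.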
Level 5: PagedAttention (vLLM)
PagedAttention treats KV cache like virtual memory.
Instead of contiguous blocks per request:
- Split cache into fixed-size blocks (pages).
- Allocate blocks on demand.
- Track logical-to-physical mapping via block tables.
```
Request A logical blocks: [0] [1] [2] -> physical [P3] [P7] [P1]
Request B logical blocks: [0] [1]     -> physical [P3] [P9]      (shared prefix: P3)
```

Benefits:
- lower fragmentation
- tighter memory-to-work ratio
- prefix sharing becomes practical
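The block-table mechanics can be sketched in a few lines (illustrative: a free list of physical pages plus a per-request logical-to-physical table; real allocators also refcount shared blocks):

```python
class BlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}  # request_id -> list of physical block ids

    def append_block(self, request_id):
        # Allocate one fixed-size page on demand; no contiguity required.
        block = self.free.pop()
        self.tables.setdefault(request_id, []).append(block)
        return block

    def release(self, request_id):
        # Return a finished request's pages to the pool immediately.
        self.free.extend(self.tables.pop(request_id, []))

pool = BlockTable(num_physical_blocks=8)
pool.append_block("A"); pool.append_block("A"); pool.append_block("B")
print(pool.tables)      # {'A': [7, 6], 'B': [5]}
pool.release("A")
print(len(pool.free))   # 7 pages free again
```

Because requests hold pages rather than contiguous ranges, a finished request's memory is reusable by any other request on the next step.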
Level 6: Speculative Decoding
Decode is memory-bound and token-serial. Speculative decoding adds parallel verification.
Pattern:
- A small draft model proposes k tokens.
- The target model verifies all k candidates in one forward pass.
- Accept the longest correct prefix.

```
Draft proposes : "The" "quick" "brown"
Target verifies:  ✓     ✓       ✗
Accept: "The quick", then resample from the rejection point
```

When draft quality is strong, this reduces the number of expensive target passes.
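The accept/reject step can be sketched with greedy verification (a simplification: production implementations use a probabilistic acceptance rule that preserves the target distribution; here the target just checks argmax agreement):

```python
def verify_greedy(draft_tokens, target_argmax):
    """draft_tokens: k proposed tokens.
    target_argmax: the target model's argmax at each of those k positions,
    all computed in a single parallel forward pass."""
    accepted = []
    for draft, target in zip(draft_tokens, target_argmax):
        if draft != target:
            # First mismatch: take the target's own token and stop.
            accepted.append(target)
            break
        accepted.append(draft)
    return accepted

print(verify_greedy(["The", "quick", "brown"], ["The", "quick", "fox"]))
# ['The', 'quick', 'fox'] -- two draft tokens accepted, one corrected
```

Every accepted token is one decode step the target model did not have to run serially.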
Level 7: Prefill-Decode Disaggregation
Prefill and decode have different hardware profiles:
```
Prefill: compute-heavy, parallel-friendly
Decode : memory-bandwidth-heavy, token-serial
```

Running both on one pool causes interference. Disaggregation separates them:

```
[Prefill cluster] -- KV/state handoff --> [Decode cluster]
```

This is the serving analogue of separating ingestion and query paths in data systems.
Level 8: Prefix Caching
Many workloads repeat long prefixes (system prompts, policy blocks, shared context).
Without prefix caching:
- Every request repays full prefill cost.
With prefix caching:
- Reuse cached prefix KV.
- Prefill only the suffix delta.
This is one of the highest-leverage optimizations in enterprise chat workloads.
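A sketch of the reuse path (illustrative: the cache is keyed by a hash of the prefix tokens, and a counter stands in for prefill cost; real systems match at KV-block granularity, as in radix/prefix caches):

```python
import hashlib

prefix_cache = {}   # hash of prefix tokens -> prefix KV cache
processed = []      # tokens actually run through prefill (for illustration)

def run_prefill(tokens, cache):
    # Stand-in for real prefill: a tuple plays the role of the KV cache.
    processed.append(len(tokens))
    return (cache or ()) + tuple(tokens)

def prefill_with_reuse(prompt_ids, shared_prefix_len):
    prefix = tuple(prompt_ids[:shared_prefix_len])
    key = hashlib.sha256(repr(prefix).encode()).hexdigest()
    if key not in prefix_cache:
        # Miss: pay full prefill for the prefix once, then remember its KV.
        prefix_cache[key] = run_prefill(list(prefix), None)
    # Hit (or after the miss): only the suffix delta pays prefill cost.
    return run_prefill(prompt_ids[shared_prefix_len:], prefix_cache[key])

system = list(range(100))                 # shared system prompt
prefill_with_reuse(system + [900], 100)   # first request: prefix miss
prefill_with_reuse(system + [901], 100)   # second request: prefix hit
print(processed)  # [100, 1, 1] -- second request prefilled only 1 token
```

With a 100-token shared prefix, the second request's prefill shrinks from 101 tokens to 1, which is exactly where the TTFT win comes from.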
Level 9: The Modern Inference Stack
```
  KV cache
+ continuous batching
+ paged KV memory management
+ speculative decoding
+ prefill/decode separation
+ prefix reuse
= production-grade serving
```

Metrics That Actually Matter
- TTFT (time to first token): mostly prefill and queueing.
- ITL (inter-token latency): decode-path quality.
- Throughput (tokens/s): aggregate system capacity.
- Goodput: useful output per unit compute (penalizes wasted work).
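Given per-token timestamps, the latency metrics reduce to simple arithmetic (a sketch; times are in seconds and the values are illustrative):

```python
def latency_metrics(request_arrival, token_times):
    """token_times: wall-clock time each output token was emitted."""
    ttft = token_times[0] - request_arrival
    # Inter-token latency: mean gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    return ttft, itl

ttft, itl = latency_metrics(0.0, [0.35, 0.40, 0.45, 0.50])
print(f"TTFT={ttft:.2f}s, ITL={itl * 1000:.0f}ms")  # TTFT=0.35s, ITL=50ms
```

Note that reporting means hides the tail; in practice you want p95/p99 of both, since that is what users feel.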
Practical Engineering Takeaways
- Early wins come from KV cache and batching.
- Real scale usually fails on memory layout, not math kernels.
- Tail behavior dominates user experience more than median throughput.
- Prefix reuse and queue policy often beat model-level micro-optimizations.
- Treat inference serving as a systems architecture problem, not a single-model benchmark.
Further Reading
- vLLM / PagedAttention paper.
- DistServe (prefill/decode disaggregation).
- SGLang (prefix/radix-style caching approaches).
- Speculative decoding work (Leviathan et al.).