Latency work gets easier when you carry a few back-of-the-envelope budgets in your head.
This is a practical cheat sheet spanning:
- Local compute
- Network RTT
- Handshake overhead
- LLM first-token latency
Latency Ladder (Back-of-Envelope)
0.001ms 0.01ms 0.1ms 1ms 10ms 100ms
| | | | | |
Hash ------*
Norm+Hash ---------*
Phonetic ----------*
Jaccard ---------------------*-------*
SimHash ---------------------*-------*
Edit dist -------------------*-------*
Rolling hash ----------------*-------*
MinHash+LSH -------------------------*-------*
TF-IDF cos --------------------------*-------*
Embeddings (CPU) --------------------------------*--------*
Ranges vary by corpus and hardware. Use this for planning, not benchmarking claims.
L1, L2, L3: Why Cache Misses Matter
Rough order-of-magnitude latency:
- L1 cache hit: sub-nanosecond to ~1ns class.
- L2 cache hit: few ns.
- L3 cache hit: ~10ns class.
- DRAM: tens of ns to ~100ns+.
The point is not exact numbers. The point is scale: a miss that falls out of cache is often an order-of-magnitude jump.
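To make the scale argument concrete, here is a minimal sketch (an illustration, not a rigorous benchmark) that sums the same array through sequential vs. shuffled indices. In Python the gap is muted by interpreter overhead, so treat the ratio as directional only; the names and sizes are my own choices.

```python
import random
import timeit
from array import array

N = 1_000_000
data = array("q", range(N))          # contiguous 64-bit ints
seq_idx = list(range(N))             # cache-friendly walk
rand_idx = seq_idx[:]
random.shuffle(rand_idx)             # cache-hostile walk over the same data

def total(indices):
    s = 0
    for i in indices:
        s += data[i]
    return s

t_seq = timeit.timeit(lambda: total(seq_idx), number=1)
t_rand = timeit.timeit(lambda: total(rand_idx), number=1)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```

Both loops do identical arithmetic; any difference you see is access pattern, which is the whole point of the cache ladder above.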
RTT Budgets You Should Memorize
Useful mental defaults:
- Same AZ / same region service call: single-digit ms.
- US West <-> US East: ~60-85ms healthy, budget ~70ms baseline.
- Tail reality for public internet: 90-120ms+ is common.
That tail is what users feel.
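A hypothetical helper (the names and threshold values are mine, lifted from the mental defaults above) that flags a measured RTT against its planning budget:

```python
# Planning budgets in ms, taken from the mental defaults above.
RTT_BUDGET_MS = {
    "same_az": 5,           # same AZ / same region: single-digit ms
    "us_cross_coast": 70,   # US West <-> US East baseline
    "public_tail": 120,     # tail reality for public internet
}

def over_budget(path: str, measured_ms: float) -> bool:
    """Return True when a measured RTT exceeds the planning budget."""
    return measured_ms > RTT_BUDGET_MS[path]

print(over_budget("us_cross_coast", 85))  # above the ~70ms baseline
print(over_budget("same_az", 2))          # within single-digit ms
```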
TCP/TLS Handshake Tax
Startup is often RTT-multiplied:
- TCP 3-way handshake: ~1 RTT.
- TLS 1.3 full handshake: ~1 RTT.
- First request/first-byte: additional RTT-scale component.
If you do not reuse connections, coast-to-coast startup can easily cost hundreds of milliseconds before application logic runs.
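The RTT counts above can be sketched as a small calculator (an assumption-laden model: it ignores QUIC, 0-RTT, and TLS session resumption, and treats each phase as exactly one RTT):

```python
def startup_cost_ms(rtt_ms: float, reuse_connection: bool = False) -> float:
    """RTT-multiplied cost before the first response byte arrives."""
    if reuse_connection:
        return rtt_ms              # only the request/first-byte round trip
    rtts = 1 + 1 + 1               # TCP handshake + TLS 1.3 + first request
    return rtts * rtt_ms

print(startup_cost_ms(70))                          # cold coast-to-coast: 210.0
print(startup_cost_ms(70, reuse_connection=True))   # warm connection: 70.0
```

At a ~70ms cross-coast baseline, a cold connection burns roughly 3x the per-request RTT before any application logic runs, which is the "handshake tax" in numbers.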
LLM TTFT: Separate Queue, Prefill, and Network
A practical TTFT decomposition:
TTFT ~= queue_wait + prefill_compute + network_overhead
Why this matters:
- Queue is infra/scheduling.
- Prefill is model/prompt/hardware.
- Network is placement/transport.
A model can have strong decode speed and still feel slow if queue+prefill+network dominates first token.
Example shape (illustrative):
- GPT-4o-style fast path: lower queue + efficient prefill -> lower TTFT band.
- Slower path under load: queue spikes dominate and TTFT jumps even with same model.
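The decomposition is trivially additive, which is exactly why it is useful for debugging. A sketch with made-up component values (all milliseconds, purely illustrative):

```python
def ttft_ms(queue_wait: float, prefill_compute: float,
            network_overhead: float) -> float:
    """TTFT ~= queue_wait + prefill_compute + network_overhead."""
    return queue_wait + prefill_compute + network_overhead

# Same model, same prompt: only the queue term changes.
fast_path = ttft_ms(queue_wait=10, prefill_compute=120, network_overhead=40)
under_load = ttft_ms(queue_wait=900, prefill_compute=120, network_overhead=40)
print(fast_path, under_load)  # the queue spike alone moves the TTFT band
```

Attributing a TTFT regression to the wrong term (blaming the model when the queue spiked) is the most common failure mode this split prevents.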
End-to-End Rule of Thumb
When debugging latency:
- Start with percentile budgets (p50/p95/p99), not averages.
- Split compute from transport.
- Count handshake RTTs explicitly.
- Measure queue and prefill separately for LLMs.
Most production wins come from reducing variance and tails, not shaving microseconds off the median.
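The percentile-first habit can be sketched with the stdlib; the sample values here are hypothetical. `statistics.quantiles(..., n=100)` returns 99 cut points, so indices 49, 94, and 98 correspond to p50, p95, and p99:

```python
import statistics

# Hypothetical latency samples in ms: mostly fast, with a heavy tail.
samples = [12, 14, 13, 15, 80, 16, 13, 250, 14, 15] * 10

cuts = statistics.quantiles(samples, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
mean = statistics.fmean(samples)
print(f"mean={mean:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

Note how the mean sits well above the median: the average "looks" slow because the tail drags it up, while p95/p99 show where the pain actually lives.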