1.5: Decode — Computational Deep Dive
The Paradox We Need to Explain
In Artifact 4, we saw that prefill processes 512 tokens in ~30-50ms. Simple math suggests decode should be faster—after all, we're only processing 1 token per step instead of 512.
But reality is the opposite: each decode step takes ~10 ms, so generating 200 tokens takes ~2000 ms. That's roughly 40× slower than prefill, despite processing less than half as many tokens.
This artifact explains why. The answer lies in arithmetic intensity and the memory-bandwidth bound.
What Happens in a Single Decode Step
Let's trace through exactly what operations occur when we generate one token. Assume we're at position seq_len (the cache already contains seq_len tokens' worth of K and V).
Operation 1: Embedding Lookup
Input: 1 token ID
Output: [1, hidden_dim] tensor, e.g., [1, 4096]
This is a simple table lookup—negligible time.
Operation 2: Q, K, V Projections (Per Layer)
hidden_states: [1, 4096]
W_Q: [4096, 4096] ← 34 MB to read (fp16)!
W_K: [4096, 4096] ← 34 MB to read (fp16)!
W_V: [4096, 4096] ← 34 MB to read (fp16)!
Q_new = hidden_states @ W_Q → [1, 4096]
K_new = hidden_states @ W_K → [1, 4096]
V_new = hidden_states @ W_V → [1, 4096]
Operation 3: KV Cache Read and Update
Read from cache:
K_cached: [seq_len, num_heads, head_dim] e.g., [1000, 32, 128] = 8 MB (fp16)
V_cached: [seq_len, num_heads, head_dim] e.g., [1000, 32, 128] = 8 MB (fp16)
Append new K, V:
K_full = concat(K_cached, K_new) → [seq_len+1, 32, 128]
V_full = concat(V_cached, V_new) → [seq_len+1, 32, 128]
Write back to cache: ~16 MB (a naive concat rewrites the whole cache; a preallocated cache writes only the new ~16 KB)
Operation 4: Attention Computation
Q_new: [1, 32, 128] (32 heads, 128 dim each)
K_full: [seq_len+1, 32, 128]
For each head:
scores = Q_new @ K_full.T → [1, seq_len+1]
weights = softmax(scores) → [1, seq_len+1]
output = weights @ V_full → [1, 128]
This is a [1, seq_len+1] attention—one query against many keys.
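This single-query attention step can be sketched in NumPy (illustrative names; a real implementation fuses these steps inside one GPU kernel):

```python
import numpy as np

def decode_attention(q_new, k_full, v_full):
    """One query attending over all cached keys/values, per head.

    q_new:  [num_heads, head_dim]           -- query for the single new token
    k_full: [seq_len+1, num_heads, head_dim]
    v_full: [seq_len+1, num_heads, head_dim]
    returns [num_heads, head_dim]
    """
    head_dim = q_new.shape[-1]
    # scores: [num_heads, seq_len+1] -- one row of the attention matrix per head
    scores = np.einsum("hd,shd->hs", q_new, k_full) / np.sqrt(head_dim)
    # softmax over the key axis (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of values: [num_heads, head_dim]
    return np.einsum("hs,shd->hd", weights, v_full)
```

Note there is no [seq, seq] score matrix here, unlike prefill: decode produces a single [1, seq_len+1] row per head.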
Operation 5: Output Projection
attention_output: [1, 4096]
W_O: [4096, 4096] ← 67 MB to read!
output = attention_output @ W_O → [1, 4096]
Operation 6: FFN Layers
x: [1, 4096]
W1: [4096, 16384] ← 134 MB to read (fp16)!
W2: [16384, 4096] ← 134 MB to read (fp16)!
ffn_out = GELU(x @ W1) @ W2 → [1, 4096]
Operation 7: Repeat for All Layers, Then Output
After all 32 layers:
final_hidden: [1, 4096]
W_output: [4096, vocab_size] e.g., [4096, 32000] = 262 MB to read (fp16)!
logits = final_hidden @ W_output → [1, 32000]
next_token = sample(logits)
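Putting the seven operations together, here is a minimal single-layer decode step in NumPy. This is a scaled-down sketch, not LLaMA's actual implementation: layer norms, residual connections, rotary embeddings, and the 32-layer stack are omitted, and every key in `params` is an illustrative name:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def decode_step(token_id, k_cache, v_cache, params):
    """One decode step for a single transformer layer.

    k_cache, v_cache: [seq_len, heads, head_dim] -- grow by one row per step.
    Returns (next_token_id, k_cache, v_cache).
    """
    p = params
    heads, head_dim = k_cache.shape[1], k_cache.shape[2]
    x = p["embed"][token_id][None, :]            # Op 1: embedding lookup, [1, H]

    # Op 2: Q/K/V projections -- each reads a full [H, H] weight matrix
    q = (x @ p["W_q"]).reshape(heads, head_dim)
    k = (x @ p["W_k"]).reshape(1, heads, head_dim)
    v = (x @ p["W_v"]).reshape(1, heads, head_dim)

    # Op 3: naive cache update (real engines preallocate and write in place)
    k_cache = np.concatenate([k_cache, k], axis=0)
    v_cache = np.concatenate([v_cache, v], axis=0)

    # Op 4: one query row against all cached keys, per head
    scores = np.einsum("hd,shd->hs", q, k_cache) / np.sqrt(head_dim)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    attn = np.einsum("hs,shd->hd", w, v_cache).reshape(1, -1)

    # Ops 5-6: output projection, then FFN
    x = attn @ p["W_o"]
    x = gelu(x @ p["W_1"]) @ p["W_2"]

    # Op 7: unembedding to vocabulary logits (greedy sampling shown)
    logits = x @ p["W_out"]
    return int(logits.argmax()), k_cache, v_cache
```

Each call touches every weight matrix exactly once to emit a single token, which is the root of the inefficiency this artifact analyzes.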
The Arithmetic Intensity Disaster
Now let's calculate the arithmetic intensity for these decode operations. Remember from Artifact 4: we need ~156 FLOPs/byte to keep an A100 fully utilized.
Q Projection (Decode vs Prefill Comparison)
Decode (1 token):
hidden_states: [1, 4096]
W_Q: [4096, 4096]
FLOPs = 2 × 1 × 4096 × 4096 = 33.6 million
Bytes = 2 × (1×4096 + 4096×4096 + 1×4096) ≈ 34 MB (dominated by W_Q, at 2 bytes/element for fp16)
Arithmetic Intensity = 33.6M / 33.6M ≈ 1 FLOP/byte
This is 156× BELOW the threshold!
Prefill (512 tokens) — for comparison:
Arithmetic Intensity = 410 FLOPs/byte (from [Artifact 4](https://huggingface.co/blog/atharv6f/prefill-computational-deep-dive))
The same weight matrix, the same memory read, but 1/512th the computation. The GPU reads 67 MB of weights to perform a trivial amount of math.
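The comparison above can be reproduced in a few lines (assuming fp16, i.e., 2 bytes per element, matching the model sizes used throughout this artifact):

```python
def arithmetic_intensity(batch, d_in=4096, d_out=4096, bytes_per_elem=2):
    """FLOPs per byte for a [batch, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out                  # multiply + accumulate
    bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                    + d_in * d_out    # read weights
                                    + batch * d_out)  # write output
    return flops / bytes_moved

print(arithmetic_intensity(1))    # decode:  ≈ 1 FLOP/byte
print(arithmetic_intensity(512))  # prefill: ≈ 410 FLOPs/byte
```

The weight-read term dominates the byte count at batch 1, but is amortized over 512 rows of activations during prefill, which is the entire difference between the two phases.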
All Decode Operations Analysis
┌─────────────────────────────────────────────────────────────────────────┐
│             DECODE OPERATIONS (1 token, seq_len=1000, fp16)             │
├──────────────────────┬──────────────┬─────────────┬────────────────────┤
│ Operation            │ FLOPs        │ Bytes       │ Arithmetic         │
│                      │              │             │ Intensity          │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ Q projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ K projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ V projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ KV cache read        │ ~0           │ 16 MB       │ ~0 FLOPs/byte ✗    │
│ Q @ Kᵀ (attention)   │ 8.2 M        │ ~0 (cached) │ high (but tiny)    │
│ softmax @ V          │ 8.2 M        │ ~0 (cached) │ high (but tiny)    │
│ Output projection    │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ FFN layer 1          │ 134 M        │ 134 MB      │ 1 FLOP/byte ✗      │
│ FFN layer 2          │ 134 M        │ 134 MB      │ 1 FLOP/byte ✗      │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ Total per layer      │ ~420 M       │ ~420 MB     │ ~1 FLOP/byte ✗     │
│ All 32 layers        │ ~13 B        │ ~13.5 GB    │ ~1 FLOP/byte ✗     │
│ + Output projection  │ +0.26 B      │ +0.26 GB    │                    │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ TOTAL per decode step│ ~13 B        │ ~14 GB      │ ~1 FLOP/byte ✗     │
└──────────────────────┴──────────────┴─────────────┴────────────────────┘
Required for compute-bound: 156 FLOPs/byte
Actual achieved: ~1 FLOP/byte
We are ~156× below the threshold. The GPU is massively underutilized.
The Memory Bandwidth Bottleneck
Since decode is memory-bandwidth-bound, the time is determined by how fast we can read data, not how fast we can compute.
Time Calculation
Decode time ≈ Bytes to transfer / Memory bandwidth
For one decode step on LLaMA-7B on A100:
Bytes to transfer:
• Model weights: ~14 GB (every weight is read once per decode step, across all 32 layers)
• KV cache read: ~16 MB × 32 layers ≈ 0.5 GB (for seq_len=1000)
• KV cache write: ~16 KB × 32 layers ≈ 0.5 MB (append new K,V)
• Activations: negligible
Total: ~14.5 GB per decode step
A100 memory bandwidth: 2 TB/s
Time = 14.5 GB / 2 TB/s ≈ 7.3 ms
(Real-world: ~8-12 ms due to overhead, cache effects, etc.)
Notice what determines the time: reading the model weights. Every decode step must read essentially all model parameters from GPU memory. The actual computation (13 billion FLOPs) could be done in microseconds if the data were already available.
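This time model is a one-liner; the constants below are the rough figures used above, not measured values:

```python
# Bandwidth-bound model: decode step time ≈ bytes moved / memory bandwidth.
weights_bytes = 14e9      # LLaMA-7B in fp16
kv_read_bytes = 0.5e9     # ~16 MB/layer x 32 layers at seq_len=1000
bandwidth     = 2e12      # A100 HBM, ~2 TB/s

step_time = (weights_bytes + kv_read_bytes) / bandwidth
print(f"{step_time * 1e3:.2f} ms per decode step")   # → 7.25 ms per decode step
```

Note that compute capability appears nowhere in this model; that is what it means to be memory-bandwidth-bound.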
Where the Time Goes
┌─────────────────────────────────────────────────────────────────────────┐
│ TIME BREAKDOWN: ONE DECODE STEP │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Total time: ~10 ms │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │█████████████████ READING WEIGHTS ██████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ 0 ms ~7 ms 10 ms│
│ │
│ ┌──────────────────────┐ │
│ │░░ KV Cache Read ░░░░░│ ~0.5 ms (grows with sequence length) │
│ └──────────────────────┘ │
│ │
│ ┌────┐ │
│ │Comp│ < 0.1 ms (actual arithmetic) │
│ └────┘ │
│ │
│ The GPU spends ~95% of its time waiting for data from memory. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
GPU Utilization During Decode
Because decode is memory-bound, GPU compute utilization is extremely low:
Decode GPU Utilization (typical):
┌────────────────────────────────────────────────────────────────┐
│████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└────────────────────────────────────────────────────────────────┘
▲ ▲ ▲
0% ~5-10% 100%
Theoretical utilization = Actual FLOPs / Peak FLOPs
= 13B / (312 TFLOPS × 0.010s)
= 13B / 3.12T
= 0.4%
Even accounting for overhead, utilization is typically < 10%.
You're paying for a $10,000 GPU and using < 10% of its compute power.
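The utilization estimate works out as follows, using the rough per-step figures from above:

```python
peak_flops = 312e12   # A100 fp16 tensor-core peak, FLOPs/s
step_flops = 13e9     # FLOPs actually performed per decode step
step_time  = 0.010    # ~10 ms observed per step

utilization = step_flops / (peak_flops * step_time)
print(f"{utilization:.1%}")   # → 0.4%
```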
This is the fundamental inefficiency of LLM inference that drives most optimization research.
Why You Must Read Weights Every Step
You might wonder: "Why read the weights every step? Can't we keep them in cache?"
The answer reveals a fundamental constraint of GPU architecture.
GPU Memory Hierarchy
┌─────────────────────────────────────────────────────────────────────────┐
│ GPU MEMORY HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Registers │ Fastest (~20 TB/s effective) │
│ │ ~256 KB/SM │ But tiny capacity │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ L1 Cache / │ Very fast (~10 TB/s) │
│ │ Shared Memory │ ~128-256 KB per SM │
│ │ │ Total: ~10-20 MB across GPU │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ L2 Cache │ Fast (~5 TB/s) │
│ │ ~40-50 MB │ Shared across all SMs │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HBM (Main │ "Slow" (2 TB/s) │
│ │ GPU Memory) │ But large: 40-80 GB │
│ │ │ This is where model weights live │
│ └─────────────────┘ │
│ │
│ LLaMA-7B weights: ~14 GB │
│ L2 cache: ~50 MB │
│ │
│ The weights are 280× larger than the cache! │
│ They MUST be read from HBM for each operation. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The model weights simply don't fit in cache. Every decode step must stream them from HBM (main GPU memory), which is the bandwidth bottleneck.
The Decode Data Flow
For each decode step:
HBM (GPU Memory) Compute Units
┌──────────────┐ ┌──────────┐
│ │ Stream weights │ │
│ Weights │ ───────────────────────►│ Matrix │
│ (14 GB) │ @ 2 TB/s │ Multiply│
│ │ │ │
├──────────────┤ Stream KV cache │ │
│ KV Cache │ ───────────────────────►│ (for │
│ (1+ GB) │ @ 2 TB/s │ 1 token) │
│ │ │ │
└──────────────┘ └──────────┘
│ │
│ ~7-10 ms │ ~0.04 ms
│ (memory transfer) │ (compute)
│ │
└──────────────────────────────────────┘
The compute units finish in microseconds, then wait
for the next batch of data to arrive from memory.
Time Complexity of Decode
Time Per Step
Each decode step time is dominated by memory bandwidth:
Time per decode step ≈ (Model size + KV cache size) / Memory bandwidth
For LLaMA-7B at sequence position s:
Model: 14 GB (constant)
KV cache: ~0.5 MB × s (grows linearly with sequence length)
At s=500: Time ≈ (14 GB + 0.25 GB) / 2 TB/s ≈ 7.1 ms
At s=1000: Time ≈ (14 GB + 0.5 GB) / 2 TB/s ≈ 7.3 ms
At s=4000: Time ≈ (14 GB + 2 GB) / 2 TB/s ≈ 8.0 ms
At s=16000: Time ≈ (14 GB + 8 GB) / 2 TB/s ≈ 11.0 ms
For moderate sequences, time per step is roughly constant (dominated by model weights). For very long sequences, the KV cache read starts to matter.
Total Decode Time
To generate G tokens:
Total decode time ≈ G × (time per step)
≈ G × (Model size / Bandwidth)
≈ O(G) — linear in generated tokens
For 200 tokens on LLaMA-7B:
Time ≈ 200 × 7.5 ms = 1500 ms = 1.5 seconds
Plus KV cache overhead as sequence grows (minor for moderate lengths).
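This scaling can be sketched as a simple model (constants are the rough fp16 figures used throughout this artifact):

```python
WEIGHTS = 14e9         # bytes: LLaMA-7B weights in fp16
KV_PER_TOKEN = 0.5e6   # bytes: ~0.5 MB of K+V per token across 32 layers
BW = 2e12              # bytes/s: A100 memory bandwidth

def step_time(s):
    """Seconds for the decode step at sequence position s."""
    return (WEIGHTS + KV_PER_TOKEN * s) / BW

for s in (500, 1000, 4000, 16000):
    print(f"s={s:>5}: {step_time(s) * 1e3:.1f} ms")

def total_time(prompt_len, gen_tokens):
    """Total decode seconds: one bandwidth-bound step per generated token."""
    return sum(step_time(prompt_len + i) for i in range(gen_tokens))

print(f"{total_time(512, 200):.2f} s")   # ~1.4 s
```

The weight term is constant per step, so total time is linear in generated tokens; the KV term adds a small quadratic correction that only matters at long sequence lengths.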
Scaling Behavior
┌─────────────────────────────────────────────────────────────────────────┐
│ DECODE TIME VS TOKENS GENERATED │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Time (ms) │
│ ▲ │
│ │ × │
│ 4000 ┤ × │
│ │ × │
│ 3000 ┤ × │
│ │ × │
│ 2000 ┤ × │
│ │ × │
│ 1000 ┤ × │
│ │ × │
│ 500 ┤ × │
│ │ × │
│ └──────────────────────────────────────────────────────────────────►
│ 50 100 150 200 250 300 350 400 450 500 │
│ Tokens generated │
│ │
│ Nearly linear: each token costs ~7-10 ms regardless of position │
│ (slight increase for very long sequences due to KV cache growth) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Prefill vs Decode: The Complete Picture
Let's put everything together with a direct comparison:
┌─────────────────────────────────────────────────────────────────────────┐
│ PREFILL VS DECODE: COMPLETE COMPARISON │
├────────────────────────┬────────────────────────┬───────────────────────┤
│ Metric │ PREFILL │ DECODE │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Tokens per forward pass│ N (all prompt tokens) │ 1 │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Arithmetic intensity   │ ~400-500 FLOPs/byte    │ ~1 FLOP/byte          │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Bound by │ Compute (FLOPS) │ Memory bandwidth │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ GPU utilization │ 70-85% │ < 10% │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Time determined by │ FLOPs / Compute power │ Bytes / Bandwidth │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Bottleneck │ How fast GPU computes │ How fast GPU reads │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Weight matrices │ Read once, used N times│ Read once, used once │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Scales with │ O(N) to O(N²) │ O(G) linear │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Example time │ 512 tokens: ~30-50 ms │ 200 tokens: ~1500 ms │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Parallelizable │ Yes (within pass) │ No (sequential) │
└────────────────────────┴────────────────────────┴───────────────────────┘
The Fundamental Inefficiency
This is the core problem of LLM inference:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE DECODE INEFFICIENCY PROBLEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For each decode step: │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ READ: ~14 GB of model weights from GPU memory │ │
│ │ COMPUTE: ~13 billion FLOPs (trivial for GPU) │ │
│ │ OUTPUT: 1 token │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ You read the ENTIRE MODEL to produce ONE TOKEN. │
│ │
│ The GPU is a supercomputer that could do 312 trillion ops/second, │
│ but it's stuck waiting for data 95% of the time. │
│ │
│ It's like having a Formula 1 car stuck in traffic. │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ This is why inference optimization matters. │
│ This is why batching helps (more on this in Artifact 7). │
│ This is why model compression, quantization, and speculative │
│ decoding exist—they all attack this fundamental bottleneck. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary: Why Decode Is Slow
Batch size of 1: Each step processes exactly one token, so we can't amortize memory reads across multiple tokens.
Low arithmetic intensity (~1 FLOP/byte): Roughly 156× below the ~156 FLOPs/byte threshold needed for compute-bound operation.
Memory-bandwidth-bound: Time is determined by how fast we can read model weights from GPU memory, not by computation speed.
Must read all weights every step: The model weights (~14 GB for LLaMA-7B) don't fit in cache and must be streamed from HBM for every single token.
Sequential dependency: Can't parallelize across tokens because token N+1 depends on token N.
GPU underutilization: Despite having enormous compute capability, the GPU sits idle ~90-95% of the time waiting for data.
The result: Generating 200 tokens takes ~1.5-2 seconds even on a $10,000+ GPU, with most of that time spent waiting for memory transfers.
Check Your Understanding
Before moving on:
If you double the memory bandwidth of a GPU (keeping compute constant), roughly how much faster would decode get? What about prefill?
Why doesn't processing 1 token take 1/512th the time of processing 512 tokens, even though it's 1/512th the computation?
A user complains: "I have an A100 but decode is slow. Should I upgrade to a GPU with more FLOPS?" What would you tell them?