1.6: The Utilization Paradox — Quantitative Analysis
Setting Up the Concrete Example
Let's work through a complete, realistic example with actual numbers. This will solidify everything we've learned and reveal the "utilization paradox" quantitatively.
Scenario:
- Model: LLaMA-7B
- GPU: NVIDIA A100 (80GB)
- Prompt: 500 tokens
- Generated output: 200 tokens
- Total sequence at end: 700 tokens
Hardware specs we'll use:
- A100 compute: 312 TFLOPS (FP16)
- A100 memory bandwidth: 2 TB/s
- Model size: ~14 GB (FP16 weights)
- KV cache per token: ~0.5 MB (for LLaMA-7B)
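The ~0.5 MB/token figure falls out of the model shapes. A quick derivation in Python, assuming FP16 and the standard LLaMA-7B dimensions (32 layers, hidden size 4096):

```python
# Per-token KV-cache size for LLaMA-7B in FP16: every layer caches one
# K vector and one V vector of length hidden_dim.
n_layers = 32
hidden_dim = 4096
bytes_per_value = 2  # FP16

kv_bytes_per_token = 2 * n_layers * hidden_dim * bytes_per_value  # K and V
print(kv_bytes_per_token / 2**20)  # 0.5 (MiB per token)
```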
Phase 1: Prefill Analysis
Computation Required
For prefill with N=500 tokens through LLaMA-7B (32 layers, hidden_dim=4096):
Per layer:
Q, K, V projections: 3 × 2 × N × d² = 3 × 2 × 500 × 4096² = 50.3 B FLOPs
Attention (Q @ Kᵀ): 2 × N² × d = 2 × 500² × 4096 = 2.0 B FLOPs
Attention (weights @ V): 2 × N² × d = 2 × 500² × 4096 = 2.0 B FLOPs
Output projection: 2 × N × d² = 2 × 500 × 4096² = 16.8 B FLOPs
FFN (two layers): 2 × 2 × N × d × 4d = 4 × 500 × 4096 × 16384 = 134.2 B FLOPs
Total per layer: ~205 B FLOPs
All 32 layers: 32 × 205 B = 6.56 T FLOPs
Output projection to vocab: 2 × 500 × 4096 × 32000 = 131 B FLOPs
Total prefill FLOPs: ~6.7 trillion
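The per-layer arithmetic above can be packaged into a short estimator. A back-of-envelope sketch using the LLaMA-7B shapes from this section (the function and variable names are ours, not from any library):

```python
# Rough prefill FLOP count for a LLaMA-7B-shaped model, following the
# per-layer breakdown above. An estimate, not a profiler measurement.

def prefill_flops(n_tokens, d=4096, n_layers=32, vocab=32000):
    qkv = 3 * 2 * n_tokens * d * d         # Q, K, V projections
    attn = 2 * (2 * n_tokens**2 * d)       # Q @ K^T plus weights @ V
    out_proj = 2 * n_tokens * d * d        # attention output projection
    ffn = 2 * 2 * n_tokens * d * (4 * d)   # two FFN matmuls
    per_layer = qkv + attn + out_proj + ffn
    lm_head = 2 * n_tokens * d * vocab     # final projection to vocab
    return n_layers * per_layer + lm_head

print(prefill_flops(500) / 1e12)  # ≈ 6.7 (TFLOPs, matching the total above)
```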
Memory Transfers
Read model weights: 14 GB (once for the forward pass)
Write KV cache: 500 tokens × 0.5 MB/token = 250 MB
Read/write activations: ~negligible compared to weights
Total memory transfer: ~14.3 GB
Arithmetic Intensity
Arithmetic Intensity = 6.7 T FLOPs / 14.3 GB
= 6.7 × 10¹² / 14.3 × 10⁹
= 468 FLOPs/byte
This is well above the 156 FLOPs/byte threshold (the A100's ridge point: 312 TFLOPS ÷ 2 TB/s) → COMPUTE-BOUND ✓
Time Calculation
Since prefill is compute-bound, time is determined by compute throughput:
Prefill time ≈ Total FLOPs / GPU throughput
= 6.7 T / 312 T/s
= 21.5 ms (theoretical minimum)
Real-world (accounting for overhead, memory latency, kernel launch, etc.):
Prefill time ≈ 35-50 ms
Let's use 40 ms for our analysis.
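The "bound by whichever resource takes longer" logic generalizes into a tiny roofline model. A sketch using the A100 spec-sheet numbers above (peak FLOPS and bandwidth are nominal values, not measured):

```python
# Roofline-style lower bound: a phase takes at least as long as its
# compute time and at least as long as its memory-transfer time, so the
# larger of the two is the bottleneck.

PEAK_FLOPS = 312e12  # A100 FP16 tensor-core peak, FLOP/s
PEAK_BW = 2e12       # A100 HBM bandwidth, bytes/s

def phase_time(flops, bytes_moved):
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound

t, bound = phase_time(6.7e12, 14.3e9)  # prefill numbers from above
print(f"{t * 1e3:.1f} ms, {bound}-bound")  # 21.5 ms, compute-bound
```

Feeding in the decode numbers from the next section gives the memory-bound 7.15 ms floor per step.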
GPU Utilization During Prefill
Theoretical utilization = Achieved FLOPS / Peak FLOPS
= (6.7 T / 0.040 s) / 312 T/s
= 167.5 T/s / 312 T/s
= 53.7%
Accounting for the fact that not all operations are perfectly parallelizable:
Real-world utilization ≈ 50-70%
This is GOOD utilization for real workloads.
Phase 2: Decode Analysis
Per-Step Computation
For each decode step (processing 1 token):
Per layer:
Q, K, V projections: 3 × 2 × 1 × d² = 3 × 2 × 1 × 4096² = 100.7 M FLOPs
Attention (Q @ Kᵀ): 2 × 1 × seq_len × d = grows with position (~5.7 M FLOPs at seq_len=700)
Attention (weights @ V): 2 × 1 × seq_len × d = grows with position (~5.7 M FLOPs at seq_len=700)
Output projection: 2 × 1 × d² = 33.6 M FLOPs
FFN: 4 × 1 × d × 4d = 268.4 M FLOPs
Total per layer (excluding attention): ~402 M FLOPs
All 32 layers: ~12.9 B FLOPs
Output projection: 2 × 1 × 4096 × 32000 = 262 M FLOPs
Total per decode step: ~13.2 B FLOPs (relatively constant)
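The same breakdown as a sketch. Only the attention term depends on the current sequence length, so the total drifts up slightly as generation proceeds (names are illustrative):

```python
# Rough FLOP count for one decode step of a LLaMA-7B-shaped model.
# Only the attention term depends on how many tokens are already cached.

def decode_step_flops(seq_len, d=4096, n_layers=32, vocab=32000):
    qkv = 3 * 2 * d * d           # projections for the single new token
    attn = 2 * (2 * seq_len * d)  # attend over all cached positions
    out_proj = 2 * d * d
    ffn = 2 * 2 * d * (4 * d)
    per_layer = qkv + attn + out_proj + ffn
    lm_head = 2 * d * vocab
    return n_layers * per_layer + lm_head

print(decode_step_flops(700) / 1e9)  # ≈ 13.5 (billion FLOPs at the last step)
```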
Per-Step Memory Transfers
Read model weights: 14 GB (must read ALL weights for each token!)
Read KV cache: grows from 250 MB (start) to 350 MB (end)
Write KV cache: 0.5 MB (append one token's K,V)
Average memory transfer per step: ~14.3 GB
Arithmetic Intensity (Decode)
Arithmetic Intensity = 13.2 B FLOPs / 14.3 GB
= 13.2 × 10⁹ / 14.3 × 10⁹
= 0.92 FLOPs/byte
This is 170× BELOW the 156 FLOPs/byte threshold → MEMORY-BOUND ✗
Time Per Decode Step
Since decode is memory-bound, time is determined by bandwidth:
Decode step time ≈ Bytes transferred / Memory bandwidth
= 14.3 GB / 2 TB/s
= 7.15 ms (theoretical minimum)
Real-world (overhead, cache effects, kernel launch):
Decode step time ≈ 8-12 ms
Let's use 10 ms for our analysis.
Total Decode Time
Total decode time = 200 tokens × 10 ms/token = 2000 ms = 2.0 seconds
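Since decode is bandwidth-limited, total decode time can be lower-bounded by summing bytes moved per step. A sketch using the figures above (14 GB of weights, 0.5 MB of KV per token, 2 TB/s); this gives the theoretical floor, below the ~10 ms/step real-world figure used in the text:

```python
# Theoretical-minimum decode time: each step re-reads all weights plus
# the KV cache, which grows by one token per step.

WEIGHT_BYTES = 14e9   # FP16 LLaMA-7B weights
KV_PER_TOKEN = 0.5e6  # bytes of cached K,V per token
PEAK_BW = 2e12        # bytes/s

def decode_time(prompt_len, gen_len):
    total = 0.0
    for step in range(gen_len):
        kv_bytes = (prompt_len + step) * KV_PER_TOKEN
        total += (WEIGHT_BYTES + kv_bytes) / PEAK_BW
    return total

print(decode_time(500, 200))  # ≈ 1.43 (seconds, theoretical floor)
```

Note that the weight reads account for 1.4 s of that; the growing KV cache adds only ~0.03 s at this sequence length.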
GPU Utilization During Decode
Theoretical utilization = Achieved FLOPS / Peak FLOPS
= (13.2 B / 0.010 s) / 312 T/s
= 1.32 T/s / 312 T/s
= 0.42%
Real-world utilization ≈ 1-5%
This is TERRIBLE utilization. The GPU is 95-99% idle!
The Complete Picture: Time Breakdown
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPLETE REQUEST: 500 prompt tokens → 200 generated │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL PHASE │
│ ┌────────┐ │
│ │████████│ 40 ms │
│ └────────┘ │
│ • Processes: 500 tokens │
│ • FLOPs: 6.7 trillion │
│ • GPU utilization: ~50-70% │
│ • Bound by: Compute │
│ │
│ DECODE PHASE │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ 2000 ms (50× longer than prefill!) │
│ • Processes: 200 tokens (one at a time) │
│ • FLOPs: 200 × 13.2 B = 2.64 trillion (less than prefill!) │
│ • GPU utilization: ~1-5% │
│ • Bound by: Memory bandwidth │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ TOTAL TIME: 2040 ms │
│ • Prefill: 40 ms (2.0% of total) │
│ • Decode: 2000 ms (98.0% of total) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Utilization Paradox Explained
Here's the paradox in stark terms:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE UTILIZATION PARADOX │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL DECODE │
│ ─────── ────── │
│ Tokens processed: 500 200 │
│ Total FLOPs: 6.7 T 2.64 T │
│ Time taken: 40 ms 2000 ms │
│ GPU utilization: ~60% ~2% │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #1: Decode does LESS computation but takes MORE time │
│ │
│ Prefill: 6.7 T FLOPs in 40 ms = 167 TFLOPS achieved │
│ Decode: 2.64 T FLOPs in 2000 ms = 1.3 TFLOPS achieved │
│ │
│ Decode achieves 128× LOWER throughput despite same hardware! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #2: Lower utilization means LONGER time, not shorter │
│ │
│ Intuition says: "If GPU is only 2% busy, the work must be easy" │
│ Reality: "GPU is 2% busy because it's WAITING, not because │
│ the work is easy" │
│ │
│ Low utilization = memory bottleneck = slow │
│ High utilization = compute bottleneck = fast (for the work done) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #3: Processing 1 token is not 500× faster than 500 tokens │
│ │
│ If computation scaled linearly: │
│ 1 token should take: 40 ms / 500 = 0.08 ms │
│ │
│ Actual time for 1 token: 10 ms │
│ │
│ That's 125× slower than linear scaling would predict! │
│ │
│ Why? The memory reads don't scale down with batch size. │
│ You read 14 GB of weights whether processing 1 or 500 tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Visualizing the Time Budget
Let's see where every millisecond goes:
┌─────────────────────────────────────────────────────────────────────────┐
│ WHERE DOES THE TIME GO? │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL (40 ms): │
│ ┌──────────────────────────────────────┬───────────────┐ │
│ │ Computing (matrix ops) │ Memory xfer │ │
│ │ ~30 ms │ ~10 ms │ │
│ └──────────────────────────────────────┴───────────────┘ │
│ The GPU is mostly COMPUTING. Memory transfers overlap with compute. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ DECODE (each 10 ms step): │
│ ┌───┬──────────────────────────────────────────────────────────────┐ │
│ │Cal│ Waiting for memory │ │
│ │0.5│ ~9.5 ms │ │
│ └───┴──────────────────────────────────────────────────────────────┘ │
│ The GPU is mostly WAITING. Compute finishes almost instantly. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ FULL REQUEST TIMELINE: │
│ │
│ 0 ms 40 ms 2040 ms │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────┬──────────────────────────────────────────────────────┐ │
│ │▓▓PREFILL▓▓▓│░░░░░░░░░░░░░░░░░ DECODE ░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ 500 tokens │ token token token token ... (200 tokens) ... token │ │
│ │ parallel │ 1 2 3 4 200 │ │
│ └────────────┴──────────────────────────────────────────────────────┘ │
│ │
│ ▓▓▓ = High GPU utilization (compute-bound) │
│ ░░░ = Low GPU utilization (memory-bound) │
│ │
│ 98% of your inference time is spent with the GPU mostly idle. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Cost Analysis: What Are You Paying For?
Let's translate utilization into dollars:
┌─────────────────────────────────────────────────────────────────────────┐
│ COST EFFICIENCY ANALYSIS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ A100 GPU cost: ~$2/hour (cloud) or ~$15,000 (purchase) │
│ A100 peak capability: 312 TFLOPS │
│ │
│ For our 2.04 second request: │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase │ Time │ FLOPs │ Utilization │ "Wasted" capacity │ │
│ ├──────────┼─────────┼──────────┼─────────────┼───────────────────┤ │
│ │ Prefill │ 40 ms │ 6.7 T │ 54% │ 46% │ │
│ │ Decode │ 2000 ms │ 2.64 T │ 0.4% │ 99.6% │ │
│ └──────────┴─────────┴──────────┴─────────────┴───────────────────┘ │
│ │
│ During decode, for every 1 FLOP of useful work, the GPU COULD have │
│ done 250 FLOPs. You're paying for 250× more compute than you use. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ Total compute available in 2.04 s: 312 T × 2.04 = 636 T FLOPs │
│ Total compute used: 6.7 T + 2.64 T = 9.34 T FLOPs │
│ Overall utilization: 9.34 / 636 = 1.5% │
│ │
│ You're using 1.5% of what you're paying for! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
This is why inference optimization matters economically. If you can increase utilization from 1.5% to 15% (10×), you can serve 10× more requests with the same hardware, or reduce your GPU costs by 90%.
Scaling Analysis: What Happens With Different Inputs?
Let's see how time breakdown changes with different prompt/generation lengths:
┌─────────────────────────────────────────────────────────────────────────┐
│ TIME BREAKDOWN FOR DIFFERENT REQUEST PROFILES │
├───────────────────┬──────────┬──────────┬──────────┬───────────────────┤
│ Scenario │ Prefill │ Decode │ Total │ Decode % of total │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Short prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (100 → 50) │ 10 ms │ 500 ms │ 510 ms │ 98.0% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Medium prompt, │ │ │ │ │
│ medium output │ │ │ │ │
│ (500 → 200) │ 40 ms │ 2000 ms │ 2040 ms │ 98.0% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Long prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (2000 → 50) │ 180 ms │ 500 ms │ 680 ms │ 73.5% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Short prompt, │ │ │ │ │
│ long output │ │ │ │ │
│ (100 → 500) │ 10 ms │ 5000 ms │ 5010 ms │ 99.8% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Long prompt, │ │ │ │ │
│ long output │ │ │ │ │
│ (2000 → 500) │ 180 ms │ 5000 ms │ 5180 ms │ 96.5% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Very long prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (8000 → 50) │ 1200 ms │ 550 ms │ 1750 ms │ 31.4% │
└───────────────────┴──────────┴──────────┴──────────┴───────────────────┘
Key observations:
• For most typical use cases, decode dominates (>95% of time)
• Only with very long prompts AND short outputs does prefill become significant
• The longer the generated output, the more decode dominates
• This is why most optimization research focuses on decode
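The table above can be approximated with a two-parameter model: prefill scaling roughly linearly with prompt length (~0.08 ms/token, ignoring the quadratic attention term that matters for very long prompts) and decode at a flat ~10 ms/token. A rough sketch; the constants are fitted to this section's example, not measured:

```python
# Rough time-breakdown model: linear prefill, flat per-token decode.

def breakdown(prompt_len, gen_len, prefill_ms_per_tok=0.08, decode_ms_per_tok=10):
    prefill_ms = prompt_len * prefill_ms_per_tok
    decode_ms = gen_len * decode_ms_per_tok
    decode_share = 100 * decode_ms / (prefill_ms + decode_ms)
    return prefill_ms, decode_ms, decode_share

for prompt, gen in [(100, 50), (500, 200), (2000, 50), (100, 500)]:
    pre, dec, share = breakdown(prompt, gen)
    print(f"{prompt:>5} -> {gen:<4} decode is {share:.1f}% of total")
```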
Tokens Per Second: The User-Facing Metric
Users experience inference speed as "tokens per second." Let's calculate this:
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENS PER SECOND ANALYSIS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For our example (500 prompt → 200 generated): │
│ │
│ Time to first token (TTFT): │
│ = Prefill time │
│ = 40 ms │
│ User waits 40 ms before seeing any output. │
│ │
│ Time between tokens (TBT) / Inter-token latency: │
│ = Decode step time │
│ = 10 ms per token │
│ = 100 tokens/second generation rate │
│ │
│ Total generation time (for 200 tokens): │
│ = TTFT + (200 × TBT) │
│ = 40 ms + 2000 ms │
│ = 2040 ms │
│ │
│ Effective throughput: │
│ = 200 tokens / 2.04 seconds │
│ = 98 tokens/second (output only) │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ User experience: │
│ • Wait 40 ms (imperceptible) │
│ • Text streams at ~100 tokens/sec (~400-500 characters/sec) │
│ • Total wait for full response: ~2 seconds │
│ │
│ This feels reasonably fast for interactive use, but: │
│   • Served one request at a time, this GPU handles only ~0.5 req/sec   │
│ • Each request monopolizes the GPU for ~2 seconds │
│ • Cost per request is significant at cloud GPU prices │
│ │
└─────────────────────────────────────────────────────────────────────────┘
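The latency arithmetic in the box reduces to a couple of lines. A sketch where the metric names follow the text (TTFT = time to first token, ITL = inter-token latency):

```python
# User-facing latency metrics from the two numbers that define them.

def user_metrics(ttft_ms, itl_ms, gen_tokens):
    total_ms = ttft_ms + gen_tokens * itl_ms
    gen_rate = 1000 / itl_ms                    # steady-state tokens/sec
    effective = gen_tokens / (total_ms / 1000)  # includes the TTFT wait
    return total_ms, gen_rate, effective

total_ms, gen_rate, effective = user_metrics(40, 10, 200)
print(total_ms, gen_rate, round(effective, 1))  # 2040 100.0 98.0
```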
The Key Metrics Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY INFERENCE METRICS (500 → 200 tokens, LLaMA-7B) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LATENCY METRICS │
│ ──────────────── │
│ Time to First Token (TTFT): 40 ms │
│ Inter-Token Latency (ITL): 10 ms/token │
│ Total Generation Time: 2040 ms │
│ │
│ THROUGHPUT METRICS │
│ ───────────────── │
│ Generation rate: 100 tokens/sec │
│ Effective throughput: 98 tokens/sec │
│ │
│ EFFICIENCY METRICS │
│ ───────────────── │
│ Prefill GPU utilization: ~55% │
│ Decode GPU utilization: ~1-2% │
│ Overall GPU utilization: ~1.5% │
│ Decode share of total time: 98% │
│ │
│ HARDWARE UTILIZATION │
│ ──────────────────── │
│   Compute used:                 9.34 TFLOPs (total work)                │
│   Compute available:            636 TFLOPs (312 TFLOPS × 2.04 s)        │
│ Utilization ratio: 1.5% │
│ │
│ MEMORY BANDWIDTH │
│ ──────────────── │
│   Prefill bandwidth utilization: ~18% (14 GB / 40 ms ≈ 0.35 TB/s)      │
│ Decode bandwidth utilization: ~70% (14 GB / 10 ms / 2 TB/s) │
│ │
│ Note: Decode achieves HIGHER bandwidth utilization because it's │
│ memory-bound. Prefill is compute-bound so bandwidth isn't saturated. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why This Matters: The Optimization Imperative
This quantitative analysis reveals why LLM inference is such an active area of research and engineering:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE OPTIMIZATION IMPERATIVE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: 98% of inference time is spent in a low-efficiency phase │
│ │
│ IMPLICATION: Even small improvements to decode have large impact │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ If you improve decode by 2× (5 ms/token instead of 10 ms): │
│ Old total time: 40 + 2000 = 2040 ms │
│ New total time: 40 + 1000 = 1040 ms │
│ Speedup: 1.96× for end-to-end latency │
│ │
│ If you improve prefill by 2× (20 ms instead of 40 ms): │
│ Old total time: 40 + 2000 = 2040 ms │
│ New total time: 20 + 2000 = 2020 ms │
│ Speedup: 1.01× (barely noticeable) │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ This is why optimization research focuses heavily on decode: │
│ │
│ • Batching: Process multiple requests together to increase │
│ arithmetic intensity during decode │
│ │
│ • Quantization: Reduce model size (14 GB → 3.5 GB) to reduce │
│ memory bandwidth requirements │
│ │
│ • Speculative decoding: Generate multiple tokens per decode step │
│ to amortize the memory bandwidth cost │
│ │
│ • KV cache compression: Reduce cache size to lower memory reads │
│ │
│ • Continuous batching: Maximize GPU utilization across requests │
│ │
│ All of these target the decode bottleneck in different ways. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary: The Numbers That Matter
For typical LLM inference (500 prompt tokens → 200 generated tokens on LLaMA-7B):
Prefill takes ~40 ms (2% of total time), achieves ~55% GPU utilization, is compute-bound.
Decode takes ~2000 ms (98% of total time), achieves ~2% GPU utilization, is memory-bandwidth-bound.
Each decode step reads ~14 GB of model weights to produce one token.
Overall GPU utilization is ~1.5% — you're paying for 66× more compute than you use.
The utilization paradox: Low utilization doesn't mean "easy work" — it means "waiting for memory." That's why decode is slow despite doing less computation than prefill.
Optimization leverage: 2× improvement in decode → ~2× end-to-end speedup. 2× improvement in prefill → ~1% end-to-end speedup. Focus on decode.
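The leverage numbers are Amdahl's law applied to the two phases. A quick sketch to check them, using this section's 40 ms / 2000 ms split:

```python
# End-to-end speedup from speeding up one phase (Amdahl's law with two
# sequential phases).

def end_to_end_speedup(prefill_ms, decode_ms, prefill_x=1.0, decode_x=1.0):
    old = prefill_ms + decode_ms
    new = prefill_ms / prefill_x + decode_ms / decode_x
    return old / new

print(round(end_to_end_speedup(40, 2000, decode_x=2), 2))   # 1.96
print(round(end_to_end_speedup(40, 2000, prefill_x=2), 2))  # 1.01
```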
Check Your Understanding
Before moving on:
If you're serving LLaMA-7B on an A100 and your users complain about latency, which phase should you optimize first? Why?
A colleague says "Our GPU utilization is only 2%, so we should be able to handle 50× more traffic." What's wrong with this reasoning?
For a use case with 4000-token prompts and 50-token outputs, how would the prefill vs decode time breakdown differ from our 500→200 example?