1.6: The Utilization Paradox — Quantitative Analysis
Setting Up the Concrete Example
Let's work through a complete, realistic example with actual numbers. This will solidify everything we've learned and reveal the "utilization paradox" quantitatively.
Scenario:
- Model: LLaMA-7B
- GPU: NVIDIA A100 (80GB)
- Prompt: 500 tokens
- Generated output: 200 tokens
- Total sequence at end: 700 tokens
Hardware specs we'll use:
- A100 compute: 312 TFLOPS (FP16)
- A100 memory bandwidth: 2 TB/s
- Model size: ~14 GB (FP16 weights)
- KV cache per token: ~0.5 MB (for LLaMA-7B)
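The ~0.5 MB/token figure falls out of the model shapes. A quick derivation in Python, assuming FP16 and the standard LLaMA-7B dimensions (32 layers, hidden size 4096):

```python
# Per-token KV-cache size for LLaMA-7B in FP16: every layer caches one
# K vector and one V vector of length hidden_dim.
n_layers = 32
hidden_dim = 4096
bytes_per_value = 2  # FP16

kv_bytes_per_token = 2 * n_layers * hidden_dim * bytes_per_value  # K and V
print(kv_bytes_per_token / 2**20)  # 0.5 (MiB per token)
```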
Phase 1: Prefill Analysis
Computation Required
For prefill with N=500 tokens through LLaMA-7B (32 layers, hidden_dim=4096):
Per layer:
Q, K, V projections: 3 × 2 × N × d² = 3 × 2 × 500 × 4096² = 50.3 B FLOPs
Attention (Q @ Kᵀ): 2 × N² × d = 2 × 500² × 4096 = 2.0 B FLOPs
Attention (weights @ V): 2 × N² × d = 2 × 500² × 4096 = 2.0 B FLOPs
Output projection: 2 × N × d² = 2 × 500 × 4096² = 16.8 B FLOPs
FFN (two layers): 2 × 2 × N × d × 4d = 4 × 500 × 4096 × 16384 = 134.2 B FLOPs
Total per layer: ~205 B FLOPs
All 32 layers: 32 × 205 B = 6.56 T FLOPs
Output projection to vocab: 2 × 500 × 4096 × 32000 = 131 B FLOPs
Total prefill FLOPs: ~6.7 trillion
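The per-layer arithmetic above can be packaged into a short estimator. A back-of-envelope sketch using the LLaMA-7B shapes from this section (the function and variable names are ours, not from any library):

```python
# Rough prefill FLOP count for a LLaMA-7B-shaped model, following the
# per-layer breakdown above. An estimate, not a profiler measurement.

def prefill_flops(n_tokens, d=4096, n_layers=32, vocab=32000):
    qkv = 3 * 2 * n_tokens * d * d         # Q, K, V projections
    attn = 2 * (2 * n_tokens**2 * d)       # Q @ K^T plus weights @ V
    out_proj = 2 * n_tokens * d * d        # attention output projection
    ffn = 2 * 2 * n_tokens * d * (4 * d)   # two FFN matmuls
    per_layer = qkv + attn + out_proj + ffn
    lm_head = 2 * n_tokens * d * vocab     # final projection to vocab
    return n_layers * per_layer + lm_head

print(prefill_flops(500) / 1e12)  # ≈ 6.7 (TFLOPs, matching the total above)
```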
Memory Transfers
Read model weights: 14 GB (once for the forward pass)
Write KV cache: 500 tokens × 0.5 MB/token = 250 MB
Read/write activations: ~negligible compared to weights
Total memory transfer: ~14.3 GB
Arithmetic Intensity
Arithmetic Intensity = 6.7 T FLOPs / 14.3 GB
= 6.7 × 10¹² / 14.3 × 10⁹
= 468 FLOPs/byte
This is well above the 156 FLOPs/byte threshold (the A100's ridge point: 312 TFLOPS ÷ 2 TB/s) → COMPUTE-BOUND ✓
Time Calculation
Since prefill is compute-bound, time is determined by compute throughput:
Prefill time ≈ Total FLOPs / GPU throughput
= 6.7 T / 312 T/s
= 21.5 ms (theoretical minimum)
Real-world (accounting for overhead, memory latency, kernel launch, etc.):
Prefill time ≈ 35-50 ms
Let's use 40 ms for our analysis.
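The "bound by whichever resource takes longer" logic generalizes into a tiny roofline model. A sketch using the A100 spec-sheet numbers above (peak FLOPS and bandwidth are nominal values, not measured):

```python
# Roofline-style lower bound: a phase takes at least as long as its
# compute time and at least as long as its memory-transfer time, so the
# larger of the two is the bottleneck.

PEAK_FLOPS = 312e12  # A100 FP16 tensor-core peak, FLOP/s
PEAK_BW = 2e12       # A100 HBM bandwidth, bytes/s

def phase_time(flops, bytes_moved):
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    bound = "compute" if t_compute >= t_memory else "memory"
    return max(t_compute, t_memory), bound

t, bound = phase_time(6.7e12, 14.3e9)  # prefill numbers from above
print(f"{t * 1e3:.1f} ms, {bound}-bound")  # 21.5 ms, compute-bound
```

Feeding in the decode numbers from the next section gives the memory-bound 7.15 ms floor per step.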
GPU Utilization During Prefill
Theoretical utilization = Achieved FLOPS / Peak FLOPS
= (6.7 T / 0.040 s) / 312 T/s
= 167.5 T/s / 312 T/s
= 53.7%
Accounting for the fact that not all operations are perfectly parallelizable:
Real-world utilization ≈ 50-70%
This is GOOD utilization for real workloads.
Phase 2: Decode Analysis
Per-Step Computation
For each decode step (processing 1 token):
Per layer:
Q, K, V projections: 3 × 2 × 1 × d² = 3 × 2 × 1 × 4096² = 100.7 M FLOPs
Attention (Q @ Kᵀ): 2 × 1 × seq_len × d = grows with position (~5.7 M FLOPs at seq_len=700)
Attention (weights @ V): 2 × 1 × seq_len × d = grows with position (~5.7 M FLOPs at seq_len=700)
Output projection: 2 × 1 × d² = 33.6 M FLOPs
FFN: 4 × 1 × d × 4d = 268.4 M FLOPs
Total per layer (excluding attention): ~402 M FLOPs
All 32 layers: ~12.9 B FLOPs
Output projection: 2 × 1 × 4096 × 32000 = 262 M FLOPs
Total per decode step: ~13.2 B FLOPs (relatively constant)
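The same breakdown as a sketch. Only the attention term depends on the current sequence length, so the total drifts up slightly as generation proceeds (names are illustrative):

```python
# Rough FLOP count for one decode step of a LLaMA-7B-shaped model.
# Only the attention term depends on how many tokens are already cached.

def decode_step_flops(seq_len, d=4096, n_layers=32, vocab=32000):
    qkv = 3 * 2 * d * d           # projections for the single new token
    attn = 2 * (2 * seq_len * d)  # attend over all cached positions
    out_proj = 2 * d * d
    ffn = 2 * 2 * d * (4 * d)
    per_layer = qkv + attn + out_proj + ffn
    lm_head = 2 * d * vocab
    return n_layers * per_layer + lm_head

print(decode_step_flops(700) / 1e9)  # ≈ 13.5 (billion FLOPs at the last step)
```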
Per-Step Memory Transfers
Read model weights: 14 GB (must read ALL weights for each token!)
Read KV cache: grows from 250 MB (start) to 350 MB (end)
Write KV cache: 0.5 MB (append one token's K,V)
Average memory transfer per step: ~14.3 GB
Arithmetic Intensity (Decode)
Arithmetic Intensity = 13.2 B FLOPs / 14.3 GB
= 13.2 × 10⁹ / 14.3 × 10⁹
= 0.92 FLOPs/byte
This is 170× BELOW the 156 FLOPs/byte threshold → MEMORY-BOUND ✗
Time Per Decode Step
Since decode is memory-bound, time is determined by bandwidth:
Decode step time ≈ Bytes transferred / Memory bandwidth
= 14.3 GB / 2 TB/s
= 7.15 ms (theoretical minimum)
Real-world (overhead, cache effects, kernel launch):
Decode step time ≈ 8-12 ms
Let's use 10 ms for our analysis.
Total Decode Time
Total decode time = 200 tokens × 10 ms/token = 2000 ms = 2.0 seconds
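Since decode is bandwidth-limited, total decode time can be lower-bounded by summing bytes moved per step. A sketch using the figures above (14 GB of weights, 0.5 MB of KV per token, 2 TB/s); this gives the theoretical floor, below the ~10 ms/step real-world figure used in the text:

```python
# Theoretical-minimum decode time: each step re-reads all weights plus
# the KV cache, which grows by one token per step.

WEIGHT_BYTES = 14e9   # FP16 LLaMA-7B weights
KV_PER_TOKEN = 0.5e6  # bytes of cached K,V per token
PEAK_BW = 2e12        # bytes/s

def decode_time(prompt_len, gen_len):
    total = 0.0
    for step in range(gen_len):
        kv_bytes = (prompt_len + step) * KV_PER_TOKEN
        total += (WEIGHT_BYTES + kv_bytes) / PEAK_BW
    return total

print(decode_time(500, 200))  # ≈ 1.43 (seconds, theoretical floor)
```

Note that the weight reads account for 1.4 s of that; the growing KV cache adds only ~0.03 s at this sequence length.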
GPU Utilization During Decode
Theoretical utilization = Achieved FLOPS / Peak FLOPS
= (13.2 B / 0.010 s) / 312 T/s
= 1.32 T/s / 312 T/s
= 0.42%
Real-world utilization ≈ 1-5%
This is TERRIBLE utilization. The GPU is 95-99% idle!
The Complete Picture: Time Breakdown
┌─────────────────────────────────────────────────────────────────────────┐
│ COMPLETE REQUEST: 500 prompt tokens → 200 generated │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL PHASE │
│ ┌────────┐ │
│ │████████│ 40 ms │
│ └────────┘ │
│ • Processes: 500 tokens │
│ • FLOPs: 6.7 trillion │
│ • GPU utilization: ~50-70% │
│ • Bound by: Compute │
│ │
│ DECODE PHASE │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ 2000 ms (50× longer than prefill!) │
│ • Processes: 200 tokens (one at a time) │
│ • FLOPs: 200 × 13.2 B = 2.64 trillion (less than prefill!) │
│ • GPU utilization: ~1-5% │
│ • Bound by: Memory bandwidth │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ TOTAL TIME: 2040 ms │
│ • Prefill: 40 ms (2.0% of total) │
│ • Decode: 2000 ms (98.0% of total) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Utilization Paradox Explained
Here's the paradox in stark terms:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE UTILIZATION PARADOX │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL DECODE │
│ ─────── ────── │
│ Tokens processed: 500 200 │
│ Total FLOPs: 6.7 T 2.64 T │
│ Time taken: 40 ms 2000 ms │
│ GPU utilization: ~60% ~2% │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #1: Decode does LESS computation but takes MORE time │
│ │
│ Prefill: 6.7 T FLOPs in 40 ms = 167 TFLOPS achieved │
│ Decode: 2.64 T FLOPs in 2000 ms = 1.3 TFLOPS achieved │
│ │
│ Decode achieves 128× LOWER throughput despite same hardware! │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #2: Lower utilization means LONGER time, not shorter │
│ │
│ Intuition says: "If GPU is only 2% busy, the work must be easy" │
│ Reality: "GPU is 2% busy because it's WAITING, not because │
│ the work is easy" │
│ │
│ Low utilization = memory bottleneck = slow │
│ High utilization = compute bottleneck = fast (for the work done) │
│ │
│ ───────────────────────────────────────────────────────────────────── │
│ │
│ PARADOX #3: Processing 1 token is not 500× faster than 500 tokens │
│ │
│ If computation scaled linearly: │
│ 1 token should take: 40 ms / 500 = 0.08 ms │
│ │
│ Actual time for 1 token: 10 ms │
│ │
│ That's 125× slower than linear scaling would predict! │
│ │
│ Why? The memory reads don't scale down with batch size. │
│ You read 14 GB of weights whether processing 1 or 500 tokens. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Visualizing the Time Budget
Let's see where every millisecond goes:
┌─────────────────────────────────────────────────────────────────────────┐
│ WHERE DOES THE TIME GO? │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PREFILL (40 ms): │
│ ┌──────────────────────────────────────┬───────────────┐ │
│ │ Computing (matrix ops) │ Memory xfer │ │
│ │ ~30 ms │ ~10 ms │ │
│ └──────────────────────────────────────┴───────────────┘ │
│ The GPU is mostly COMPUTING. Memory transfers overlap with compute. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ DECODE (each 10 ms step): │
│ ┌───┬──────────────────────────────────────────────────────────────┐ │
│ │Cal│ Waiting for memory │ │
│ │0.5│ ~9.5 ms │ │
│ └───┴──────────────────────────────────────────────────────────────┘ │
│ The GPU is mostly WAITING. Compute finishes almost instantly. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ FULL REQUEST TIMELINE: │
│ │
│ 0 ms 40 ms 2040 ms │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────┬──────────────────────────────────────────────────────┐ │
│ │▓▓PREFILL▓▓▓│░░░░░░░░░░░░░░░░░ DECODE ░░░░░░░░░░░░░░░░░░░░░░░░░░░░│ │
│ │ 500 tokens │ token token token token ... (200 tokens) ... token │ │
│ │ parallel │ 1 2 3 4 200 │ │
│ └────────────┴──────────────────────────────────────────────────────┘ │
│ │
│ ▓▓▓ = High GPU utilization (compute-bound) │
│ ░░░ = Low GPU utilization (memory-bound) │
│ │
│ 98% of your inference time is spent with the GPU mostly idle. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Cost Analysis: What Are You Paying For?
Let's translate utilization into dollars:
┌─────────────────────────────────────────────────────────────────────────┐
│ COST EFFICIENCY ANALYSIS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ A100 GPU cost: ~$2/hour (cloud) or ~$15,000 (purchase) │
│ A100 peak capability: 312 TFLOPS │
│ │
│ For our 2.04 second request: │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Phase │ Time │ FLOPs │ Utilization │ "Wasted" capacity │ │
│ ├──────────┼─────────┼──────────┼─────────────┼───────────────────┤ │
│ │ Prefill │ 40 ms │ 6.7 T │ 54% │ 46% │ │
│ │ Decode │ 2000 ms │ 2.64 T │ 0.4% │ 99.6% │ │
│ └──────────┴─────────┴──────────┴─────────────┴───────────────────┘ │
│ │
│ During decode, for every 1 FLOP of useful work, the GPU COULD have │
│ done 250 FLOPs. You're paying for 250× more compute than you use. │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ Total compute available in 2.04 s: 312 T × 2.04 = 636 T FLOPs │
│ Total compute used: 6.7 T + 2.64 T = 9.34 T FLOPs │
│ Overall utilization: 9.34 / 636 = 1.5% │
│ │
│ You're using 1.5% of what you're paying for! │
│ │
└─────────────────────────────────────────────────────────────────────────┘
This is why inference optimization matters economically. If you can increase utilization from 1.5% to 15% (10×), you can serve 10× more requests with the same hardware, or reduce your GPU costs by 90%.
Scaling Analysis: What Happens With Different Inputs?
Let's see how time breakdown changes with different prompt/generation lengths:
┌─────────────────────────────────────────────────────────────────────────┐
│ TIME BREAKDOWN FOR DIFFERENT REQUEST PROFILES │
├───────────────────┬──────────┬──────────┬──────────┬───────────────────┤
│ Scenario │ Prefill │ Decode │ Total │ Decode % of total │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Short prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (100 → 50) │ 10 ms │ 500 ms │ 510 ms │ 98.0% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Medium prompt, │ │ │ │ │
│ medium output │ │ │ │ │
│ (500 → 200) │ 40 ms │ 2000 ms │ 2040 ms │ 98.0% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Long prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (2000 → 50) │ 180 ms │ 500 ms │ 680 ms │ 73.5% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Short prompt, │ │ │ │ │
│ long output │ │ │ │ │
│ (100 → 500) │ 10 ms │ 5000 ms │ 5010 ms │ 99.8% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Long prompt, │ │ │ │ │
│ long output │ │ │ │ │
│ (2000 → 500) │ 180 ms │ 5000 ms │ 5180 ms │ 96.5% │
├───────────────────┼──────────┼──────────┼──────────┼───────────────────┤
│ Very long prompt, │ │ │ │ │
│ short output │ │ │ │ │
│ (8000 → 50) │ 1200 ms │ 550 ms │ 1750 ms │ 31.4% │
└───────────────────┴──────────┴──────────┴──────────┴───────────────────┘
Key observations:
• For most typical use cases, decode dominates (>95% of time)
• Only with very long prompts AND short outputs does prefill become significant
• The longer the generated output, the more decode dominates
• This is why most optimization research focuses on decode
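The table above can be approximated with a two-parameter model: prefill scaling roughly linearly with prompt length (~0.08 ms/token, ignoring the quadratic attention term that matters for very long prompts) and decode at a flat ~10 ms/token. A rough sketch; the constants are fitted to this section's example, not measured:

```python
# Rough time-breakdown model: linear prefill, flat per-token decode.

def breakdown(prompt_len, gen_len, prefill_ms_per_tok=0.08, decode_ms_per_tok=10):
    prefill_ms = prompt_len * prefill_ms_per_tok
    decode_ms = gen_len * decode_ms_per_tok
    decode_share = 100 * decode_ms / (prefill_ms + decode_ms)
    return prefill_ms, decode_ms, decode_share

for prompt, gen in [(100, 50), (500, 200), (2000, 50), (100, 500)]:
    pre, dec, share = breakdown(prompt, gen)
    print(f"{prompt:>5} -> {gen:<4} decode is {share:.1f}% of total")
```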
Tokens Per Second: The User-Facing Metric
Users experience inference speed as "tokens per second." Let's calculate this:
┌─────────────────────────────────────────────────────────────────────────┐
│ TOKENS PER SECOND ANALYSIS │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For our example (500 prompt → 200 generated): │
│ │
│ Time to first token (TTFT): │
│ = Prefill time │
│ = 40 ms │
│ User waits 40 ms before seeing any output. │
│ │
│ Time between tokens (TBT) / Inter-token latency: │
│ = Decode step time │
│ = 10 ms per token │
│ = 100 tokens/second generation rate │
│ │
│ Total generation time (for 200 tokens): │
│ = TTFT + (200 × TBT) │
│ = 40 ms + 2000 ms │
│ = 2040 ms │
│ │
│ Effective throughput: │
│ = 200 tokens / 2.04 seconds │
│ = 98 tokens/second (output only) │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ User experience: │
│ • Wait 40 ms (imperceptible) │
│ • Text streams at ~100 tokens/sec (~400-500 characters/sec) │
│ • Total wait for full response: ~2 seconds │
│ │
│ This feels reasonably fast for interactive use, but: │
│   • Served one request at a time, this GPU handles only ~0.5 req/sec   │
│ • Each request monopolizes the GPU for ~2 seconds │
│ • Cost per request is significant at cloud GPU prices │
│ │
└─────────────────────────────────────────────────────────────────────────┘
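The latency arithmetic in the box reduces to a couple of lines. A sketch where the metric names follow the text (TTFT = time to first token, ITL = inter-token latency):

```python
# User-facing latency metrics from the two numbers that define them.

def user_metrics(ttft_ms, itl_ms, gen_tokens):
    total_ms = ttft_ms + gen_tokens * itl_ms
    gen_rate = 1000 / itl_ms                    # steady-state tokens/sec
    effective = gen_tokens / (total_ms / 1000)  # includes the TTFT wait
    return total_ms, gen_rate, effective

total_ms, gen_rate, effective = user_metrics(40, 10, 200)
print(total_ms, gen_rate, round(effective, 1))  # 2040 100.0 98.0
```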
The Key Metrics Summary
┌─────────────────────────────────────────────────────────────────────────┐
│ KEY INFERENCE METRICS (500 → 200 tokens, LLaMA-7B) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LATENCY METRICS │
│ ──────────────── │
│ Time to First Token (TTFT): 40 ms │
│ Inter-Token Latency (ITL): 10 ms/token │
│ Total Generation Time: 2040 ms │
│ │
│ THROUGHPUT METRICS │
│ ───────────────── │
│ Generation rate: 100 tokens/sec │
│ Effective throughput: 98 tokens/sec │
│ │
│ EFFICIENCY METRICS │
│ ───────────────── │
│ Prefill GPU utilization: ~55% │
│ Decode GPU utilization: ~1-2% │
│ Overall GPU utilization: ~1.5% │
│ Decode share of total time: 98% │
│ │
│ HARDWARE UTILIZATION │
│ ──────────────────── │
│   Compute used:                 9.34 TFLOPs (total work)                │
│   Compute available:            636 TFLOPs (312 TFLOPS × 2.04 s)        │
│ Utilization ratio: 1.5% │
│ │
│ MEMORY BANDWIDTH │
│ ──────────────── │
│   Prefill bandwidth utilization: ~18% (14 GB / 40 ms ≈ 0.35 TB/s)      │
│ Decode bandwidth utilization: ~70% (14 GB / 10 ms / 2 TB/s) │
│ │
│ Note: Decode achieves HIGHER bandwidth utilization because it's │
│ memory-bound. Prefill is compute-bound so bandwidth isn't saturated. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Why This Matters: The Optimization Imperative
This quantitative analysis reveals why LLM inference is such an active area of research and engineering:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE OPTIMIZATION IMPERATIVE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: 98% of inference time is spent in a low-efficiency phase │
│ │
│ IMPLICATION: Even small improvements to decode have large impact │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ If you improve decode by 2× (5 ms/token instead of 10 ms): │
│ Old total time: 40 + 2000 = 2040 ms │
│ New total time: 40 + 1000 = 1040 ms │
│ Speedup: 1.96× for end-to-end latency │
│ │
│ If you improve prefill by 2× (20 ms instead of 40 ms): │
│ Old total time: 40 + 2000 = 2040 ms │
│ New total time: 20 + 2000 = 2020 ms │
│ Speedup: 1.01× (barely noticeable) │
│ │
│ ─────────────────────────────────────────────────────────────────── │
│ │
│ This is why optimization research focuses heavily on decode: │
│ │
│ • Batching: Process multiple requests together to increase │
│ arithmetic intensity during decode │
│ │
│ • Quantization: Reduce model size (14 GB → 3.5 GB) to reduce │
│ memory bandwidth requirements │
│ │
│ • Speculative decoding: Generate multiple tokens per decode step │
│ to amortize the memory bandwidth cost │
│ │
│ • KV cache compression: Reduce cache size to lower memory reads │
│ │
│ • Continuous batching: Maximize GPU utilization across requests │
│ │
│ All of these target the decode bottleneck in different ways. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary: The Numbers That Matter
For typical LLM inference (500 prompt tokens → 200 generated tokens on LLaMA-7B):
Prefill takes ~40 ms (2% of total time), achieves ~55% GPU utilization, is compute-bound.
Decode takes ~2000 ms (98% of total time), achieves ~2% GPU utilization, is memory-bandwidth-bound.
Each decode step reads ~14 GB of model weights to produce one token.
Overall GPU utilization is ~1.5% — you're paying for 66× more compute than you use.
The utilization paradox: Low utilization doesn't mean "easy work" — it means "waiting for memory." That's why decode is slow despite doing less computation than prefill.
Optimization leverage: 2× improvement in decode → ~2× end-to-end speedup. 2× improvement in prefill → ~1% end-to-end speedup. Focus on decode.
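The leverage numbers are Amdahl's law applied to the two phases. A quick sketch to check them, using this section's 40 ms / 2000 ms split:

```python
# End-to-end speedup from speeding up one phase (Amdahl's law with two
# sequential phases).

def end_to_end_speedup(prefill_ms, decode_ms, prefill_x=1.0, decode_x=1.0):
    old = prefill_ms + decode_ms
    new = prefill_ms / prefill_x + decode_ms / decode_x
    return old / new

print(round(end_to_end_speedup(40, 2000, decode_x=2), 2))   # 1.96
print(round(end_to_end_speedup(40, 2000, prefill_x=2), 2))  # 1.01
```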
Check Your Understanding
Before moving on:
If you're serving LLaMA-7B on an A100 and your users complain about latency, which phase should you optimize first? Why?
A colleague says "Our GPU utilization is only 2%, so we should be able to handle 50× more traffic." What's wrong with this reasoning?
For a use case with 4000-token prompts and 50-token outputs, how would the prefill vs decode time breakdown differ from our 500→200 example?