1.5: Decode — Computational Deep Dive
The Paradox We Need to Explain
In Artifact 4, we saw that prefill processes 512 tokens in ~30-50ms. Simple math suggests decode should be faster—after all, we're only processing 1 token per step instead of 512.
But reality is the opposite: each decode step takes ~10 ms, so generating 200 tokens takes ~2000 ms. That's roughly 40× slower than prefill, despite processing less than half as many tokens.
This artifact explains why. The answer lies in arithmetic intensity and the memory-bandwidth bound.
What Happens in a Single Decode Step
Let's trace through exactly what operations occur when we generate one token. Assume we're at position seq_len (the cache already contains seq_len tokens' worth of K and V).
Operation 1: Embedding Lookup
Input: 1 token ID
Output: [1, hidden_dim] tensor, e.g., [1, 4096]
This is a simple table lookup—negligible time.
Operation 2: Q, K, V Projections (Per Layer)
hidden_states: [1, 4096]
W_Q: [4096, 4096] ← 34 MB to read (fp16)!
W_K: [4096, 4096] ← 34 MB to read (fp16)!
W_V: [4096, 4096] ← 34 MB to read (fp16)!
Q_new = hidden_states @ W_Q → [1, 4096]
K_new = hidden_states @ W_K → [1, 4096]
V_new = hidden_states @ W_V → [1, 4096]
Operation 3: KV Cache Read and Update
Read from cache:
K_cached: [seq_len, num_heads, head_dim] e.g., [1000, 32, 128] = 8 MB (fp16)
V_cached: [seq_len, num_heads, head_dim] e.g., [1000, 32, 128] = 8 MB (fp16)
Append new K, V:
K_full = concat(K_cached, K_new) → [seq_len+1, 32, 128]
V_full = concat(V_cached, V_new) → [seq_len+1, 32, 128]
Write back to cache: ~16 MB (a naive concat rewrites the whole cache; a preallocated cache writes only the new ~16 KB)
Operation 4: Attention Computation
Q_new: [1, 32, 128] (32 heads, 128 dim each)
K_full: [seq_len+1, 32, 128]
For each head:
scores = Q_new @ K_full.T → [1, seq_len+1]
weights = softmax(scores) → [1, seq_len+1]
output = weights @ V_full → [1, 128]
This is a [1, seq_len+1] attention—one query against many keys.
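This single-query attention step can be sketched in NumPy (illustrative names; a real implementation fuses these steps inside one GPU kernel):

```python
import numpy as np

def decode_attention(q_new, k_full, v_full):
    """One query attending over all cached keys/values, per head.

    q_new:  [num_heads, head_dim]           -- query for the single new token
    k_full: [seq_len+1, num_heads, head_dim]
    v_full: [seq_len+1, num_heads, head_dim]
    returns [num_heads, head_dim]
    """
    head_dim = q_new.shape[-1]
    # scores: [num_heads, seq_len+1] -- one row of the attention matrix per head
    scores = np.einsum("hd,shd->hs", q_new, k_full) / np.sqrt(head_dim)
    # softmax over the key axis (numerically stabilized)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of values: [num_heads, head_dim]
    return np.einsum("hs,shd->hd", weights, v_full)
```

Note there is no [seq, seq] score matrix here, unlike prefill: decode produces a single [1, seq_len+1] row per head.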
Operation 5: Output Projection
attention_output: [1, 4096]
W_O: [4096, 4096] ← 67 MB to read!
output = attention_output @ W_O → [1, 4096]
Operation 6: FFN Layers
x: [1, 4096]
W1: [4096, 16384] ← 134 MB to read (fp16)!
W2: [16384, 4096] ← 134 MB to read (fp16)!
ffn_out = GELU(x @ W1) @ W2 → [1, 4096]
Operation 7: Repeat for All Layers, Then Output
After all 32 layers:
final_hidden: [1, 4096]
W_output: [4096, vocab_size] e.g., [4096, 32000] = 262 MB to read (fp16)!
logits = final_hidden @ W_output → [1, 32000]
next_token = sample(logits)
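Putting the seven operations together, here is a minimal single-layer decode step in NumPy. This is a scaled-down sketch, not LLaMA's actual implementation: layer norms, residual connections, rotary embeddings, and the 32-layer stack are omitted, and every key in `params` is an illustrative name:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def decode_step(token_id, k_cache, v_cache, params):
    """One decode step for a single transformer layer.

    k_cache, v_cache: [seq_len, heads, head_dim] -- grow by one row per step.
    Returns (next_token_id, k_cache, v_cache).
    """
    p = params
    heads, head_dim = k_cache.shape[1], k_cache.shape[2]
    x = p["embed"][token_id][None, :]            # Op 1: embedding lookup, [1, H]

    # Op 2: Q/K/V projections -- each reads a full [H, H] weight matrix
    q = (x @ p["W_q"]).reshape(heads, head_dim)
    k = (x @ p["W_k"]).reshape(1, heads, head_dim)
    v = (x @ p["W_v"]).reshape(1, heads, head_dim)

    # Op 3: naive cache update (real engines preallocate and write in place)
    k_cache = np.concatenate([k_cache, k], axis=0)
    v_cache = np.concatenate([v_cache, v], axis=0)

    # Op 4: one query row against all cached keys, per head
    scores = np.einsum("hd,shd->hs", q, k_cache) / np.sqrt(head_dim)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    attn = np.einsum("hs,shd->hd", w, v_cache).reshape(1, -1)

    # Ops 5-6: output projection, then FFN
    x = attn @ p["W_o"]
    x = gelu(x @ p["W_1"]) @ p["W_2"]

    # Op 7: unembedding to vocabulary logits (greedy sampling shown)
    logits = x @ p["W_out"]
    return int(logits.argmax()), k_cache, v_cache
```

Each call touches every weight matrix exactly once to emit a single token, which is the root of the inefficiency this artifact analyzes.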
The Arithmetic Intensity Disaster
Now let's calculate the arithmetic intensity for these decode operations. Remember from Artifact 4: we need ~156 FLOPs/byte to keep an A100 fully utilized.
Q Projection (Decode vs Prefill Comparison)
Decode (1 token):
hidden_states: [1, 4096]
W_Q: [4096, 4096]
FLOPs = 2 × 1 × 4096 × 4096 = 33.6 million
Bytes = 2 × (1×4096 + 4096×4096 + 1×4096) ≈ 34 MB (dominated by W_Q, at 2 bytes/element for fp16)
Arithmetic Intensity = 33.6M / 33.6M ≈ 1 FLOP/byte
This is 156× BELOW the threshold!
Prefill (512 tokens) — for comparison:
Arithmetic Intensity = 410 FLOPs/byte (from [Artifact 4](https://huggingface.co/blog/atharv6f/prefill-computational-deep-dive))
The same weight matrix, the same memory read, but 1/512th the computation. The GPU reads 67 MB of weights to perform a trivial amount of math.
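The comparison above can be reproduced in a few lines (assuming fp16, i.e., 2 bytes per element, matching the model sizes used throughout this artifact):

```python
def arithmetic_intensity(batch, d_in=4096, d_out=4096, bytes_per_elem=2):
    """FLOPs per byte for a [batch, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out                  # multiply + accumulate
    bytes_moved = bytes_per_elem * (batch * d_in      # read activations
                                    + d_in * d_out    # read weights
                                    + batch * d_out)  # write output
    return flops / bytes_moved

print(arithmetic_intensity(1))    # decode:  ≈ 1 FLOP/byte
print(arithmetic_intensity(512))  # prefill: ≈ 410 FLOPs/byte
```

The weight-read term dominates the byte count at batch 1, but is amortized over 512 rows of activations during prefill, which is the entire difference between the two phases.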
All Decode Operations Analysis
┌─────────────────────────────────────────────────────────────────────────┐
│             DECODE OPERATIONS (1 token, seq_len=1000, fp16)             │
├──────────────────────┬──────────────┬─────────────┬────────────────────┤
│ Operation            │ FLOPs        │ Bytes       │ Arithmetic         │
│                      │              │             │ Intensity          │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ Q projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ K projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ V projection         │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ KV cache read        │ ~0           │ 16 MB       │ ~0 FLOPs/byte ✗    │
│ Q @ Kᵀ (attention)   │ 8.2 M        │ ~0 (cached) │ high (but tiny)    │
│ softmax @ V          │ 8.2 M        │ ~0 (cached) │ high (but tiny)    │
│ Output projection    │ 33.6 M       │ 34 MB       │ 1 FLOP/byte ✗      │
│ FFN layer 1          │ 134 M        │ 134 MB      │ 1 FLOP/byte ✗      │
│ FFN layer 2          │ 134 M        │ 134 MB      │ 1 FLOP/byte ✗      │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ Total per layer      │ ~420 M       │ ~420 MB     │ ~1 FLOP/byte ✗     │
│ All 32 layers        │ ~13 B        │ ~13.5 GB    │ ~1 FLOP/byte ✗     │
│ + Output projection  │ +0.26 B      │ +0.26 GB    │                    │
├──────────────────────┼──────────────┼─────────────┼────────────────────┤
│ TOTAL per decode step│ ~13 B        │ ~14 GB      │ ~1 FLOP/byte ✗     │
└──────────────────────┴──────────────┴─────────────┴────────────────────┘
Required for compute-bound: 156 FLOPs/byte
Actual achieved: ~1 FLOP/byte
We are ~156× below the threshold. The GPU is massively underutilized.
The Memory Bandwidth Bottleneck
Since decode is memory-bandwidth-bound, the time is determined by how fast we can read data, not how fast we can compute.
Time Calculation
Decode time ≈ Bytes to transfer / Memory bandwidth
For one decode step on LLaMA-7B on A100:
Bytes to transfer:
• Model weights: ~14 GB (every weight is read once per decode step, across all 32 layers)
• KV cache read: ~16 MB × 32 layers ≈ 0.5 GB (for seq_len=1000)
• KV cache write: ~16 KB × 32 layers ≈ 0.5 MB (append new K,V)
• Activations: negligible
Total: ~14.5 GB per decode step
A100 memory bandwidth: 2 TB/s
Time = 14.5 GB / 2 TB/s ≈ 7.3 ms
(Real-world: ~8-12 ms due to overhead, cache effects, etc.)
Notice what determines the time: reading the model weights. Every decode step must read essentially all model parameters from GPU memory. The actual computation (13 billion FLOPs) could be done in microseconds if the data were already available.
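This time model is a one-liner; the constants below are the rough figures used above, not measured values:

```python
# Bandwidth-bound model: decode step time ≈ bytes moved / memory bandwidth.
weights_bytes = 14e9      # LLaMA-7B in fp16
kv_read_bytes = 0.5e9     # ~16 MB/layer x 32 layers at seq_len=1000
bandwidth     = 2e12      # A100 HBM, ~2 TB/s

step_time = (weights_bytes + kv_read_bytes) / bandwidth
print(f"{step_time * 1e3:.2f} ms per decode step")   # → 7.25 ms per decode step
```

Note that compute capability appears nowhere in this model; that is what it means to be memory-bandwidth-bound.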
Where the Time Goes
┌─────────────────────────────────────────────────────────────────────────┐
│ TIME BREAKDOWN: ONE DECODE STEP │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Total time: ~10 ms │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████████████████████████████████████████████│ │
│ │█████████████████ READING WEIGHTS ██████████████████████████████████│ │
│ │████████████████████████████████████████████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ ▲ ▲ │
│ 0 ms ~7 ms 10 ms│
│ │
│ ┌──────────────────────┐ │
│ │░░ KV Cache Read ░░░░░│ ~0.5 ms (grows with sequence length) │
│ └──────────────────────┘ │
│ │
│ ┌────┐ │
│ │Comp│ < 0.1 ms (actual arithmetic) │
│ └────┘ │
│ │
│ The GPU spends ~95% of its time waiting for data from memory. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
GPU Utilization During Decode
Because decode is memory-bound, GPU compute utilization is extremely low:
Decode GPU Utilization (typical):
┌────────────────────────────────────────────────────────────────┐
│████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
│███░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└────────────────────────────────────────────────────────────────┘
▲ ▲ ▲
0% ~5-10% 100%
Theoretical utilization = Actual FLOPs / Peak FLOPs
= 13B / (312 TFLOPS × 0.010s)
= 13B / 3.12T
= 0.4%
Even accounting for overhead, utilization is typically < 10%.
You're paying for a $10,000 GPU and using < 10% of its compute power.
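The utilization estimate works out as follows, using the rough per-step figures from above:

```python
peak_flops = 312e12   # A100 fp16 tensor-core peak, FLOPs/s
step_flops = 13e9     # FLOPs actually performed per decode step
step_time  = 0.010    # ~10 ms observed per step

utilization = step_flops / (peak_flops * step_time)
print(f"{utilization:.1%}")   # → 0.4%
```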
This is the fundamental inefficiency of LLM inference that drives most optimization research.
Why You Must Read Weights Every Step
You might wonder: "Why read the weights every step? Can't we keep them in cache?"
The answer reveals a fundamental constraint of GPU architecture.
GPU Memory Hierarchy
┌─────────────────────────────────────────────────────────────────────────┐
│ GPU MEMORY HIERARCHY │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Registers │ Fastest (~20 TB/s effective) │
│ │ ~256 KB/SM │ But tiny capacity │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ L1 Cache / │ Very fast (~10 TB/s) │
│ │ Shared Memory │ ~128-256 KB per SM │
│ │ │ Total: ~10-20 MB across GPU │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ L2 Cache │ Fast (~5 TB/s) │
│ │ ~40-50 MB │ Shared across all SMs │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HBM (Main │ "Slow" (2 TB/s) │
│ │ GPU Memory) │ But large: 40-80 GB │
│ │ │ This is where model weights live │
│ └─────────────────┘ │
│ │
│ LLaMA-7B weights: ~14 GB │
│ L2 cache: ~50 MB │
│ │
│ The weights are 280× larger than the cache! │
│ They MUST be read from HBM for each operation. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The model weights simply don't fit in cache. Every decode step must stream them from HBM (main GPU memory), which is the bandwidth bottleneck.
The Decode Data Flow
For each decode step:
HBM (GPU Memory) Compute Units
┌──────────────┐ ┌──────────┐
│ │ Stream weights │ │
│ Weights │ ───────────────────────►│ Matrix │
│ (14 GB) │ @ 2 TB/s │ Multiply│
│ │ │ │
├──────────────┤ Stream KV cache │ │
│ KV Cache │ ───────────────────────►│ (for │
│ (1+ GB) │ @ 2 TB/s │ 1 token) │
│ │ │ │
└──────────────┘ └──────────┘
│ │
│ ~7-10 ms │ ~0.04 ms
│ (memory transfer) │ (compute)
│ │
└──────────────────────────────────────┘
The compute units finish in microseconds, then wait
for the next batch of data to arrive from memory.
Time Complexity of Decode
Time Per Step
Each decode step time is dominated by memory bandwidth:
Time per decode step ≈ (Model size + KV cache size) / Memory bandwidth
For LLaMA-7B at sequence position s:
Model: 14 GB (constant)
KV cache: ~0.5 MB × s (grows linearly with sequence length)
At s=500: Time ≈ (14 GB + 0.25 GB) / 2 TB/s ≈ 7.1 ms
At s=1000: Time ≈ (14 GB + 0.5 GB) / 2 TB/s ≈ 7.3 ms
At s=4000: Time ≈ (14 GB + 2 GB) / 2 TB/s ≈ 8.0 ms
At s=16000: Time ≈ (14 GB + 8 GB) / 2 TB/s ≈ 11.0 ms
For moderate sequences, time per step is roughly constant (dominated by model weights). For very long sequences, the KV cache read starts to matter.
Total Decode Time
To generate G tokens:
Total decode time ≈ G × (time per step)
≈ G × (Model size / Bandwidth)
≈ O(G) — linear in generated tokens
For 200 tokens on LLaMA-7B:
Time ≈ 200 × 7.5 ms = 1500 ms = 1.5 seconds
Plus KV cache overhead as sequence grows (minor for moderate lengths).
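This scaling can be sketched as a simple model (constants are the rough fp16 figures used throughout this artifact):

```python
WEIGHTS = 14e9         # bytes: LLaMA-7B weights in fp16
KV_PER_TOKEN = 0.5e6   # bytes: ~0.5 MB of K+V per token across 32 layers
BW = 2e12              # bytes/s: A100 memory bandwidth

def step_time(s):
    """Seconds for the decode step at sequence position s."""
    return (WEIGHTS + KV_PER_TOKEN * s) / BW

for s in (500, 1000, 4000, 16000):
    print(f"s={s:>5}: {step_time(s) * 1e3:.1f} ms")

def total_time(prompt_len, gen_tokens):
    """Total decode seconds: one bandwidth-bound step per generated token."""
    return sum(step_time(prompt_len + i) for i in range(gen_tokens))

print(f"{total_time(512, 200):.2f} s")   # ~1.4 s
```

The weight term is constant per step, so total time is linear in generated tokens; the KV term adds a small quadratic correction that only matters at long sequence lengths.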
Scaling Behavior
┌─────────────────────────────────────────────────────────────────────────┐
│ DECODE TIME VS TOKENS GENERATED │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Time (ms) │
│ ▲ │
│ │ × │
│ 4000 ┤ × │
│ │ × │
│ 3000 ┤ × │
│ │ × │
│ 2000 ┤ × │
│ │ × │
│ 1000 ┤ × │
│ │ × │
│ 500 ┤ × │
│ │ × │
│ └──────────────────────────────────────────────────────────────────►
│ 50 100 150 200 250 300 350 400 450 500 │
│ Tokens generated │
│ │
│ Nearly linear: each token costs ~7-10 ms regardless of position │
│ (slight increase for very long sequences due to KV cache growth) │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Prefill vs Decode: The Complete Picture
Let's put everything together with a direct comparison:
┌─────────────────────────────────────────────────────────────────────────┐
│ PREFILL VS DECODE: COMPLETE COMPARISON │
├────────────────────────┬────────────────────────┬───────────────────────┤
│ Metric │ PREFILL │ DECODE │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Tokens per forward pass│ N (all prompt tokens) │ 1 │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Arithmetic intensity   │ ~400-500 FLOPs/byte    │ ~1 FLOP/byte          │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Bound by │ Compute (FLOPS) │ Memory bandwidth │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ GPU utilization │ 70-85% │ < 10% │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Time determined by │ FLOPs / Compute power │ Bytes / Bandwidth │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Bottleneck │ How fast GPU computes │ How fast GPU reads │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Weight matrices │ Read once, used N times│ Read once, used once │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Scales with │ O(N) to O(N²) │ O(G) linear │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Example time │ 512 tokens: ~30-50 ms │ 200 tokens: ~1500 ms │
├────────────────────────┼────────────────────────┼───────────────────────┤
│ Parallelizable │ Yes (within pass) │ No (sequential) │
└────────────────────────┴────────────────────────┴───────────────────────┘
The Fundamental Inefficiency
This is the core problem of LLM inference:
┌─────────────────────────────────────────────────────────────────────────┐
│ THE DECODE INEFFICIENCY PROBLEM │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ For each decode step: │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ READ: ~14 GB of model weights from GPU memory │ │
│ │ COMPUTE: ~13 billion FLOPs (trivial for GPU) │ │
│ │ OUTPUT: 1 token │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ You read the ENTIRE MODEL to produce ONE TOKEN. │
│ │
│ The GPU is a supercomputer that could do 312 trillion ops/second, │
│ but it's stuck waiting for data 95% of the time. │
│ │
│ It's like having a Formula 1 car stuck in traffic. │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ This is why inference optimization matters. │
│ This is why batching helps (more on this in Artifact 7). │
│ This is why model compression, quantization, and speculative │
│ decoding exist—they all attack this fundamental bottleneck. │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Summary: Why Decode Is Slow
Batch size of 1: Each step processes exactly one token, so we can't amortize memory reads across multiple tokens.
Low arithmetic intensity (~1 FLOP/byte): Roughly 156× below the ~156 FLOPs/byte threshold needed for compute-bound operation.
Memory-bandwidth-bound: Time is determined by how fast we can read model weights from GPU memory, not by computation speed.
Must read all weights every step: The model weights (~14 GB for LLaMA-7B) don't fit in cache and must be streamed from HBM for every single token.
Sequential dependency: Can't parallelize across tokens because token N+1 depends on token N.
GPU underutilization: Despite having enormous compute capability, the GPU sits idle ~90-95% of the time waiting for data.
The result: Generating 200 tokens takes ~1.5-2 seconds even on a $10,000+ GPU, with most of that time spent waiting for memory transfers.
Check Your Understanding
Before moving on:
If you double the memory bandwidth of a GPU (keeping compute constant), roughly how much faster would decode get? What about prefill?
Why doesn't processing 1 token take 1/512th the time of processing 512 tokens, even though it's 1/512th the computation?
A user complains: "I have an A100 but decode is slow. Should I upgrade to a GPU with more FLOPS?" What would you tell them?