2. Attention Optimizations: From Standard Attention to FlashAttention
Standard attention spends most of its time moving data, not computing. The N×N intermediate matrices account for roughly 97% of memory traffic, and that traffic grows quadratically with sequence length. This series covers why that happens and how FlashAttention fixes it.
Interactive Tool
Explore the concepts hands-on with the FlashAttention Explorer:
Try the FlashAttention Explorer
Benchmark attention backends on real models, compare GQA vs MQA memory usage, analyze prefill vs decode phases, and calculate memory budgets with configurable precision.
The Series
The Problem
2.1: Standard Attention — The IO Problem
How the naive softmax(QK^T/√d)V implementation creates 33× more memory traffic than necessary, and why softmax's global dependency forces the N×N matrix to be written to HBM.
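The 33× figure falls out of simple byte counting. A quick sketch, using the N=4096, d=128 configuration mentioned below and assuming fp16 (2 bytes per element); the ratio treats "only Q, K, V, O touch HBM" as the ideal, so it is an upper bound on the savings:

```python
N, d, BYTES = 4096, 128, 2  # sequence length, head dim, fp16

# Unavoidable traffic: read Q, K, V once each, write O once.
qkvo = 4 * N * d * BYTES
# Intermediate traffic: write scores S, read S for softmax,
# write probabilities P, read P for the PV matmul.
nxn = 4 * N * N * BYTES
standard = qkvo + nxn

print(nxn / standard)   # ≈ 0.97: almost all traffic is the N x N matrices
print(standard / qkvo)  # = 33: the savings if N x N never touches HBM
```

The same arithmetic explains why the problem worsens with context length: the `qkvo` term grows linearly in N while the `nxn` term grows quadratically.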
The Solution
2.2a: FlashAttention — The Tiling Strategy
Dividing Q, K, V into blocks that fit in SRAM so the N×N matrix never needs to exist in HBM. The loop structure, why reloading K/V blocks is cheap, and where tiling helps most.
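The block sizes are chosen so one tile step fits on-chip. A minimal sizing sketch, assuming fp16 and an illustrative 100 KB shared-memory budget (actual SRAM per streaming multiprocessor varies by GPU generation):

```python
SRAM_BUDGET = 100 * 1024  # assumed shared-memory budget in bytes
BYTES = 2                 # fp16

def tile_bytes(Br, Bc, d):
    # One tile step holds: a Br x d block of Q, a Bc x d block each of
    # K and V, and the Br x Bc tile of scores.
    return BYTES * (Br * d + 2 * Bc * d + Br * Bc)

# With d = 128, 64 x 64 blocks fit comfortably under the budget.
assert tile_bytes(64, 64, 128) <= SRAM_BUDGET
```

Nothing in this accounting depends on N, which is the point: the N×N matrix only ever exists one tile at a time.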
2.2b: FlashAttention — Online Softmax
Computing exact softmax from partial scores. Tracking a running max and sum, rescaling when the max changes, and why this produces the same result as standard softmax.
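The running max/sum bookkeeping described above can be sketched in NumPy and checked against the standard materialized softmax (the dimensions are illustrative, not a real model config):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 256, 64, 32  # sequence length, head dim, block size (illustrative)
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
scale = 1.0 / np.sqrt(d)

# Reference: materialize the full N x N score matrix.
S = Q @ K.T * scale
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V

# Tiled version: visit K/V in blocks of B rows, tracking per query row a
# running max m, running sum l, and unnormalized accumulator acc.
m = np.full(N, -np.inf, dtype=np.float32)
l = np.zeros(N, dtype=np.float32)
acc = np.zeros((N, d), dtype=np.float32)
for j in range(0, N, B):
    S_blk = Q @ K[j:j+B].T * scale            # one B-column slice of scores
    m_new = np.maximum(m, S_blk.max(axis=1))
    alpha = np.exp(m - m_new)                 # rescale old stats when the max grows
    p = np.exp(S_blk - m_new[:, None])
    l = alpha * l + p.sum(axis=1)
    acc = alpha[:, None] * acc + p @ V[j:j+B]
    m = m_new
out = acc / l[:, None]

assert np.allclose(out, ref, atol=1e-5)  # exact softmax, never held N x N at once
```

Note the result is exact, not approximate: the rescaling by `alpha` retroactively corrects every earlier block for the new maximum, so the final normalization matches the one-shot softmax.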
2.2c: FlashAttention — IO Analysis and Evolution
Quantifying the IO improvement, why more FLOPs can mean less time, and what FlashAttention-2 and FlashAttention-3 changed.
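"More FLOPs, less time" follows from a roofline-style argument: a memory-bound kernel's runtime is set by bytes moved, so extra recomputation is effectively free. A back-of-envelope sketch with assumed hardware numbers (300 TFLOP/s, 2 TB/s; not any specific GPU) and an assumed ~10% FLOP overhead for the online-softmax rescaling:

```python
PEAK_FLOPS = 300e12  # assumed fp16 compute throughput
BANDWIDTH = 2e12     # assumed HBM bandwidth

def kernel_time(flops, bytes_moved):
    # A kernel takes at least as long as its slower resource demands.
    return max(flops / PEAK_FLOPS, bytes_moved / BANDWIDTH)

N, d = 4096, 128
flops = 4 * N * N * d                         # 2N^2*d for QK^T + 2N^2*d for PV
standard_bytes = 2 * (4 * N * d + 4 * N * N)  # fp16: Q,K,V,O plus 4 passes over N x N
flash_bytes = 2 * (4 * N * d)                 # idealized: only Q,K,V,O touch HBM

t_std = kernel_time(flops, standard_bytes)       # memory-bound on the N x N traffic
t_flash = kernel_time(1.1 * flops, flash_bytes)  # assumed ~10% extra FLOPs, compute-bound
assert t_flash < t_std
```

Under these assumptions standard attention is memory-bound and FlashAttention is compute-bound, which is exactly the regime change the IO analysis predicts.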
Key Takeaways
- Standard attention wastes 97% of memory traffic on intermediate N×N matrices
- FlashAttention reduces HBM traffic by ~33× for typical configs (N=4096, d=128)
- Tiling + online softmax = exact attention without materializing O(N²)
- The benefit is largest during prefill; single-token decode sees little difference
- GQA/MQA reduce KV cache size separately — orthogonal to FlashAttention
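The last takeaway is worth a concrete number: GQA/MQA shrink the KV cache by sharing K/V heads across query heads, independently of how attention is computed. A sketch for a hypothetical 32-query-head model (all dimensions illustrative, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V, per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical model at 4096 tokens of context:
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)  # MHA: 2 GiB
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=4096)  # GQA: 512 MiB
mqa = kv_cache_bytes(layers=32, kv_heads=1,  head_dim=128, seq_len=4096)  # MQA: 64 MiB
```

Note that FlashAttention appears nowhere in this function: the cache size depends only on the model's KV-head count, which is why the two optimizations compose rather than compete.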
This is Part 2 of a series on LLM systems. Part 1 covers inference foundations — the autoregressive loop, KV cache, prefill vs decode, and the utilization paradox.