2. Attention Optimizations: From Standard Attention to FlashAttention
Standard attention spends most of its time moving data, not computing. The N×N intermediate matrices account for roughly 97% of memory traffic, and that traffic grows quadratically with sequence length. This series covers why that happens and how FlashAttention fixes it.
Interactive Tool
Explore the concepts hands-on with the FlashAttention Explorer:
Try the FlashAttention Explorer
Benchmark attention backends on real models, compare GQA vs MQA memory usage, analyze prefill vs decode phases, and calculate memory budgets with configurable precision.
The Series
The Problem
2.1: Standard Attention — The IO Problem
How the naive softmax(QK^T/√d)V implementation creates 33× more memory traffic than necessary, and why softmax's global dependency forces the N×N matrix to be written to HBM.
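The 33× figure falls out of simple byte counting. A quick sketch, using the N=4096, d=128 configuration mentioned below and assuming fp16 (2 bytes per element); the ratio treats "only Q, K, V, O touch HBM" as the ideal, so it is an upper bound on the savings:

```python
N, d, BYTES = 4096, 128, 2  # sequence length, head dim, fp16

# Unavoidable traffic: read Q, K, V once each, write O once.
qkvo = 4 * N * d * BYTES
# Intermediate traffic: write scores S, read S for softmax,
# write probabilities P, read P for the PV matmul.
nxn = 4 * N * N * BYTES
standard = qkvo + nxn

print(nxn / standard)   # ≈ 0.97: almost all traffic is the N x N matrices
print(standard / qkvo)  # = 33: the savings if N x N never touches HBM
```

The same arithmetic explains why the problem worsens with context length: the `qkvo` term grows linearly in N while the `nxn` term grows quadratically.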
The Solution
2.2a: FlashAttention — The Tiling Strategy
Dividing Q, K, V into blocks that fit in SRAM so the N×N matrix never needs to exist in HBM. The loop structure, why reloading K/V blocks is cheap, and where tiling helps most.
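The block sizes are chosen so one tile step fits on-chip. A minimal sizing sketch, assuming fp16 and an illustrative 100 KB shared-memory budget (actual SRAM per streaming multiprocessor varies by GPU generation):

```python
SRAM_BUDGET = 100 * 1024  # assumed shared-memory budget in bytes
BYTES = 2                 # fp16

def tile_bytes(Br, Bc, d):
    # One tile step holds: a Br x d block of Q, a Bc x d block each of
    # K and V, and the Br x Bc tile of scores.
    return BYTES * (Br * d + 2 * Bc * d + Br * Bc)

# With d = 128, 64 x 64 blocks fit comfortably under the budget.
assert tile_bytes(64, 64, 128) <= SRAM_BUDGET
```

Nothing in this accounting depends on N, which is the point: the N×N matrix only ever exists one tile at a time.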
2.2b: FlashAttention — Online Softmax
Computing exact softmax from partial scores. Tracking a running max and sum, rescaling when the max changes, and why this produces the same result as standard softmax.
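The running max/sum bookkeeping described above can be sketched in NumPy and checked against the standard materialized softmax (the dimensions are illustrative, not a real model config):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B = 256, 64, 32  # sequence length, head dim, block size (illustrative)
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
scale = 1.0 / np.sqrt(d)

# Reference: materialize the full N x N score matrix.
S = Q @ K.T * scale
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V

# Tiled version: visit K/V in blocks of B rows, tracking per query row a
# running max m, running sum l, and unnormalized accumulator acc.
m = np.full(N, -np.inf, dtype=np.float32)
l = np.zeros(N, dtype=np.float32)
acc = np.zeros((N, d), dtype=np.float32)
for j in range(0, N, B):
    S_blk = Q @ K[j:j+B].T * scale            # one B-column slice of scores
    m_new = np.maximum(m, S_blk.max(axis=1))
    alpha = np.exp(m - m_new)                 # rescale old stats when the max grows
    p = np.exp(S_blk - m_new[:, None])
    l = alpha * l + p.sum(axis=1)
    acc = alpha[:, None] * acc + p @ V[j:j+B]
    m = m_new
out = acc / l[:, None]

assert np.allclose(out, ref, atol=1e-5)  # exact softmax, never held N x N at once
```

Note the result is exact, not approximate: the rescaling by `alpha` retroactively corrects every earlier block for the new maximum, so the final normalization matches the one-shot softmax.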
2.2c: FlashAttention — IO Analysis and Evolution
Quantifying the IO improvement, why more FLOPs can mean less time, and what FlashAttention-2 and FlashAttention-3 changed.
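"More FLOPs, less time" follows from a roofline-style argument: a memory-bound kernel's runtime is set by bytes moved, so extra recomputation is effectively free. A back-of-envelope sketch with assumed hardware numbers (300 TFLOP/s, 2 TB/s; not any specific GPU) and an assumed ~10% FLOP overhead for the online-softmax rescaling:

```python
PEAK_FLOPS = 300e12  # assumed fp16 compute throughput
BANDWIDTH = 2e12     # assumed HBM bandwidth

def kernel_time(flops, bytes_moved):
    # A kernel takes at least as long as its slower resource demands.
    return max(flops / PEAK_FLOPS, bytes_moved / BANDWIDTH)

N, d = 4096, 128
flops = 4 * N * N * d                         # 2N^2*d for QK^T + 2N^2*d for PV
standard_bytes = 2 * (4 * N * d + 4 * N * N)  # fp16: Q,K,V,O plus 4 passes over N x N
flash_bytes = 2 * (4 * N * d)                 # idealized: only Q,K,V,O touch HBM

t_std = kernel_time(flops, standard_bytes)       # memory-bound on the N x N traffic
t_flash = kernel_time(1.1 * flops, flash_bytes)  # assumed ~10% extra FLOPs, compute-bound
assert t_flash < t_std
```

Under these assumptions standard attention is memory-bound and FlashAttention is compute-bound, which is exactly the regime change the IO analysis predicts.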
Key Takeaways
- Standard attention wastes 97% of memory traffic on intermediate N×N matrices
- FlashAttention reduces HBM traffic by ~33× for typical configs (N=4096, d=128)
- Tiling + online softmax = exact attention without materializing O(N²)
- The benefit is largest during prefill; single-token decode sees little difference
- GQA/MQA reduce KV cache size separately — orthogonal to FlashAttention
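The last takeaway is worth a concrete number: GQA/MQA shrink the KV cache by sharing K/V heads across query heads, independently of how attention is computed. A sketch for a hypothetical 32-query-head model (all dimensions illustrative, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V, per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical model at 4096 tokens of context:
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)  # MHA: 2 GiB
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=4096)  # GQA: 512 MiB
mqa = kv_cache_bytes(layers=32, kv_heads=1,  head_dim=128, seq_len=4096)  # MQA: 64 MiB
```

Note that FlashAttention appears nowhere in this function: the cache size depends only on the model's KV-head count, which is why the two optimizations compose rather than compete.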
This is Part 2 of a series on LLM systems. Part 1 covers inference foundations — the autoregressive loop, KV cache, prefill vs decode, and the utilization paradox.