
Flash Attention Benchmarks - Aggregated Results


This document combines benchmark results from multiple attention implementations using cross-file dependencies.


Combined Summary and Visualization

Cell: combine | 36.17s
LOADING BENCHMARK DATA
Flash (PyTorch SDPA)    : /repo/flash_attn/impls/.uvnote/cache/327a3408e7cdfeef6984786686ce13137074d9f083e6e434c29f02589d28a0f8
MemEff (PyTorch SDPA)   : /repo/flash_attn/impls/.uvnote/cache/25ca9e52daa50b9289780b3e1302f2949db718140ef9eedd44a8a554afaff9ee
Flash Attn 2            : None
xFormers                : /repo/flash_attn/impls/.uvnote/cache/6802a31176fbf22c1f5dd5442cf5ae77d8e3527d679642244908984c16933902
SageAttention           : None
Compiled (default)      : /repo/flash_attn/impls/.uvnote/cache/bd779935ea10d468a5a99c29b029da0e0ef4dc2a7b82bc8595d04b2f142a3a44
Compiled (max-autotune) : /repo/flash_attn/impls/.uvnote/cache/f4bc4785407df53e53f91c190279cdf3dbe3cf7028e2e352d1cc90b92bfcf86e
HF Kernels Flash Attn   : /repo/flash_attn/impls/.uvnote/cache/58c243a8f4effc711ed67ad97c7fbf2124304388a17b4b8e4e43e20e6019e9c9
HF Kernels Flash Attn3  : /repo/flash_attn/impls/.uvnote/cache/65da999faf55d11c76155fa1d198e77708e1fe8247e3d0b5fd7093a206551ce5

✓ Found Flash (PyTorch SDPA): /repo/flash_attn/impls/.uvnote/cache/327a3408e7cdfeef6984786686ce13137074d9f083e6e434c29f02589d28a0f8/attn.jsonl
✓ Found MemEff (PyTorch SDPA): /repo/flash_attn/impls/.uvnote/cache/25ca9e52daa50b9289780b3e1302f2949db718140ef9eedd44a8a554afaff9ee/attn.jsonl
✗ No cache dir for Flash Attn 2
✓ Found xFormers: /repo/flash_attn/impls/.uvnote/cache/6802a31176fbf22c1f5dd5442cf5ae77d8e3527d679642244908984c16933902/attn.jsonl
✗ No cache dir for SageAttention
✓ Found Compiled (default): /repo/flash_attn/impls/.uvnote/cache/bd779935ea10d468a5a99c29b029da0e0ef4dc2a7b82bc8595d04b2f142a3a44/attn_default.jsonl
✓ Found Compiled (max-autotune): /repo/flash_attn/impls/.uvnote/cache/f4bc4785407df53e53f91c190279cdf3dbe3cf7028e2e352d1cc90b92bfcf86e/attn_max_autotune.jsonl
✓ Found HF Kernels Flash Attn: /repo/flash_attn/impls/.uvnote/cache/58c243a8f4effc711ed67ad97c7fbf2124304388a17b4b8e4e43e20e6019e9c9/attn.jsonl
✓ Found HF Kernels Flash Attn3: /repo/flash_attn/impls/.uvnote/cache/65da999faf55d11c76155fa1d198e77708e1fe8247e3d0b5fd7093a206551ce5/attn.jsonl

COMBINED BENCHMARK SUMMARY

impl                               wl         p50(ms)  ok
hf_kernels_flash_attn              flux_L128  0.34     True
hf_kernels_flash_attn              flux_L256  0.37     True
hf_kernels_flash_attn              flux_L320  0.49     True
hf_kernels_flash_attn              flux_L384  0.51     True
hf_kernels_flash_attn              flux_L448  0.53     True
hf_kernels_flash_attn              flux_L512  0.56     True
hf_kernels_flash_attn3             flux_L128  0.36     True
hf_kernels_flash_attn3             flux_L256  0.39     True
hf_kernels_flash_attn3             flux_L320  0.52     True
hf_kernels_flash_attn3             flux_L384  0.53     True
hf_kernels_flash_attn3             flux_L448  0.57     True
hf_kernels_flash_attn3             flux_L512  0.57     True
torch_flash_compiled_default       flux_L128  0.52     True
torch_flash_compiled_default       flux_L256  0.56     True
torch_flash_compiled_default       flux_L320  0.69     True
torch_flash_compiled_default       flux_L384  0.72     True
torch_flash_compiled_default       flux_L448  0.74     True
torch_flash_compiled_default       flux_L512  0.77     True
torch_flash_compiled_max_autotune  flux_L128  0.65     True
torch_flash_compiled_max_autotune  flux_L256  0.68     True
torch_flash_compiled_max_autotune  flux_L320  0.82     True
torch_flash_compiled_max_autotune  flux_L384  0.85     True
torch_flash_compiled_max_autotune  flux_L448  0.88     True
torch_flash_compiled_max_autotune  flux_L512  0.92     True
torch_flash_ma                     flux_L128  0.48     True
torch_flash_ma                     flux_L256  0.53     True
torch_flash_ma                     flux_L320  0.65     True
torch_flash_ma                     flux_L384  0.68     True
torch_flash_ma                     flux_L448  0.71     True
torch_flash_ma                     flux_L512  0.74     True
torch_mem_eff                      flux_L128  0.59     True
torch_mem_eff                      flux_L256  0.65     True
torch_mem_eff                      flux_L320  0.78     True
torch_mem_eff                      flux_L384  0.79     True
torch_mem_eff                      flux_L448  0.85     True
torch_mem_eff                      flux_L512  0.95     True
xformers_meff                      flux_L128  0.45     True
xformers_meff                      flux_L256  0.47     True
xformers_meff                      flux_L320  0.60     True
xformers_meff                      flux_L384  0.60     True
xformers_meff                      flux_L448  0.64     True
xformers_meff                      flux_L512  0.64     True

GENERATING COMBINED VISUALIZATION

Loaded 42 records
Saved latency.png
✓ Combined visualization saved as latency.png

ANALYSIS COMPLETE
Total implementations analyzed: 7

Implementations included:
  ✓ Flash (PyTorch SDPA)
  ✓ MemEff (PyTorch SDPA)
  ✓ xFormers
  ✓ Compiled (default)
  ✓ Compiled (max-autotune)
  ✓ HF Kernels Flash Attn
  ✓ HF Kernels Flash Attn3
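The p50(ms) column in the summary above can be derived by grouping raw latency samples by (implementation, workload) and taking the median. The sketch below assumes record fields named "impl", "wl", and "latency_ms" to match the table headers; the notebook's actual schema may differ.

```python
import statistics
from collections import defaultdict

def summarize_p50(records: list[dict]) -> dict[tuple[str, str], float]:
    """Median latency per (impl, workload) pair, rounded as in the table."""
    samples = defaultdict(list)
    for rec in records:
        # Group every timing sample under its implementation + workload key.
        samples[(rec["impl"], rec["wl"])].append(rec["latency_ms"])
    return {key: round(statistics.median(vals), 2)
            for key, vals in samples.items()}
```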

Artifacts:

latency.png
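A figure like the latency.png artifact can be reproduced from the summary table alone: one line per implementation, p50 latency against FLUX sequence length. The p50 values below are copied from the table; the styling is illustrative, not the notebook's actual plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

seq_lens = [128, 256, 320, 384, 448, 512]
# p50 latency (ms) per implementation, from the combined summary table.
p50 = {
    "hf_kernels_flash_attn":             [0.34, 0.37, 0.49, 0.51, 0.53, 0.56],
    "hf_kernels_flash_attn3":            [0.36, 0.39, 0.52, 0.53, 0.57, 0.57],
    "torch_flash_compiled_default":      [0.52, 0.56, 0.69, 0.72, 0.74, 0.77],
    "torch_flash_compiled_max_autotune": [0.65, 0.68, 0.82, 0.85, 0.88, 0.92],
    "torch_flash_ma":                    [0.48, 0.53, 0.65, 0.68, 0.71, 0.74],
    "torch_mem_eff":                     [0.59, 0.65, 0.78, 0.79, 0.85, 0.95],
    "xformers_meff":                     [0.45, 0.47, 0.60, 0.60, 0.64, 0.64],
}

fig, ax = plt.subplots(figsize=(8, 5))
for impl, latencies in p50.items():
    ax.plot(seq_lens, latencies, marker="o", label=impl)
ax.set_xlabel("FLUX sequence length")
ax.set_ylabel("p50 latency (ms)")
ax.set_title("Flash attention implementations: p50 latency")
ax.legend(fontsize=8)
fig.savefig("latency.png", dpi=150)
```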