HF Kernels - Flash Attention

Hugging Face Kernels Flash Attention Benchmark

Cell: benchmark | 6.08s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels import get_kernel
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark

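# Fetch the prebuilt flash-attn kernel from the Hugging Face Hub (cached after the first download)
hf_kernels_flash_attn = get_kernel("kernels-community/flash-attn")


def hf_flash_attention(query, key, value):
    """Non-causal flash attention forward pass via Hugging Face Kernels."""
    # fwd returns a sequence; element 0 is taken as the attention output.
    return hf_kernels_flash_attn.fwd(query, key, value, is_causal=False)[0]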


run_benchmark(
    kernel_type=KernelTypeEnum.ATTENTION,
    impl_name="hf_kernels_flash_attn",
    impl_tags={"family": "hf-kernels", "backend": "flash-attn", "compile": "none"},
    impl_func=hf_flash_attention,
)
Running attention benchmark on cuda with 6 workloads.

======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L128_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         3.64%     160.058us        41.50%       1.823ms       1.823ms       0.000us         0.00%       3.744ms       3.744ms             1  
                               _flash_attn_9e27194::fwd         1.78%      78.347us        37.86%       1.663ms     554.208us       2.792ms       100.00%       3.744ms       1.248ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       2.794ms       100.05%       2.794ms       2.794ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       2.792ms       100.00%       2.792ms     930.800us             3  
                                Activity Buffer Request        33.00%       1.449ms        33.00%       1.449ms       1.449ms     951.685us        34.08%     951.685us     951.685us             1  
                                 cudaDeviceGetAttribute         0.13%       5.638us         0.13%       5.638us       0.376us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.40%      17.551us         1.19%      52.122us      17.374us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.79%      34.571us         0.79%      34.571us      11.524us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.57%      24.890us         0.57%      24.890us       2.766us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.28%      12.210us         0.28%      12.210us       4.070us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.92%      40.292us         0.92%      40.292us      13.431us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        58.50%       2.569ms        58.50%       2.569ms       2.569ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.392ms
Self CUDA time total: 2.792ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L256_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.22%      99.144us        37.48%       1.673ms       1.673ms       0.000us         0.00%       3.949ms       3.949ms             1  
                               _flash_attn_9e27194::fwd         1.20%      53.462us        35.26%       1.574ms     524.654us       2.953ms       100.00%       3.949ms       1.316ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       2.955ms       100.05%       2.955ms       2.955ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       2.953ms       100.00%       2.953ms     984.436us             3  
                                Activity Buffer Request        32.23%       1.439ms        32.23%       1.439ms       1.439ms     995.807us        33.72%     995.807us     995.807us             1  
                                 cudaDeviceGetAttribute         0.10%       4.621us         0.10%       4.621us       0.308us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.17%       7.710us         0.56%      24.861us       8.287us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.38%      17.151us         0.38%      17.151us       5.717us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.47%      21.122us         0.47%      21.122us       2.347us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.08%       3.791us         0.08%       3.791us       1.264us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.61%      27.380us         0.61%      27.380us       9.127us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.52%       2.791ms        62.52%       2.791ms       2.791ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.464ms
Self CUDA time total: 2.953ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L320_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.58%     116.955us        37.54%       1.702ms       1.702ms       0.000us         0.00%       4.041ms       4.041ms             1  
                               _flash_attn_9e27194::fwd         1.53%      69.255us        34.96%       1.585ms     528.314us       3.010ms       100.00%       4.041ms       1.347ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.012ms       100.05%       3.012ms       3.012ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.010ms       100.00%       3.010ms       1.003ms             3  
                                Activity Buffer Request        31.53%       1.430ms        31.53%       1.430ms       1.430ms       1.031ms        34.26%       1.031ms       1.031ms             1  
                                 cudaDeviceGetAttribute         0.10%       4.450us         0.10%       4.450us       0.297us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.18%       8.151us         0.57%      25.801us       8.600us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.39%      17.650us         0.39%      17.650us       5.883us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.48%      21.771us         0.48%      21.771us       2.419us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.10%       4.360us         0.10%       4.360us       1.453us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         0.66%      29.790us         0.66%      29.790us       9.930us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        62.46%       2.832ms        62.46%       2.832ms       2.832ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.534ms
Self CUDA time total: 3.010ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L384_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.39%     114.805us        40.03%       1.925ms       1.925ms       0.000us         0.00%       4.094ms       4.094ms             1  
                               _flash_attn_9e27194::fwd         1.09%      52.653us        37.65%       1.810ms     603.407us       3.063ms       100.00%       4.094ms       1.365ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.065ms       100.05%       3.065ms       3.065ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.063ms       100.00%       3.063ms       1.021ms             3  
                                Activity Buffer Request        29.78%       1.432ms        29.78%       1.432ms       1.432ms       1.031ms        33.65%       1.031ms       1.031ms             1  
                                 cudaDeviceGetAttribute         0.10%       4.861us         0.10%       4.861us       0.324us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.16%       7.720us         0.55%      26.331us       8.777us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.39%      18.611us         0.39%      18.611us       6.204us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.45%      21.731us         0.45%      21.731us       2.415us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.08%       3.728us         0.08%       3.728us       1.243us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         5.59%     268.862us         5.59%     268.862us      89.621us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        59.97%       2.884ms        59.97%       2.884ms       2.884ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.809ms
Self CUDA time total: 3.063ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L448_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.13%     113.755us        35.84%       1.918ms       1.918ms       0.000us         0.00%       4.786ms       4.786ms             1  
                               _flash_attn_9e27194::fwd         1.02%      54.483us        33.71%       1.804ms     601.364us       3.588ms       100.00%       4.786ms       1.595ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.590ms       100.04%       3.590ms       3.590ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.588ms       100.00%       3.588ms       1.196ms             3  
                                Activity Buffer Request        26.99%       1.445ms        26.99%       1.445ms       1.445ms       1.198ms        33.38%       1.198ms       1.198ms             1  
                                 cudaDeviceGetAttribute         0.08%       4.270us         0.08%       4.270us       0.285us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.15%       8.039us         0.48%      25.640us       8.547us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.33%      17.601us         0.33%      17.601us       5.867us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.40%      21.582us         0.40%      21.582us       2.398us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.07%       3.700us         0.07%       3.700us       1.233us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.67%     249.891us         4.67%     249.891us      83.297us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        64.16%       3.434ms        64.16%       3.434ms       3.434ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.351ms
Self CUDA time total: 3.588ms



======================================================================
PROFILE TRACE: hf_kernels_flash_attn | cuda_attn_L512_bfloat16
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  hf_kernels_flash_attn         2.08%     111.044us        35.25%       1.879ms       1.879ms       0.000us         0.00%       4.816ms       4.816ms             1  
                               _flash_attn_9e27194::fwd         0.99%      52.834us        33.17%       1.768ms     589.427us       3.606ms       100.00%       4.816ms       1.605ms             3  
                                  hf_kernels_flash_attn         0.00%       0.000us         0.00%       0.000us       0.000us       3.607ms       100.05%       3.607ms       3.607ms             1  
void flash::flash_fwd_kernel<Flash_fwd_kernel_traits...         0.00%       0.000us         0.00%       0.000us       0.000us       3.606ms       100.00%       3.606ms       1.202ms             3  
                                Activity Buffer Request        26.56%       1.416ms        26.56%       1.416ms       1.416ms       1.210ms        33.55%       1.210ms       1.210ms             1  
                                 cudaDeviceGetAttribute         0.08%       4.460us         0.08%       4.460us       0.297us       0.000us         0.00%       0.000us       0.000us            15  
                                       aten::empty_like         0.14%       7.500us         0.49%      26.051us       8.684us       0.000us         0.00%       0.000us       0.000us             3  
                                    aten::empty_strided         0.35%      18.551us         0.35%      18.551us       6.184us       0.000us         0.00%       0.000us       0.000us             3  
                                            aten::empty         0.41%      21.960us         0.41%      21.960us       2.440us       0.000us         0.00%       0.000us       0.000us             9  
                                   cudaFuncSetAttribute         0.08%       4.009us         0.08%       4.009us       1.336us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         4.55%     242.792us         4.55%     242.792us      80.931us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize        64.75%       3.452ms        64.75%       3.452ms       3.452ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 5.332ms
Self CUDA time total: 3.606ms
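
Each trace above records three launches of `flash_fwd_kernel`, so the mean GPU time per launch is the "Self CUDA time total" divided by three. The sketch below redoes that arithmetic for all six workloads using the totals printed in the traces. The dictionary and the helper name `per_call_ms` are illustrative and not part of the benchmark tooling. Because the printed totals are rounded to three decimals, the results may differ slightly from the profiler's own "CUDA time avg" column.

```python
# Self CUDA time totals in milliseconds, copied from the six profile traces.
self_cuda_total_ms = {
    "cuda_attn_L128_bfloat16": 2.792,
    "cuda_attn_L256_bfloat16": 2.953,
    "cuda_attn_L320_bfloat16": 3.010,
    "cuda_attn_L384_bfloat16": 3.063,
    "cuda_attn_L448_bfloat16": 3.588,
    "cuda_attn_L512_bfloat16": 3.606,
}

# "# of Calls" reported for flash_fwd_kernel in every trace.
CALLS_PER_TRACE = 3


def per_call_ms(total_ms, calls=CALLS_PER_TRACE):
    """Mean GPU time per kernel launch (hypothetical helper for this sketch)."""
    return total_ms / calls


for workload, total in self_cuda_total_ms.items():
    print(f"{workload}: {per_call_ms(total):.3f} ms per kernel launch")
```

For `cuda_attn_L128_bfloat16` this gives roughly 0.931 ms per launch, in the same range as the 0.96 ms p50 reported in the summary table.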


impl                     wl                         p50(ms)  ok
hf_kernels_flash_attn    cuda_attn_L128_bfloat16     0.96  True
hf_kernels_flash_attn    cuda_attn_L256_bfloat16     1.01  True
hf_kernels_flash_attn    cuda_attn_L320_bfloat16     1.06  True
hf_kernels_flash_attn    cuda_attn_L384_bfloat16     1.05  True
hf_kernels_flash_attn    cuda_attn_L448_bfloat16     1.22  True
hf_kernels_flash_attn    cuda_attn_L512_bfloat16     1.21  True
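
One way to read the summary table is to compare how latency grows against how the workload grows. The sketch below assumes that the `L<N>` part of each workload name is the sequence length; the report does not state this explicitly, nor does it give batch size, head count, or head dimension. The p50 values are copied from the table, and `seq_len` is a hypothetical helper.

```python
import re

# p50 latencies in milliseconds, copied from the summary table.
p50_ms = {
    "cuda_attn_L128_bfloat16": 0.96,
    "cuda_attn_L256_bfloat16": 1.01,
    "cuda_attn_L320_bfloat16": 1.06,
    "cuda_attn_L384_bfloat16": 1.05,
    "cuda_attn_L448_bfloat16": 1.22,
    "cuda_attn_L512_bfloat16": 1.21,
}


def seq_len(workload):
    """Extract N from a name like 'cuda_attn_L128_bfloat16'.

    Assumption: N is the sequence length of the workload.
    """
    match = re.search(r"_L(\d+)_", workload)
    if match is None:
        raise ValueError(f"no L<N> field in workload name: {workload}")
    return int(match.group(1))


baseline = "cuda_attn_L128_bfloat16"
for workload, latency in p50_ms.items():
    length_ratio = seq_len(workload) / seq_len(baseline)
    latency_ratio = latency / p50_ms[baseline]
    print(f"{workload}: {length_ratio:.1f}x length, {latency_ratio:.2f}x latency")
```

Under that assumption, going from L128 to L512 is a 4.0x increase in length for about a 1.26x increase in p50 latency.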
UV Install Logs
Fetching 20 files: 100%|██████████| 20/20 [00:01<00:00, 11.64it/s]

Artifacts:

attention.jsonl
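
The report does not show the contents of `attention.jsonl`, so the following is only a sketch of how such an artifact could be inspected, assuming it follows the usual JSON Lines convention of one JSON value per line. The file location and the record fields are unknown here; the code therefore prints whatever keys it finds rather than expecting a particular schema. `read_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    artifact = Path("attention.jsonl")  # assumed location of the artifact
    if artifact.exists():
        for record in read_jsonl(artifact):
            # Show the field names only; the schema is not documented in this report.
            print(sorted(record) if isinstance(record, dict) else record)
    else:
        print(f"{artifact} not found")
```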