HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.25s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Dec 19 23:01:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   39C    P0             82W /  350W |       0MiB /  46068MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark
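
For orientation, the operation being benchmarked can be written in a few lines of plain PyTorch. This is a reference sketch under one assumption: that `silu_and_mul` treats the first half of the last dimension as the gate (passed through SiLU) and multiplies it by the second half. The `vllm::act_and_mul_kernel` name in the profile traces is consistent with that convention, but the split order is not stated on this page. The name `swiglu_reference` is illustrative and not part of the `kernels` package.

```python
import torch
import torch.nn.functional as F


def swiglu_reference(x: torch.Tensor) -> torch.Tensor:
    """Plain-PyTorch SwiGLU: silu(first half) * second half along the last dim.

    Assumption: the gate is the first half of the last dimension.
    """
    if x.shape[-1] % 2 != 0:
        raise ValueError("last dimension must be even")
    d = x.shape[-1] // 2
    gate, up = x[..., :d], x[..., d:]
    return F.silu(gate) * up


# Small CPU example: the output is half as wide as the input.
x = torch.randn(4, 16)
y = swiglu_reference(x)
print(tuple(y.shape))
```

On a CUDA device, comparing this reference against the kernel output with `torch.testing.assert_close` (using tolerances suited to bfloat16) would be one way to cross-check the split order.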

Cell: benchmark | 8.49s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


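def hf_kernels_swiglu(input_tensor):
    # The output is half as wide as the input along the last dimension.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # Preallocate the output buffer that silu_and_mul writes into.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)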


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      76.129us      1844.21%      76.129us      76.129us             1  
                                      hf_kernels_swiglu         8.60%     174.603us        99.27%       2.015ms       2.015ms       0.000us         0.00%       5.568us       5.568us             1  
                      _activation_23bf3fb::silu_and_mul         0.97%      19.670us        88.54%       1.797ms     599.020us       4.128us       100.00%       5.568us       1.856us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.128us       100.00%       4.128us       1.376us             3  
                                Activity Buffer Request        85.37%       1.733ms        85.37%       1.733ms       1.733ms       1.440us        34.88%       1.440us       1.440us             1  
                                            aten::empty         2.13%      43.191us         2.13%      43.191us      14.397us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.20%      44.752us         2.20%      44.752us      14.917us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.73%      14.741us         0.73%      14.741us      14.741us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.030ms
Self CUDA time total: 4.128us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      62.783us      1582.23%      62.783us      62.783us             1  
                                      hf_kernels_swiglu         4.95%      92.601us        99.70%       1.863ms       1.863ms       0.000us         0.00%       5.312us       5.312us             1  
                      _activation_23bf3fb::silu_and_mul         1.25%      23.392us        93.77%       1.753ms     584.220us       3.968us       100.00%       5.312us       1.771us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us       100.00%       3.968us       1.323us             3  
                                Activity Buffer Request        91.17%       1.704ms        91.17%       1.704ms       1.704ms       1.344us        33.87%       1.344us       1.344us             1  
                                            aten::empty         0.97%      18.160us         0.97%      18.160us       6.053us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.35%      25.221us         1.35%      25.221us       8.407us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.620us         0.30%       5.620us       5.620us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.869ms
Self CUDA time total: 3.968us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      61.887us      1264.03%      61.887us      61.887us             1  
                                      hf_kernels_swiglu         4.90%      91.392us        99.70%       1.861ms       1.861ms       0.000us         0.00%       6.528us       6.528us             1  
                      _activation_23bf3fb::silu_and_mul         1.06%      19.772us        93.81%       1.751ms     583.690us       4.896us       100.00%       6.528us       2.176us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us       100.00%       4.896us       1.632us             3  
                                Activity Buffer Request        91.42%       1.706ms        91.42%       1.706ms       1.706ms       1.632us        33.33%       1.632us       1.632us             1  
                                            aten::empty         1.00%      18.580us         1.00%      18.580us       6.193us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.33%      24.870us         1.33%      24.870us       8.290us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.30%       5.640us         0.30%       5.640us       5.640us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.867ms
Self CUDA time total: 4.896us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      66.431us      1560.88%      66.431us      66.431us             1  
                                      hf_kernels_swiglu         4.62%      96.552us        99.72%       2.084ms       2.084ms       0.000us         0.00%       5.696us       5.696us             1  
                      _activation_23bf3fb::silu_and_mul         0.92%      19.230us        94.20%       1.969ms     656.267us       4.256us       100.00%       5.696us       1.899us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.256us       100.00%       4.256us       1.419us             3  
                                Activity Buffer Request        82.63%       1.727ms        82.63%       1.727ms       1.727ms       1.440us        33.83%       1.440us       1.440us             1  
                                            aten::empty         0.91%      18.961us         0.91%      18.961us       6.320us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        10.64%     222.454us        10.64%     222.454us      74.151us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.800us         0.28%       5.800us       5.800us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.090ms
Self CUDA time total: 4.256us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      62.753us      1065.60%      62.753us      62.753us             1  
                                      hf_kernels_swiglu         4.32%      90.233us        99.73%       2.084ms       2.084ms       0.000us         0.00%       7.842us       7.842us             1  
                      _activation_23bf3fb::silu_and_mul         0.98%      20.530us        94.51%       1.975ms     658.421us       5.889us       100.00%       7.842us       2.614us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.889us       100.00%       5.889us       1.963us             3  
                                Activity Buffer Request        83.43%       1.744ms        83.43%       1.744ms       1.744ms       1.953us        33.16%       1.953us       1.953us             1  
                                            aten::empty         0.90%      18.820us         0.90%      18.820us       6.273us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        10.09%     210.974us        10.09%     210.974us      70.325us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       5.680us         0.27%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.090ms
Self CUDA time total: 5.889us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      58.974us       761.74%      58.974us      58.974us             1  
                                      hf_kernels_swiglu        14.39%      83.563us        99.11%     575.543us     575.543us       0.000us         0.00%      10.333us      10.333us             1  
                      _activation_23bf3fb::silu_and_mul         3.37%      19.590us        81.67%     474.270us     158.090us       7.742us       100.00%      10.333us       3.444us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.742us       100.00%       7.742us       2.581us             3  
                                Activity Buffer Request        43.30%     251.476us        43.30%     251.476us     251.476us       2.591us        33.47%       2.591us       2.591us             1  
                                            aten::empty         3.05%      17.710us         3.05%      17.710us       5.903us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.99%     203.204us        34.99%     203.204us      67.735us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.89%       5.190us         0.89%       5.190us       5.190us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 580.733us
Self CUDA time total: 7.742us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      60.191us       908.68%      60.191us      60.191us             1  
                                      hf_kernels_swiglu        14.49%      83.902us        99.19%     574.293us     574.293us       0.000us         0.00%       8.832us       8.832us             1  
                      _activation_23bf3fb::silu_and_mul         3.38%      19.561us        81.54%     472.101us     157.367us       6.624us       100.00%       8.832us       2.944us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.624us       100.00%       6.624us       2.208us             3  
                                Activity Buffer Request        43.39%     251.205us        43.39%     251.205us     251.205us       2.208us        33.33%       2.208us       2.208us             1  
                                            aten::empty         3.16%      18.290us         3.16%      18.290us       6.097us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.77%     201.335us        34.77%     201.335us      67.112us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.81%       4.680us         0.81%       4.680us       4.680us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 578.973us
Self CUDA time total: 6.624us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.480us       685.45%      64.480us      64.480us             1  
                                      hf_kernels_swiglu         4.47%      90.662us        99.76%       2.023ms       2.023ms       0.000us         0.00%      12.543us      12.543us             1  
                      _activation_23bf3fb::silu_and_mul         0.98%      19.960us        94.38%       1.913ms     637.817us       9.407us       100.00%      12.543us       4.181us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.407us       100.00%       9.407us       3.136us             3  
                                Activity Buffer Request        83.63%       1.695ms        83.63%       1.695ms       1.695ms       3.136us        33.34%       3.136us       3.136us             1  
                                            aten::empty         0.91%      18.421us         0.91%      18.421us       6.140us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.77%     198.004us         9.77%     198.004us      66.001us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.24%       4.950us         0.24%       4.950us       4.950us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.027ms
Self CUDA time total: 9.407us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      60.576us       465.11%      60.576us      60.576us             1  
                                      hf_kernels_swiglu        15.18%      83.082us        99.12%     542.352us     542.352us       0.000us         0.00%      17.408us      17.408us             1  
                      _activation_23bf3fb::silu_and_mul         3.66%      20.041us        80.66%     441.340us     147.113us      13.024us       100.00%      17.408us       5.803us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.024us       100.00%      13.024us       4.341us             3  
                                Activity Buffer Request        41.24%     225.625us        41.24%     225.625us     225.625us       4.384us        33.66%       4.384us       4.384us             1  
                                            aten::empty         3.28%      17.930us         3.28%      17.930us       5.977us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.76%     195.674us        35.76%     195.674us      65.225us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.88%       4.811us         0.88%       4.811us       4.811us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 547.163us
Self CUDA time total: 13.024us
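
The traces above mix units (`us` and `ms`), and the "CUDA time avg" column is the total divided by "# of Calls". The following is a minimal sketch for normalizing those strings to microseconds and reproducing the per-call average. The helper names `to_us` and `per_call_avg_us` are hypothetical and only illustrate the arithmetic; they are not part of the profiler or the benchmark tools.

```python
import re

# Conversion factors from the unit suffixes seen in the traces to microseconds.
_UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}
_TIME_RE = re.compile(r"^([0-9]*\.?[0-9]+)(us|ms|s)$")


def to_us(text: str) -> float:
    """Convert a profiler time string such as '2.015ms' to microseconds."""
    m = _TIME_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    value, unit = m.groups()
    return float(value) * _UNIT_TO_US[unit]


def per_call_avg_us(total: str, calls: int) -> float:
    """Average time per call in microseconds."""
    return to_us(total) / calls


# Kernel row of the cuda_T512_D2048 trace: 13.024us over 3 calls.
print(f"{per_call_avg_us('13.024us', 3):.3f}us")  # prints 4.341us
```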


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
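
The summary rows are ordered lexicographically by workload name, which is why `D768` appears after `D2048`. The sketch below re-sorts rows by the numeric T and D values. It assumes the `<device>_T<number>_D<number>` naming pattern seen in this output. The helpers `parse_workload` and `sort_rows` are hypothetical, and the rows are copied from the first three lines of the table above.

```python
import re

# Matches names such as 'cuda_T128_D768' (pattern inferred from the output above).
WORKLOAD_RE = re.compile(r"^(?P<device>[a-z]+)_T(?P<t>\d+)_D(?P<d>\d+)$")


def parse_workload(name: str) -> tuple[str, int, int]:
    """Split a workload name into (device, T, D) with numeric T and D."""
    m = WORKLOAD_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    return m["device"], int(m["t"]), int(m["d"])


def sort_rows(rows):
    """Sort (impl, workload, p50_ms) rows by device, then T, then D."""
    return sorted(rows, key=lambda row: parse_workload(row[1]))


rows = [
    ("hf_kernels_swiglu", "cuda_T128_D1024", 0.03),
    ("hf_kernels_swiglu", "cuda_T128_D2048", 0.03),
    ("hf_kernels_swiglu", "cuda_T128_D768", 0.02),
]
for impl, wl, p50 in sort_rows(rows):
    print(f"{impl:<24} {wl:<18} {p50:>6.2f}")
```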
UV Install Logs
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 20.57it/s]

Artifacts:

activation.jsonl
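
The `.jsonl` extension suggests a JSON Lines file, with one JSON value per line. Its schema is not shown on this page, so the sketch below makes no assumption about field names: it only decodes each line and lists the top-level keys it finds. The helper names `read_jsonl` and `summarize_keys` are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one decoded JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON: {exc}") from exc


def summarize_keys(records):
    """Return the sorted union of top-level keys across dict records."""
    keys = set()
    for rec in records:
        if isinstance(rec, dict):
            keys.update(rec)
    return sorted(keys)
```

Calling `summarize_keys(read_jsonl("activation.jsonl"))` next to the artifact would be a quick way to discover which fields the benchmark recorded before writing any analysis against them.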