HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.28s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Wed Oct 29 04:12:56 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   27C    P8             22W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 32.53s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


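def hf_kernels_swiglu(input_tensor):
    # The last dimension packs both halves side by side, so the output is half as wide.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # The kernel writes into a preallocated buffer with the input's dtype and device.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    # Fused kernel: out = silu(first half) * second half.
    return activation.silu_and_mul(out, input_tensor)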


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
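
For readers unfamiliar with the operation, the sketch below spells out what a fused silu_and_mul is generally understood to compute: the last dimension is split in two, SiLU is applied to the first half, and the result is multiplied elementwise by the second half. This is a plain-Python illustration on a single row, not the kernel's implementation; the names silu and swiglu_reference are hypothetical, and the half ordering follows the vLLM convention suggested by the kernel name in the traces.

```python
import math


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v), using a numerically stable sigmoid."""
    if v >= 0:
        sig = 1.0 / (1.0 + math.exp(-v))
    else:
        e = math.exp(v)
        sig = e / (1.0 + e)
    return v * sig


def swiglu_reference(row: list[float]) -> list[float]:
    """Reference for one row: silu(first half) * second half."""
    if len(row) % 2 != 0:
        raise ValueError("last dimension must be even")
    d = len(row) // 2
    first, second = row[:d], row[d:]
    return [silu(a) * b for a, b in zip(first, second)]


if __name__ == "__main__":
    # A row of width 6 yields an output of width 3.
    print(swiglu_reference([0.0, 1.0, -2.0, 3.0, 0.5, 4.0]))
```

A tensor version would apply the same rule along the last axis of every row, which is why the benchmark function allocates an output whose last dimension is half that of the input.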
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      84.479us      2079.23%      84.479us      84.479us             1  
                                      hf_kernels_swiglu        10.30%     179.633us        99.61%       1.737ms       1.737ms       0.000us         0.00%       5.471us       5.471us             1  
                      _activation_beeaae6::silu_and_mul         1.22%      21.351us        86.54%       1.509ms     502.938us       4.063us       100.00%       5.471us       1.824us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
                                Activity Buffer Request        82.53%       1.439ms        82.53%       1.439ms       1.439ms       1.408us        34.65%       1.408us       1.408us             1  
                                            aten::empty         2.76%      48.131us         2.76%      48.131us      16.044us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.78%      48.541us         2.78%      48.541us      16.180us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.39%       6.861us         0.39%       6.861us       6.861us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.743ms
Self CUDA time total: 4.063us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.383us      1622.96%      64.383us      64.383us             1  
                                      hf_kernels_swiglu         5.77%      91.273us        99.69%       1.576ms       1.576ms       0.000us         0.00%       5.311us       5.311us             1  
                      _activation_beeaae6::silu_and_mul         1.42%      22.508us        92.74%       1.466ms     488.714us       3.967us       100.00%       5.311us       1.770us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.967us       100.00%       3.967us       1.322us             3  
                                Activity Buffer Request        89.71%       1.418ms        89.71%       1.418ms       1.418ms       1.344us        33.88%       1.344us       1.344us             1  
                                            aten::empty         1.18%      18.580us         1.18%      18.580us       6.193us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.61%      25.442us         1.61%      25.442us       8.481us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       4.900us         0.31%       4.900us       4.900us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.581ms
Self CUDA time total: 3.967us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.375us      1326.60%      65.375us      65.375us             1  
                                      hf_kernels_swiglu         5.63%      88.392us        99.68%       1.565ms       1.565ms       0.000us         0.00%       6.592us       6.592us             1  
                      _activation_beeaae6::silu_and_mul         1.42%      22.341us        92.82%       1.457ms     485.598us       4.928us       100.00%       6.592us       2.197us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.928us       100.00%       4.928us       1.643us             3  
                                Activity Buffer Request        89.75%       1.409ms        89.75%       1.409ms       1.409ms       1.664us        33.77%       1.664us       1.664us             1  
                                            aten::empty         1.23%      19.370us         1.23%      19.370us       6.457us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.64%      25.701us         1.64%      25.701us       8.567us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.32%       5.010us         0.32%       5.010us       5.010us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.570ms
Self CUDA time total: 4.928us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.864us      1618.05%      68.864us      68.864us             1  
                                      hf_kernels_swiglu         5.06%      90.612us        99.72%       1.787ms       1.787ms       0.000us         0.00%       5.696us       5.696us             1  
                      _activation_beeaae6::silu_and_mul         1.27%      22.842us        93.53%       1.676ms     558.683us       4.256us       100.00%       5.696us       1.899us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.256us       100.00%       4.256us       1.419us             3  
                                Activity Buffer Request        78.82%       1.412ms        78.82%       1.412ms       1.412ms       1.440us        33.83%       1.440us       1.440us             1  
                                            aten::empty         1.13%      20.320us         1.13%      20.320us       6.773us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        13.43%     240.735us        13.43%     240.735us      80.245us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.28%       5.081us         0.28%       5.081us       5.081us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.792ms
Self CUDA time total: 4.256us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      70.014us      1176.71%      70.014us      70.014us             1  
                                      hf_kernels_swiglu         5.43%      92.861us        99.73%       1.704ms       1.704ms       0.000us         0.00%       7.933us       7.933us             1  
                      _activation_beeaae6::silu_and_mul         1.32%      22.490us        93.06%       1.590ms     530.025us       5.950us       100.00%       7.933us       2.644us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.950us       100.00%       5.950us       1.983us             3  
                                Activity Buffer Request        82.71%       1.413ms        82.71%       1.413ms       1.413ms       1.983us        33.33%       1.983us       1.983us             1  
                                            aten::empty         1.24%      21.111us         1.24%      21.111us       7.037us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.03%     154.323us         9.03%     154.323us      51.441us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.27%       4.600us         0.27%       4.600us       4.600us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.709ms
Self CUDA time total: 5.950us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      71.711us       918.31%      71.711us      71.711us             1  
                                      hf_kernels_swiglu        20.20%      91.983us        98.97%     450.570us     450.570us       0.000us         0.00%      10.402us      10.402us             1  
                      _activation_beeaae6::silu_and_mul         4.90%      22.310us        74.58%     339.547us     113.182us       7.809us       100.00%      10.402us       3.467us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.809us       100.00%       7.809us       2.603us             3  
                                Activity Buffer Request        36.02%     164.004us        36.02%     164.004us     164.004us       2.593us        33.21%       2.593us       2.593us             1  
                                            aten::empty         4.18%      19.040us         4.18%      19.040us       6.347us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.66%     153.233us        33.66%     153.233us      51.078us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.03%       4.690us         1.03%       4.690us       4.690us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 455.260us
Self CUDA time total: 7.809us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.446us       968.24%      64.446us      64.446us             1  
                                      hf_kernels_swiglu        19.89%      86.491us        98.92%     430.210us     430.210us       0.000us         0.00%       8.897us       8.897us             1  
                      _activation_beeaae6::silu_and_mul         5.08%      22.091us        74.70%     324.868us     108.289us       6.656us       100.00%       8.897us       2.966us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.656us       100.00%       6.656us       2.219us             3  
                                Activity Buffer Request        34.88%     151.694us        34.88%     151.694us     151.694us       2.241us        33.67%       2.241us       2.241us             1  
                                            aten::empty         4.33%      18.851us         4.33%      18.851us       6.284us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.74%     151.083us        34.74%     151.083us      50.361us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.08%       4.700us         1.08%       4.700us       4.700us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 434.910us
Self CUDA time total: 6.656us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      69.471us       735.92%      69.471us      69.471us             1  
                                      hf_kernels_swiglu         5.54%      94.743us        99.69%       1.705ms       1.705ms       0.000us         0.00%      12.608us      12.608us             1  
                      _activation_beeaae6::silu_and_mul         1.25%      21.451us        93.03%       1.592ms     530.512us       9.440us       100.00%      12.608us       4.203us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.440us       100.00%       9.440us       3.147us             3  
                                Activity Buffer Request        82.96%       1.419ms        82.96%       1.419ms       1.419ms       3.168us        33.56%       3.168us       3.168us             1  
                                            aten::empty         1.12%      19.220us         1.12%      19.220us       6.407us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         8.81%     150.793us         8.81%     150.793us      50.264us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.230us         0.31%       5.230us       5.230us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.711ms
Self CUDA time total: 9.440us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.606us       520.41%      68.606us      68.606us             1  
                                      hf_kernels_swiglu        20.98%      86.561us        98.91%     408.129us     408.129us       0.000us         0.00%      17.599us      17.599us             1  
                      _activation_beeaae6::silu_and_mul         5.52%      22.769us        73.39%     302.816us     100.939us      13.183us       100.00%      17.599us       5.866us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.183us       100.00%      13.183us       4.394us             3  
                                Activity Buffer Request        29.84%     123.113us        29.84%     123.113us     123.113us       4.416us        33.50%       4.416us       4.416us             1  
                                            aten::empty         4.54%      18.752us         4.54%      18.752us       6.251us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        38.03%     156.934us        38.03%     156.934us      52.311us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.09%       4.500us         1.09%       4.500us       4.500us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 412.629us
Self CUDA time total: 13.183us


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.03  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
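
The p50 column rounds to 0.03 ms for every workload, while the profile traces report per-call kernel times of only a few microseconds, so the kernel rows are the more informative view of how cost scales with shape. The sketch below turns those per-call averages into a rough effective-bandwidth figure. The timings are copied from the "CUDA time avg" column of the vllm::act_and_mul_kernel rows above, and the 2-byte element size follows from the BFloat16 template argument shown there. The tensor shapes are an assumption: the workload name cuda_T{T}_D{D} is read as an input of shape (T, 2*D) and an output of shape (T, D), which this page does not confirm. The helper names are hypothetical.

```python
import re

# Average per-call GPU time in microseconds for vllm::act_and_mul_kernel,
# copied from the "CUDA time avg" column of the profile traces.
KERNEL_AVG_US = {
    "cuda_T128_D768": 1.354,
    "cuda_T128_D1024": 1.322,
    "cuda_T128_D2048": 1.643,
    "cuda_T256_D768": 1.419,
    "cuda_T256_D1024": 1.983,
    "cuda_T256_D2048": 2.603,
    "cuda_T512_D768": 2.219,
    "cuda_T512_D1024": 3.147,
    "cuda_T512_D2048": 4.394,
}

BYTES_PER_ELEM = 2  # bfloat16


def parse_workload(name: str) -> tuple[int, int]:
    """Extract (T, D) from a workload name such as 'cuda_T128_D768'."""
    m = re.fullmatch(r"cuda_T(\d+)_D(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    return int(m.group(1)), int(m.group(2))


def effective_gb_per_s(name: str, avg_us: float) -> float:
    """Bytes read plus bytes written, divided by kernel time, in GB/s."""
    t, d = parse_workload(name)
    # ASSUMPTION: input is (T, 2*D) and output is (T, D).
    bytes_moved = (2 * t * d + t * d) * BYTES_PER_ELEM
    return bytes_moved / (avg_us * 1e-6) / 1e9


if __name__ == "__main__":
    for name, us in KERNEL_AVG_US.items():
        print(f"{name:<18} {us:6.3f} us  {effective_gb_per_s(name, us):8.1f} GB/s")
```

If a computed figure exceeds the GPU's peak DRAM bandwidth, the likely explanations are that the working set is being served from cache or that the shape assumption is wrong; either way the number should be treated as a rough guide rather than a measurement.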

Artifacts:

activation.jsonl