HF Kernels - SwiGLU Activation

GPU Info

Cell: nv | 0.29s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Dec 19 19:54:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   35C    P0            120W /  350W |       0MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
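When only a few fields of the banner above are needed (for example, to tag benchmark results with the GPU name), `nvidia-smi` can emit them as CSV instead. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical and not part of this page's tooling.

```python
import shutil
import subprocess


def parse_gpu_csv(line):
    """Parse one 'name, driver_version, memory.total' CSV row from nvidia-smi."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    return {"name": name, "driver": driver, "memory_total": memory}


def query_gpus():
    """Return one dict per visible GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_gpu_csv(row) for row in result.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Returning an empty list instead of raising keeps the helper usable on machines without an NVIDIA driver.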

SwiGLU Benchmark

Cell: benchmark | 8.35s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel from the Hugging Face Hub
activation = get_kernel("kernels-community/activation")


def hf_kernels_swiglu(input_tensor):
    # The last dimension holds two halves of size hidden_dim each, so the
    # output keeps every leading dimension and halves the last one.
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # silu_and_mul writes its result into a preallocated output tensor.
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)


run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)
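For readers without a GPU, the wrapper above can be read against a plain-Python reference. This is a sketch of what `silu_and_mul` is assumed to compute, following the vLLM convention that the first half of the last dimension is the gate: `out = silu(x[..., :d]) * x[..., d:]`. The half ordering is an assumption taken from that convention, not from this kernel's source, and `swiglu_reference` is a hypothetical helper name.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_reference(row):
    """Reference for one row: silu(first half) * second half, elementwise.

    ASSUMPTION: the gate values occupy the first half of the row, as in
    vLLM's silu_and_mul; this is not read from the kernel source.
    """
    d = len(row) // 2
    gate, up = row[:d], row[d:2 * d]
    return [silu(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    # A row of length 4 produces an output of length 2.
    print(swiglu_reference([0.0, 1.0, 2.0, 3.0]))
```

With PyTorch, the same assumed computation is `torch.nn.functional.silu(x[..., :d]) * x[..., d:]`, which can serve as a correctness check for the fused kernel.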
Running activation benchmark on cuda with 9 workloads.

======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      85.600us      2073.64%      85.600us      85.600us             1  
                                      hf_kernels_swiglu         8.76%     183.666us        99.29%       2.081ms       2.081ms       0.000us         0.00%       5.568us       5.568us             1  
                      _activation_23bf3fb::silu_and_mul         0.98%      20.570us        88.50%       1.855ms     618.341us       4.128us       100.00%       5.568us       1.856us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.128us       100.00%       4.128us       1.376us             3  
                                Activity Buffer Request        85.39%       1.790ms        85.39%       1.790ms       1.790ms       1.440us        34.88%       1.440us       1.440us             1  
                                            aten::empty         2.03%      42.471us         2.03%      42.471us      14.157us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         2.13%      44.611us         2.13%      44.611us      14.870us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.71%      14.820us         0.71%      14.820us      14.820us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.096ms
Self CUDA time total: 4.128us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      66.111us      1666.52%      66.111us      66.111us             1  
                                      hf_kernels_swiglu         4.94%      94.004us        99.69%       1.897ms       1.897ms       0.000us         0.00%       5.311us       5.311us             1  
                      _activation_23bf3fb::silu_and_mul         0.99%      18.841us        93.73%       1.783ms     594.417us       3.967us       100.00%       5.311us       1.770us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.967us       100.00%       3.967us       1.322us             3  
                                Activity Buffer Request        91.36%       1.738ms        91.36%       1.738ms       1.738ms       1.344us        33.88%       1.344us       1.344us             1  
                                            aten::empty         1.01%      19.260us         1.01%      19.260us       6.420us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.38%      26.230us         1.38%      26.230us       8.743us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.31%       5.950us         0.31%       5.950us       5.950us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.902ms
Self CUDA time total: 3.967us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      68.479us      1380.35%      68.479us      68.479us             1  
                                      hf_kernels_swiglu         4.69%      88.684us        99.71%       1.886ms       1.886ms       0.000us         0.00%       6.625us       6.625us             1  
                      _activation_23bf3fb::silu_and_mul         0.99%      18.661us        94.04%       1.778ms     592.827us       4.961us       100.00%       6.625us       2.208us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.961us       100.00%       4.961us       1.654us             3  
                                Activity Buffer Request        91.53%       1.731ms        91.53%       1.731ms       1.731ms       1.664us        33.54%       1.664us       1.664us             1  
                                            aten::empty         0.98%      18.610us         0.98%      18.610us       6.203us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         1.52%      28.800us         1.52%      28.800us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.29%       5.500us         0.29%       5.500us       5.500us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.891ms
Self CUDA time total: 4.961us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      66.368us      1547.76%      66.368us      66.368us             1  
                                      hf_kernels_swiglu         4.25%      87.402us        99.76%       2.051ms       2.051ms       0.000us         0.00%       5.760us       5.760us             1  
                      _activation_23bf3fb::silu_and_mul         0.97%      19.981us        94.58%       1.945ms     648.228us       4.288us       100.00%       5.760us       1.920us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us       100.00%       4.288us       1.429us             3  
                                Activity Buffer Request        83.83%       1.724ms        83.83%       1.724ms       1.724ms       1.472us        34.33%       1.472us       1.472us             1  
                                            aten::empty         0.93%      19.111us         0.93%      19.111us       6.370us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         9.77%     200.885us         9.77%     200.885us      66.962us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.24%       5.020us         0.24%       5.020us       5.020us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.056ms
Self CUDA time total: 4.288us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.360us      1131.72%      67.360us      67.360us             1  
                                      hf_kernels_swiglu         4.31%      89.293us        99.77%       2.067ms       2.067ms       0.000us         0.00%       7.968us       7.968us             1  
                      _activation_23bf3fb::silu_and_mul         0.98%      20.220us        94.55%       1.959ms     652.859us       5.952us       100.00%       7.968us       2.656us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.952us       100.00%       5.952us       1.984us             3  
                                Activity Buffer Request        85.78%       1.777ms        85.78%       1.777ms       1.777ms       2.016us        33.87%       2.016us       2.016us             1  
                                            aten::empty         0.91%      18.861us         0.91%      18.861us       6.287us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.79%     161.464us         7.79%     161.464us      53.821us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.23%       4.820us         0.23%       4.820us       4.820us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.072ms
Self CUDA time total: 5.952us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.574us       830.43%      64.574us      64.574us             1  
                                      hf_kernels_swiglu        18.42%      86.111us        98.86%     462.073us     462.073us       0.000us         0.00%      10.367us      10.367us             1  
                      _activation_23bf3fb::silu_and_mul         4.27%      19.980us        76.48%     357.451us     119.150us       7.776us       100.00%      10.367us       3.456us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.776us       100.00%       7.776us       2.592us             3  
                                Activity Buffer Request        38.90%     181.805us        38.90%     181.805us     181.805us       2.591us        33.32%       2.591us       2.591us             1  
                                            aten::empty         3.96%      18.511us         3.96%      18.511us       6.170us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        33.30%     155.666us        33.30%     155.666us      51.889us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.14%       5.330us         1.14%       5.330us       5.330us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 467.403us
Self CUDA time total: 7.776us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      62.527us       943.95%      62.527us      62.527us             1  
                                      hf_kernels_swiglu        18.86%      83.092us        98.85%     435.523us     435.523us       0.000us         0.00%       8.832us       8.832us             1  
                      _activation_23bf3fb::silu_and_mul         4.63%      20.380us        75.83%     334.080us     111.360us       6.624us       100.00%       8.832us       2.944us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.624us       100.00%       6.624us       2.208us             3  
                                Activity Buffer Request        36.44%     160.555us        36.44%     160.555us     160.555us       2.208us        33.33%       2.208us       2.208us             1  
                                            aten::empty         4.17%      18.351us         4.17%      18.351us       6.117us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        34.76%     153.145us        34.76%     153.145us      51.048us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.15%       5.060us         1.15%       5.060us       5.060us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 440.583us
Self CUDA time total: 6.624us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      69.184us       732.88%      69.184us      69.184us             1  
                                      hf_kernels_swiglu         4.54%      90.562us        99.76%       1.988ms       1.988ms       0.000us         0.00%      12.608us      12.608us             1  
                      _activation_23bf3fb::silu_and_mul         1.02%      20.260us        94.19%       1.877ms     625.705us       9.440us       100.00%      12.608us       4.203us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.440us       100.00%       9.440us       3.147us             3  
                                Activity Buffer Request        85.41%       1.702ms        85.41%       1.702ms       1.702ms       3.168us        33.56%       3.168us       3.168us             1  
                                            aten::empty         1.03%      20.450us         1.03%      20.450us       6.817us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel         7.76%     154.666us         7.76%     154.666us      51.555us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         0.24%       4.870us         0.24%       4.870us       4.870us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.993ms
Self CUDA time total: 9.440us



======================================================================
PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      65.376us       499.51%      65.376us      65.376us             1  
                                      hf_kernels_swiglu        19.52%      83.334us        98.75%     421.512us     421.512us       0.000us         0.00%      17.472us      17.472us             1  
                      _activation_23bf3fb::silu_and_mul         4.53%      19.340us        74.78%     319.198us     106.399us      13.088us       100.00%      17.472us       5.824us             3  
void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.088us       100.00%      13.088us       4.363us             3  
                                Activity Buffer Request        34.31%     146.444us        34.31%     146.444us     146.444us       4.384us        33.50%       4.384us       4.384us             1  
                                            aten::empty         4.45%      18.980us         4.45%      18.980us       6.327us       0.000us         0.00%       0.000us       0.000us             3  
                                       cudaLaunchKernel        35.94%     153.414us        35.94%     153.414us      51.138us       0.000us         0.00%       0.000us       0.000us             3  
                                  cudaDeviceSynchronize         1.25%       5.350us         1.25%       5.350us       5.350us       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 426.862us
Self CUDA time total: 13.088us
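The per-call kernel times in the traces above can be turned into a rough effective-bandwidth figure. This sketch assumes each workload `cuda_T{T}_D{D}` uses a bf16 input of shape `(T, 2*D)` and a bf16 output of shape `(T, D)`. The bf16 dtype is visible in the kernel name, but the shapes are an assumption about the benchmark harness. Times are copied from the `CUDA time avg` column of the `act_and_mul_kernel` rows.

```python
BYTES_PER_BF16 = 2

# (tokens T, hidden dim D, average kernel time in microseconds), copied from
# the "CUDA time avg" column of the act_and_mul_kernel rows in the traces.
KERNEL_AVG_US = [
    (128, 768, 1.376), (128, 1024, 1.322), (128, 2048, 1.654),
    (256, 768, 1.429), (256, 1024, 1.984), (256, 2048, 2.592),
    (512, 768, 2.208), (512, 1024, 3.147), (512, 2048, 4.363),
]


def bytes_moved(tokens, hidden_dim):
    """Bytes read plus written, ASSUMING input (T, 2*D) and output (T, D) in bf16."""
    read = tokens * 2 * hidden_dim * BYTES_PER_BF16
    written = tokens * hidden_dim * BYTES_PER_BF16
    return read + written


def effective_gbps(tokens, hidden_dim, avg_us):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved(tokens, hidden_dim) / (avg_us * 1e-6) / 1e9


if __name__ == "__main__":
    for tokens, hidden_dim, avg_us in KERNEL_AVG_US:
        gbps = effective_gbps(tokens, hidden_dim, avg_us)
        print(f"T={tokens:<4} D={hidden_dim:<5} {gbps:8.1f} GB/s")
```

Working sets this small may be served largely from on-chip cache, so these figures are best read as a relative trend across workloads, not as a measure of DRAM throughput.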


impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True
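The p50 column stays near 0.03 ms for every workload, while the per-call kernel time in the profile traces grows from roughly 1.3 us to 4.4 us. A small script can make that comparison explicit. This sketch parses a few rows of the summary and divides the kernel time by p50; it assumes p50 is end-to-end time per call as measured by the harness, which this page does not state, and p50 is rounded to two decimals, so the ratios are coarse.

```python
import re

# Three rows copied from the summary table; extend with the remaining rows as needed.
SUMMARY = """\
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
"""

# Average device time per call in microseconds, from the profile traces.
KERNEL_AVG_US = {(128, 1024): 1.322, (128, 768): 1.376, (512, 2048): 4.363}

ROW = re.compile(r"^(\S+)\s+cuda_T(\d+)_D(\d+)\s+([\d.]+)\s+(True|False)$")


def parse_rows(text):
    """Yield (impl, tokens, hidden_dim, p50_ms, ok) for each summary row."""
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            impl, tokens, dim, p50, ok = match.groups()
            yield impl, int(tokens), int(dim), float(p50), ok == "True"


if __name__ == "__main__":
    for impl, tokens, dim, p50_ms, ok in parse_rows(SUMMARY):
        share = KERNEL_AVG_US[(tokens, dim)] / (p50_ms * 1000.0)
        print(f"{impl} T={tokens} D={dim}: kernel is about {share:.0%} of p50")
```

If the assumption about p50 holds, a small share would suggest that fixed per-call overhead, not the kernel itself, dominates at these sizes.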

Artifacts:

activation.jsonl
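The artifact is a JSON Lines file, so it can be inspected with the standard library alone. The schema of `activation.jsonl` is not shown on this page, so this sketch makes no assumption about field names and only lists the top-level keys it finds; `read_jsonl` and `summarize_keys` are hypothetical helper names.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the records of a JSON Lines file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def summarize_keys(records):
    """Collect the sorted set of top-level keys seen across dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

A typical first look would be `records = read_jsonl("activation.jsonl")` followed by `print(len(records), summarize_keys(records))`.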