Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.25s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Dec 19 22:48:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   30C    P0            107W /  350W |       0MiB /  46068MiB |     68%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

LayerNorm Benchmark (PyTorch)

Cell: benchmark | 7.61s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only, using PyTorch's built-in layer_norm.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.61%     151.022us        49.72%       2.081ms       2.081ms       0.000us         0.00%       3.037ms       3.037ms             1  
                                       aten::layer_norm         0.35%      14.701us        46.11%       1.930ms     643.468us       0.000us         0.00%       3.037ms       1.012ms             3  
                                aten::native_layer_norm         1.79%      75.131us        45.76%       1.916ms     638.567us       2.326ms       100.00%       3.037ms       1.012ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.327ms       100.06%       2.327ms       2.327ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.326ms       100.00%       2.326ms     775.187us             3  
                                Activity Buffer Request        41.50%       1.738ms        41.50%       1.738ms       1.738ms     711.774us        30.61%     711.774us     711.774us             1  
                                            aten::empty         1.17%      48.860us         1.17%      48.860us       5.429us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.12%      46.753us         1.12%      46.753us      15.584us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.18%       7.441us         0.18%       7.441us       1.240us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        50.28%       2.105ms        50.28%       2.105ms       2.105ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 4.186ms
Self CUDA time total: 2.326ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.05%      69.561us        28.39%       1.886ms       1.886ms       0.000us         0.00%       6.477ms       6.477ms             1  
                                       aten::layer_norm         0.13%       8.670us        27.34%       1.816ms     605.463us       0.000us         0.00%       6.477ms       2.159ms             3  
                                aten::native_layer_norm         0.77%      50.957us        27.21%       1.808ms     602.573us       4.891ms       100.00%       6.477ms       2.159ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.893ms       100.03%       4.893ms       4.893ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.891ms       100.00%       4.891ms       1.630ms             3  
                                Activity Buffer Request        25.53%       1.696ms        25.53%       1.696ms       1.696ms       1.586ms        32.42%       1.586ms       1.586ms             1  
                                            aten::empty         0.45%      29.753us         0.45%      29.753us       3.306us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.41%      27.542us         0.41%      27.542us       9.181us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.05%       3.522us         0.05%       3.522us       0.587us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        71.61%       4.758ms        71.61%       4.758ms       4.758ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.643ms
Self CUDA time total: 4.891ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.06%      68.562us        29.18%       1.889ms       1.889ms       0.000us         0.00%       6.234ms       6.234ms             1  
                                       aten::layer_norm         0.14%       9.330us        28.12%       1.821ms     606.966us       0.000us         0.00%       6.234ms       2.078ms             3  
                                aten::native_layer_norm         0.78%      50.590us        27.97%       1.812ms     603.856us       4.719ms       100.00%       6.234ms       2.078ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.721ms       100.03%       4.721ms       4.721ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.719ms       100.00%       4.719ms       1.573ms             3  
                                Activity Buffer Request        26.26%       1.700ms        26.26%       1.700ms       1.700ms       1.515ms        32.11%       1.515ms       1.515ms             1  
                                            aten::empty         0.44%      28.660us         0.44%      28.660us       3.184us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.43%      28.042us         0.43%      28.042us       9.347us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.840us         0.06%       3.840us       0.640us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        70.82%       4.586ms        70.82%       4.586ms       4.586ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.476ms
Self CUDA time total: 4.719ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.64%      72.823us        14.96%       1.710ms       1.710ms       0.000us         0.00%      13.144ms      13.144ms             1  
                                       aten::layer_norm         0.08%       8.940us        14.32%       1.637ms     545.678us       0.000us         0.00%      13.144ms       4.381ms             3  
                                aten::native_layer_norm         0.49%      56.431us        14.24%       1.628ms     542.698us       9.871ms       100.00%      13.144ms       4.381ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.872ms       100.02%       9.872ms       9.872ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.871ms       100.00%       9.871ms       3.290ms             3  
                                Activity Buffer Request        11.76%       1.344ms        11.76%       1.344ms       1.344ms       3.273ms        33.16%       3.273ms       3.273ms             1  
                                            aten::empty         0.26%      29.920us         0.26%      29.920us       3.324us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.69%     193.294us         1.69%     193.294us      64.431us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.04%       4.390us         0.04%       4.390us       0.732us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        85.04%       9.722ms        85.04%       9.722ms       9.722ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 11.432ms
Self CUDA time total: 9.871ms


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.82  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.32  True
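
The p50 latencies above can be turned into a rough effective memory bandwidth, which is often a more useful figure for a memory-bound op like LayerNorm. The sketch below is an illustration only and rests on two assumptions that the benchmark output does not confirm: that the workload names encode (batch, sequence, hidden) sizes, and that each element occupies 2 bytes (for example bfloat16). It counts one read of the input and one write of the output, ignoring weight, bias, and any saved statistics.

```python
# Shapes are inferred from the workload names (ASSUMPTION: B=batch, S=sequence, D=hidden).
# p50 latencies in milliseconds are copied from the summary table.
RESULTS = {
    "LN_B16_S2048_D4096": ((16, 2048, 4096), 0.82),
    "LN_B16_S2048_D8192": ((16, 2048, 8192), 1.68),
    "LN_B16_S4096_D4096": ((16, 4096, 4096), 1.61),
    "LN_B16_S4096_D8192": ((16, 4096, 8192), 3.32),
}

# ASSUMPTION: 2 bytes per element (e.g. bfloat16); the output does not state the dtype.
BYTES_PER_ELEMENT = 2


def effective_bandwidth_gbs(shape, p50_ms, bytes_per_element=BYTES_PER_ELEMENT):
    """Approximate GB/s for one input read plus one output write."""
    batch, seq, dim = shape
    num_elements = batch * seq * dim
    bytes_moved = 2 * num_elements * bytes_per_element  # read x, write y
    seconds = p50_ms * 1e-3
    return bytes_moved / seconds / 1e9


for name, (shape, p50_ms) in RESULTS.items():
    print(f"{name}: {effective_bandwidth_gbs(shape, p50_ms):.0f} GB/s")
```

If the benchmark actually runs in float32, pass `bytes_per_element=4` and the estimates double.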

Artifacts:

layer_norm.jsonl
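
For reference, the operation benchmarked above can be written out in plain tensor ops. This is a minimal sketch for checking numerics on small CPU tensors, not the kernel PyTorch dispatches to; `reference_layer_norm` and the toy shapes are hypothetical names and values chosen for illustration.

```python
import torch
import torch.nn.functional as F


def reference_layer_norm(x, weight, bias, eps: float = 1e-5):
    """Unfused LayerNorm over the last dimension (illustrative helper)."""
    mean = x.mean(dim=-1, keepdim=True)
    # LayerNorm uses the biased (population) variance, hence correction=0.
    var = x.var(dim=-1, correction=0, keepdim=True)
    return (x - mean) * torch.rsqrt(var + eps) * weight + bias


torch.manual_seed(0)
x = torch.randn(2, 8, 64)   # small toy shape, not one of the benchmark workloads
weight = torch.randn(64)
bias = torch.randn(64)

expected = F.layer_norm(x, (x.shape[-1],), weight, bias, 1e-5)
actual = reference_layer_norm(x, weight, bias)
max_abs_err = (actual - expected).abs().max().item()
print(f"max abs error vs F.layer_norm: {max_abs_err:.2e}")
```

The unfused version launches several kernels and materializes intermediates, so it is useful as a correctness oracle rather than as a performance baseline.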