Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.30s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Dec 19 19:40:36 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   26C    P8             24W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
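When only a few fields from the table above are needed (for example, to log them next to benchmark results), `nvidia-smi` can emit them as CSV through its `--query-gpu` and `--format` options. The sketch below is one possible way to do that; the field selection and the helper names `parse_gpu_csv` and `query_gpus` are illustrative assumptions, not part of the benchmark tooling.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option; chosen here for illustration.
FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text: str) -> list[dict[str, str]]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus() -> list[dict[str, str]]:
    """Run nvidia-smi and return the parsed fields (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Sample line shaped like the GPU reported in the table above.
sample = "NVIDIA L40S, 580.105.08, 46068\n"
print(parse_gpu_csv(sample))
```

On a machine with a GPU, `query_gpus()` would replace the hard-coded sample; the parser is kept separate so it can be checked without hardware.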

LayerNorm Benchmark (PyTorch)
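For orientation, the operation being timed normalizes each row over its last dimension using the biased variance, then applies an elementwise scale and shift. The pure-Python sketch below shows that arithmetic for a single row; `layer_norm_row` is a hypothetical helper written for illustration and is not how the CUDA kernel is implemented.

```python
import math


def layer_norm_row(x, weight, bias, eps=1e-5):
    """Normalize one row over its last dimension, then apply scale and shift."""
    n = len(x)
    mean = sum(x) / n
    # Biased variance: divide by n, not n - 1.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(x, weight, bias)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm_row(row, weight=[1.0] * 4, bias=[0.0] * 4)
print(out)
```

With unit weight and zero bias, the output row has mean close to 0 and variance close to 1; `eps` keeps the division stable when a row is nearly constant.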

Cell: benchmark | 32.13s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only; weight and bias, when given,
    # must have shape (x.shape[-1],).
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         2.46%     151.464us        66.01%       4.061ms       4.061ms       0.000us         0.00%       3.020ms       3.020ms             1  
                                       aten::layer_norm         0.24%      14.681us        63.55%       3.910ms       1.303ms       0.000us         0.00%       3.020ms       1.007ms             3  
                                aten::native_layer_norm        20.97%       1.290ms        63.31%       3.895ms       1.298ms       2.310ms       100.00%       3.020ms       1.007ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.311ms       100.06%       2.311ms       2.311ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.310ms       100.00%       2.310ms     770.057us             3  
                                Activity Buffer Request        40.34%       2.482ms        40.34%       2.482ms       2.482ms     709.854us        30.73%     709.854us     709.854us             1  
                                            aten::empty         1.09%      66.873us         1.09%      66.873us       7.430us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.79%      48.731us         0.79%      48.731us      16.244us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.12%       7.460us         0.12%       7.460us       1.243us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        33.99%       2.091ms        33.99%       2.091ms       2.091ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.152ms
Self CUDA time total: 2.310ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.07%      70.812us        28.19%       1.857ms       1.857ms       0.000us         0.00%       6.442ms       6.442ms             1  
                                       aten::layer_norm         0.14%       9.000us        27.11%       1.786ms     595.403us       0.000us         0.00%       6.442ms       2.147ms             3  
                                aten::native_layer_norm         0.75%      49.502us        26.98%       1.777ms     592.403us       4.862ms       100.00%       6.442ms       2.147ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.864ms       100.03%       4.864ms       4.864ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.862ms       100.00%       4.862ms       1.621ms             3  
                                Activity Buffer Request        25.31%       1.667ms        25.31%       1.667ms       1.667ms       1.580ms        32.49%       1.580ms       1.580ms             1  
                                            aten::empty         0.43%      28.150us         0.43%      28.150us       3.128us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.44%      28.800us         0.44%      28.800us       9.600us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.751us         0.06%       3.751us       0.625us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        71.81%       4.731ms        71.81%       4.731ms       4.731ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.588ms
Self CUDA time total: 4.862ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.08%      70.451us        29.89%       1.957ms       1.957ms       0.000us         0.00%       6.239ms       6.239ms             1  
                                       aten::layer_norm         0.13%       8.611us        28.81%       1.886ms     628.738us       0.000us         0.00%       6.239ms       2.080ms             3  
                                aten::native_layer_norm         0.76%      49.870us        28.68%       1.878ms     625.867us       4.724ms       100.00%       6.239ms       2.080ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.726ms       100.03%       4.726ms       4.726ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.724ms       100.00%       4.724ms       1.575ms             3  
                                Activity Buffer Request        26.98%       1.766ms        26.98%       1.766ms       1.766ms       1.515ms        32.08%       1.515ms       1.515ms             1  
                                            aten::empty         0.45%      29.490us         0.45%      29.490us       3.277us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.43%      27.941us         0.43%      27.941us       9.314us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       4.101us         0.06%       4.101us       0.684us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        70.11%       4.590ms        70.11%       4.590ms       4.590ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.547ms
Self CUDA time total: 4.724ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.65%      74.391us        15.11%       1.731ms       1.731ms       0.000us         0.00%      13.123ms      13.123ms             1  
                                       aten::layer_norm         0.08%       9.310us        14.46%       1.656ms     552.093us       0.000us         0.00%      13.123ms       4.374ms             3  
                                aten::native_layer_norm         0.45%      52.052us        14.38%       1.647ms     548.989us       9.864ms       100.00%      13.123ms       4.374ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.866ms       100.01%       9.866ms       9.866ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.864ms       100.00%       9.864ms       3.288ms             3  
                                Activity Buffer Request        11.61%       1.330ms        11.61%       1.330ms       1.330ms       3.258ms        33.03%       3.258ms       3.258ms             1  
                                            aten::empty         0.27%      31.120us         0.27%      31.120us       3.458us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         2.01%     229.635us         2.01%     229.635us      76.545us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.04%       4.651us         0.04%       4.651us       0.775us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        84.89%       9.721ms        84.89%       9.721ms       9.721ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 11.451ms
Self CUDA time total: 9.864ms


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.81  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.32  True
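LayerNorm is typically memory-bound, so one way to read the p50 figures is as effective memory bandwidth. The sketch below makes two assumptions that the page does not state: that the workload names encode batch (`B`), sequence length (`S`), and hidden size (`D`), and that each element occupies 2 bytes (for example bfloat16). It counts one read of the input and one write of the output and ignores the much smaller weight, bias, and statistics traffic.

```python
import re

# p50 latencies (ms) copied from the summary table above.
P50_MS = {
    "LN_B16_S2048_D4096": 0.81,
    "LN_B16_S2048_D8192": 1.68,
    "LN_B16_S4096_D4096": 1.61,
    "LN_B16_S4096_D8192": 3.32,
}

# ASSUMPTION: 2 bytes per element (e.g. bfloat16); the dtype is not stated.
BYTES_PER_ELEMENT = 2


def workload_elements(name: str) -> int:
    """Parse B, S, D from a name like LN_B16_S2048_D4096 and return B*S*D."""
    m = re.fullmatch(r"LN_B(\d+)_S(\d+)_D(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized workload name: {name}")
    b, s, d = map(int, m.groups())
    return b * s * d


def effective_bandwidth_gbs(name: str, p50_ms: float) -> float:
    """Estimated GB/s, counting one input read and one output write."""
    bytes_moved = 2 * workload_elements(name) * BYTES_PER_ELEMENT
    return bytes_moved / (p50_ms * 1e-3) / 1e9


for name, ms in P50_MS.items():
    print(f"{name}: {effective_bandwidth_gbs(name, ms):.0f} GB/s")
```

If the tensors were actually 4 bytes per element, every estimate would double, so the result is only as good as the dtype assumption.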

Artifacts:

layer_norm.jsonl