Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.23s
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
Fri Oct 31 20:00:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   32C    P0             85W /  350W |       0MiB /  46068MiB |     22%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

LayerNorm Benchmark (PyTorch)

Cell: benchmark | 3.89s
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only, then apply the affine weight and bias.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         3.88%     150.743us        46.08%       1.790ms       1.790ms       0.000us         0.00%       3.031ms       3.031ms             1  
                                       aten::layer_norm         0.46%      17.882us        42.20%       1.639ms     546.344us       0.000us         0.00%       3.031ms       1.010ms             3  
                                aten::native_layer_norm         2.05%      79.451us        41.74%       1.621ms     540.384us       2.322ms       100.00%       3.031ms       1.010ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.323ms       100.06%       2.323ms       2.323ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.322ms       100.00%       2.322ms     773.873us             3  
                                Activity Buffer Request        37.13%       1.442ms        37.13%       1.442ms       1.442ms     709.660us        30.57%     709.660us     709.660us             1  
                                            aten::empty         1.23%      47.623us         1.23%      47.623us       5.291us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         1.17%      45.281us         1.17%      45.281us      15.094us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.17%       6.710us         0.17%       6.710us       1.118us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        53.92%       2.094ms        53.92%       2.094ms       2.094ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.884ms
Self CUDA time total: 2.322ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.99%     129.362us        27.22%       1.769ms       1.769ms       0.000us         0.00%       6.490ms       6.490ms             1  
                                       aten::layer_norm         0.17%      10.831us        25.23%       1.640ms     546.698us       0.000us         0.00%       6.490ms       2.163ms             3  
                                aten::native_layer_norm         0.91%      59.414us        25.06%       1.629ms     543.087us       4.900ms       100.00%       6.490ms       2.163ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.901ms       100.03%       4.901ms       4.901ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.900ms       100.00%       4.900ms       1.633ms             3  
                                Activity Buffer Request        23.14%       1.504ms        23.14%       1.504ms       1.504ms       1.590ms        32.46%       1.590ms       1.590ms             1  
                                            aten::empty         0.46%      29.779us         0.46%      29.779us       3.309us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.49%      31.860us         0.49%      31.860us      10.620us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       3.750us         0.06%       3.750us       0.625us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        72.78%       4.732ms        72.78%       4.732ms       4.732ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.501ms
Self CUDA time total: 4.900ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         1.73%     108.072us        26.73%       1.674ms       1.674ms       0.000us         0.00%       6.258ms       6.258ms             1  
                                       aten::layer_norm         0.14%       8.910us        25.01%       1.566ms     522.010us       0.000us         0.00%       6.258ms       2.086ms             3  
                                aten::native_layer_norm         0.87%      54.314us        24.86%       1.557ms     519.040us       4.736ms       100.00%       6.258ms       2.086ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.737ms       100.03%       4.737ms       4.737ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.736ms       100.00%       4.736ms       1.579ms             3  
                                Activity Buffer Request        23.05%       1.444ms        23.05%       1.444ms       1.444ms       1.522ms        32.13%       1.522ms       1.522ms             1  
                                            aten::empty         0.46%      28.531us         0.46%      28.531us       3.170us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         0.43%      26.620us         0.43%      26.620us       8.873us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.06%       4.039us         0.06%       4.039us       0.673us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        73.27%       4.589ms        73.27%       4.589ms       4.589ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 6.263ms
Self CUDA time total: 4.736ms



======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                       torch_layer_norm         0.85%     101.562us        19.08%       2.285ms       2.285ms       0.000us         0.00%      13.093ms      13.093ms             1  
                                       aten::layer_norm         0.08%       9.511us        18.23%       2.184ms     727.942us       0.000us         0.00%      13.093ms       4.364ms             3  
                                aten::native_layer_norm         0.48%      57.051us        18.15%       2.174ms     724.772us       9.846ms       100.00%      13.093ms       4.364ms             3  
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.848ms       100.01%       9.848ms       9.848ms             1  
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.846ms       100.00%       9.846ms       3.282ms             3  
                                Activity Buffer Request        11.95%       1.431ms        11.95%       1.431ms       1.431ms       3.247ms        32.97%       3.247ms       3.247ms             1  
                                            aten::empty         0.24%      29.142us         0.24%      29.142us       3.238us       0.000us         0.00%       0.000us       0.000us             9  
                                       cudaLaunchKernel         5.45%     653.217us         5.45%     653.217us     217.739us       0.000us         0.00%       0.000us       0.000us             3  
                                             aten::view         0.03%       3.890us         0.03%       3.890us       0.648us       0.000us         0.00%       0.000us       0.000us             6  
                                  cudaDeviceSynchronize        80.92%       9.693ms        80.92%       9.693ms       9.693ms       0.000us         0.00%       0.000us       0.000us             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 11.979ms
Self CUDA time total: 9.846ms


impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.82  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.33  True
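
The p50 latencies above can be turned into a rough effective-bandwidth estimate. The sketch below is only an illustration under stated assumptions: it reads the workload names as batch (B), sequence length (S), and hidden size (D), it assumes 2 bytes per element (a 16-bit dtype, which this report does not state), and it counts one read of the input plus one write of the output while ignoring weight, bias, and any saved statistics. The helper names are hypothetical and not part of kernels_benchmark_tools.

```python
import re

# p50 latencies in milliseconds, copied from the summary table above.
P50_MS = {
    "LN_B16_S2048_D4096": 0.82,
    "LN_B16_S2048_D8192": 1.68,
    "LN_B16_S4096_D4096": 1.61,
    "LN_B16_S4096_D8192": 3.33,
}

# ASSUMPTION: 16-bit elements (e.g. bf16 or fp16). The report does not state the dtype.
BYTES_PER_ELEMENT = 2


def workload_elements(name: str) -> int:
    """Number of elements in the input, assuming the name encodes B, S and D."""
    match = re.fullmatch(r"LN_B(\d+)_S(\d+)_D(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name}")
    batch, seq, hidden = (int(g) for g in match.groups())
    return batch * seq * hidden


def effective_bandwidth_gbs(name: str, p50_ms: float,
                            bytes_per_element: int = BYTES_PER_ELEMENT) -> float:
    """Bytes moved (one read of x, one write of y) divided by the p50 latency."""
    bytes_moved = 2 * workload_elements(name) * bytes_per_element
    return bytes_moved / (p50_ms * 1e-3) / 1e9


if __name__ == "__main__":
    for name, p50 in P50_MS.items():
        print(f"{name}: {effective_bandwidth_gbs(name, p50):.0f} GB/s (estimated)")
```

Independently of the dtype assumption, the table shows p50 roughly doubling whenever S or D doubles (0.82 ms to 1.61 ms or 1.68 ms, then to 3.33 ms), which is the pattern expected from a kernel whose cost is dominated by memory traffic.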

Artifacts:

layer_norm.jsonl
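
For reference, the operation benchmarked here normalizes each vector along the last dimension to zero mean and unit variance (using the biased variance), then applies an elementwise scale and shift. The pure-Python sketch below shows that computation for a single row. It is not the CUDA kernel PyTorch dispatches to, and `layer_norm_row` is a hypothetical name used only for illustration.

```python
import math


def layer_norm_row(row, weight, bias, eps=1e-5):
    """Layer-normalize one row: (x - mean) / sqrt(var + eps) * weight + bias."""
    n = len(row)
    mean = sum(row) / n
    # Biased variance (divide by n, not n - 1), matching the usual layer norm definition.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(row, weight, bias)]


if __name__ == "__main__":
    # Identity affine parameters: the output has mean ~0 and variance ~1.
    print(layer_norm_row([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

A batched input of shape (B, S, D) would apply this independently to each of the B * S rows of length D, which is why the work grows linearly in both S and D.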