Torch LayerNorm Implementation

GPU Info

Cell: nv | 0.23s

import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

Wed Oct 29 00:36:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   33C    P0            128W /  350W |       0MiB /  46068MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
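
When the GPU details need to be stored next to the benchmark results, `nvidia-smi` can also print selected fields as CSV, which is easier to parse than the table above. The following is a minimal sketch and not part of the original cell: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, it assumes `nvidia-smi` is on `PATH`, and the sample line is assembled from the values shown above rather than captured from a real run.

```python
import subprocess

# Fields requested from nvidia-smi; each is a documented --query-gpu property.
FIELDS = ["name", "memory.total", "driver_version"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative line built from the table above, not captured output.
sample = "NVIDIA L40S, 46068 MiB, 570.195.03"
print(parse_gpu_csv(sample))
```

Only `parse_gpu_csv` runs here, so the snippet works on a machine without a GPU; call `query_gpus()` on the benchmark host to get live values.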

LayerNorm Benchmark (PyTorch)

Cell: benchmark | 7.38s

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark


def torch_layer_norm(x, weight, bias, eps: float = 1e-5):
    # Normalize over the last dimension only.
    return torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)


run_benchmark(
    kernel_type=KernelTypeEnum.LAYER_NORM,
    impl_name="torch_layer_norm",
    impl_tags={"family": "torch", "op": "layer_norm"},
    impl_func=torch_layer_norm,
)

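
As an aside to the cell above, the arithmetic that layer normalization applies to each vector along the last dimension can be written out in plain Python. This is a sketch of the math only, under the assumption of a single 1-D input; the function name `layer_norm_row` is hypothetical, and the fused CUDA kernel that PyTorch actually launches is implemented differently.

```python
import math


def layer_norm_row(row, weight, bias, eps=1e-5):
    """Normalize one vector: (x - mean) / sqrt(var + eps) * weight + bias."""
    n = len(row)
    mean = sum(row) / n
    # Biased variance (divide by n, not n - 1), as layer normalization uses.
    var = sum((v - mean) ** 2 for v in row) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w + b for v, w, b in zip(row, weight, bias)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm_row(row, weight=[1.0] * 4, bias=[0.0] * 4)
print([round(v, 4) for v in out])
```

With unit weights and zero bias the result has approximately zero mean and unit variance; `weight` and `bias` then rescale and shift each position independently.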
Running layer_norm benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       torch_layer_norm         3.94%     153.226us        45.99%       1.787ms       1.787ms       0.000us         0.00%       3.036ms       3.036ms             1
                                       aten::layer_norm         0.41%      15.819us        42.05%       1.634ms     544.665us       0.000us         0.00%       3.036ms       1.012ms             3
                                aten::native_layer_norm         2.10%      81.554us        41.64%       1.618ms     539.392us       2.323ms       100.00%       3.036ms       1.012ms             3
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       2.325ms       100.06%       2.325ms       2.325ms             1
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       2.323ms       100.00%       2.323ms     774.498us             3
                                Activity Buffer Request        36.88%       1.433ms        36.88%       1.433ms       1.433ms     712.322us        30.66%     712.322us     712.322us             1
                                            aten::empty         1.28%      49.611us         1.28%      49.611us       5.512us       0.000us         0.00%       0.000us       0.000us             9
                                       cudaLaunchKernel         1.19%      46.322us         1.19%      46.322us      15.441us       0.000us         0.00%       0.000us       0.000us             3
                                             aten::view         0.19%       7.380us         0.19%       7.380us       1.230us       0.000us         0.00%       0.000us       0.000us             6
                                  cudaDeviceSynchronize        54.01%       2.099ms        54.01%       2.099ms       2.099ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 3.886ms
Self CUDA time total: 2.323ms

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       torch_layer_norm         1.13%      72.543us        25.40%       1.627ms       1.627ms       0.000us         0.00%       6.533ms       6.533ms             1
                                       aten::layer_norm         0.14%       8.900us        24.27%       1.554ms     518.074us       0.000us         0.00%       6.533ms       2.178ms             3
                                aten::native_layer_norm         0.84%      53.651us        24.13%       1.545ms     515.108us       4.915ms       100.00%       6.533ms       2.178ms             3
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.917ms       100.03%       4.917ms       4.917ms             1
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.915ms       100.00%       4.915ms       1.638ms             3
                                Activity Buffer Request        22.32%       1.430ms        22.32%       1.430ms       1.430ms       1.618ms        32.92%       1.618ms       1.618ms             1
                                            aten::empty         0.44%      28.460us         0.44%      28.460us       3.162us       0.000us         0.00%       0.000us       0.000us             9
                                       cudaLaunchKernel         0.46%      29.343us         0.46%      29.343us       9.781us       0.000us         0.00%       0.000us       0.000us             3
                                             aten::view         0.07%       4.330us         0.07%       4.330us       0.722us       0.000us         0.00%       0.000us       0.000us             6
                                  cudaDeviceSynchronize        74.60%       4.777ms        74.60%       4.777ms       4.777ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.403ms
Self CUDA time total: 4.915ms

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       torch_layer_norm         1.16%      72.353us        26.06%       1.624ms       1.624ms       0.000us         0.00%       6.259ms       6.259ms             1
                                       aten::layer_norm         0.14%       8.650us        24.90%       1.551ms     517.051us       0.000us         0.00%       6.259ms       2.086ms             3
                                aten::native_layer_norm         0.85%      52.692us        24.76%       1.543ms     514.168us       4.742ms       100.00%       6.259ms       2.086ms             3
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       4.744ms       100.03%       4.744ms       4.744ms             1
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       4.742ms       100.00%       4.742ms       1.581ms             3
                                Activity Buffer Request        22.91%       1.427ms        22.91%       1.427ms       1.427ms       1.517ms        31.99%       1.517ms       1.517ms             1
                                            aten::empty         0.47%      29.452us         0.47%      29.452us       3.272us       0.000us         0.00%       0.000us       0.000us             9
                                       cudaLaunchKernel         0.47%      29.331us         0.47%      29.331us       9.777us       0.000us         0.00%       0.000us       0.000us             3
                                             aten::view         0.06%       4.009us         0.06%       4.009us       0.668us       0.000us         0.00%       0.000us       0.000us             6
                                  cudaDeviceSynchronize        73.94%       4.606ms        73.94%       4.606ms       4.606ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.229ms
Self CUDA time total: 4.742ms

======================================================================
PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       torch_layer_norm         0.67%      74.863us        13.13%       1.463ms       1.463ms       0.000us         0.00%      13.036ms      13.036ms             1
                                       aten::layer_norm         0.09%       9.640us        12.46%       1.388ms     462.622us       0.000us         0.00%      13.036ms       4.345ms             3
                                aten::native_layer_norm         0.46%      51.640us        12.37%       1.378ms     459.409us       9.812ms       100.00%      13.036ms       4.345ms             3
                                       torch_layer_norm         0.00%       0.000us         0.00%       0.000us       0.000us       9.814ms       100.01%       9.814ms       9.814ms             1
void at::native::(anonymous namespace)::vectorized_l...         0.00%       0.000us         0.00%       0.000us       0.000us       9.812ms       100.00%       9.812ms       3.271ms             3
                                Activity Buffer Request         9.60%       1.069ms         9.60%       1.069ms       1.069ms       3.224ms        32.85%       3.224ms       3.224ms             1
                                            aten::empty         0.26%      29.363us         0.26%      29.363us       3.263us       0.000us         0.00%       0.000us       0.000us             9
                                       cudaLaunchKernel         2.01%     223.547us         2.01%     223.547us      74.516us       0.000us         0.00%       0.000us       0.000us             3
                                             aten::view         0.04%       4.180us         0.04%       4.180us       0.697us       0.000us         0.00%       0.000us       0.000us             6
                                  cudaDeviceSynchronize        86.87%       9.675ms        86.87%       9.675ms       9.675ms       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 11.138ms
Self CUDA time total: 9.812ms
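
In the traces above, the "CUDA time avg" column is the row's total CUDA time divided by "# of Calls", so the per-launch cost of the layer-norm kernel can be recomputed from the `vectorized_l...` row of each table. The sketch below does that check with values copied from the traces; the helper `to_us` and the dictionary `kernel_rows` are hypothetical names introduced for illustration.

```python
def to_us(text):
    """Convert a profiler time string such as '2.323ms' or '774.498us' to microseconds."""
    units = {"us": 1.0, "ms": 1e3, "s": 1e6}
    # Check the two-letter suffixes before the bare 's'.
    for suffix in ("us", "ms", "s"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unrecognized time string: {text!r}")


# (Self CUDA, CUDA time avg, # of Calls) for the vectorized layer-norm kernel
# row of each trace, copied from the tables above.
kernel_rows = {
    "LN_B16_S2048_D4096": ("2.323ms", "774.498us", 3),
    "LN_B16_S2048_D8192": ("4.915ms", "1.638ms", 3),
    "LN_B16_S4096_D4096": ("4.742ms", "1.581ms", 3),
    "LN_B16_S4096_D8192": ("9.812ms", "3.271ms", 3),
}

for name, (total, avg, calls) in kernel_rows.items():
    derived = to_us(total) / calls
    print(f"{name}: derived {derived:.1f} us per call, reported {to_us(avg):.1f} us")
```

The derived and reported values agree to within the rounding of the printed totals, which suggests the three profiled calls in each trace cost about the same.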

impl                     wl                  p50(ms)  ok
torch_layer_norm         LN_B16_S2048_D4096     0.82  True
torch_layer_norm         LN_B16_S2048_D8192     1.68  True
torch_layer_norm         LN_B16_S4096_D4096     1.61  True
torch_layer_norm         LN_B16_S4096_D8192     3.32  True
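
One way to put the p50 figures in context is to convert them into an effective memory throughput. The sketch below rests on several assumptions that the output above does not confirm: that a workload name such as `LN_B16_S2048_D4096` encodes an input of shape (16, 2048, 4096), that each element occupies 2 bytes (for example bfloat16 or float16), and that the kernel reads the input once and writes the output once, ignoring weight, bias, and statistics traffic. The function names are hypothetical.

```python
import re


def parse_workload(name):
    """Extract (B, S, D) from a name like 'LN_B16_S2048_D4096' (assumed layout)."""
    m = re.fullmatch(r"LN_B(\d+)_S(\d+)_D(\d+)", name)
    if m is None:
        raise ValueError(f"unexpected workload name: {name!r}")
    return tuple(int(g) for g in m.groups())


def effective_gbps(name, p50_ms, bytes_per_elem=2):
    """Bytes for one read plus one write of the tensor, per second, in GB/s."""
    b, s, d = parse_workload(name)
    total_bytes = 2 * b * s * d * bytes_per_elem  # bytes_per_elem is an assumption
    return total_bytes / (p50_ms * 1e-3) / 1e9


# p50 latencies in milliseconds, copied from the summary table above.
results = {
    "LN_B16_S2048_D4096": 0.82,
    "LN_B16_S2048_D8192": 1.68,
    "LN_B16_S4096_D4096": 1.61,
    "LN_B16_S4096_D8192": 3.32,
}

for name, p50 in results.items():
    print(f"{name}: {effective_gbps(name, p50):.0f} GB/s (assuming 2-byte elements)")
```

Under these assumptions the throughput is similar across all four workloads, consistent with the latency roughly doubling each time the element count doubles. If the benchmark uses a different dtype, pass the matching `bytes_per_elem`.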

Artifacts:

layer_norm.jsonl