diff --git "a/layer_norm/impls/torch_layer_norm.html" "b/layer_norm/impls/torch_layer_norm.html"
--- "a/layer_norm/impls/torch_layer_norm.html"
+++ "b/layer_norm/impls/torch_layer_norm.html"
@@ -3887,7 +3887,7 @@ Cell: nv | 0.22s
-Mon Oct 27 14:46:07 2025
+Tue Oct 28 14:08:35 2025
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 570.195.03 Driver Version: 570.195.03 CUDA Version: 12.8 |
 |-----------------------------------------+------------------------+----------------------+
@@ -3896,7 +3896,7 @@ Cell: nv | 0.22s
 | | | MIG M. |
 |=========================================+========================+======================|
 | 0 NVIDIA L40S On | 00000000:4D:00.0 Off | 0 |
-| N/A 31C P0 79W / 350W | 0MiB / 46068MiB | 11% Default |
+| N/A 31C P0 141W / 350W | 0MiB / 46068MiB | 21% Default |
 | | | N/A |
 +-----------------------------------------+------------------------+----------------------+
@@ -3920,7 +3920,7 @@ Cell: nv | 0.22s
 ▼ output ▶ uv-logs |
-Cell: benchmark | 7.77s
+Cell: benchmark | 7.39s
 | Raw
@@ -3959,1117 +3959,105 @@ Cell: benchmark | 7.77s
Running layer_norm benchmark on cuda with 48 workloads. +Running layer_norm benchmark on cuda with 4 workloads. ====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S128_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 117.951us 1284.31% 117.951us 117.951us 1 - torch_layer_norm 8.74% 158.633us 99.57% 1.807ms 1.807ms 0.000us 0.00% 12.352us 12.352us 1 - aten::layer_norm 0.95% 17.160us 90.83% 1.649ms 549.530us 0.000us 0.00% 12.352us 4.117us 3 - aten::native_layer_norm 4.49% 81.559us 89.88% 1.631ms 543.810us 9.184us 100.00% 12.352us 4.117us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 9.184us 100.00% 9.184us 3.061us 3 - Activity Buffer Request 79.88% 1.450ms 79.88% 1.450ms 1.450ms 3.168us 34.49% 3.168us 3.168us 1 - aten::empty 2.58% 46.801us 2.58% 46.801us 5.200us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 2.54% 46.162us 2.54% 46.162us 15.387us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.39% 7.072us 0.39% 7.072us 1.179us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.43% 7.860us 0.43% 7.860us 7.860us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.815ms -Self CUDA time total: 9.184us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S128_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 91.263us 777.10% 91.263us 91.263us 1 - torch_layer_norm 4.45% 73.631us 99.68% 1.650ms 1.650ms 0.000us 0.00% 15.616us 15.616us 1 - aten::layer_norm 0.53% 8.730us 95.23% 1.577ms 525.519us 0.000us 0.00% 15.616us 5.205us 3 - aten::native_layer_norm 3.21% 53.200us 94.70% 1.568ms 522.609us 11.744us 100.00% 15.616us 5.205us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 11.744us 100.00% 11.744us 3.915us 3 - Activity Buffer Request 87.81% 1.454ms 87.81% 1.454ms 1.454ms 3.872us 32.97% 3.872us 3.872us 1 - aten::empty 1.80% 29.853us 1.80% 29.853us 3.317us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 1.64% 27.230us 1.64% 27.230us 9.077us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 3.770us 0.23% 3.770us 0.628us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.32% 5.350us 0.32% 5.350us 5.350us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.656ms -Self CUDA time total: 11.744us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S128_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 93.407us 570.11% 93.407us 93.407us 1 - torch_layer_norm 4.26% 70.071us 99.67% 1.640ms 1.640ms 0.000us 0.00% 21.856us 21.856us 1 - aten::layer_norm 0.57% 9.440us 95.41% 1.570ms 523.176us 0.000us 0.00% 21.856us 7.285us 3 - aten::native_layer_norm 3.17% 52.082us 94.83% 1.560ms 520.029us 16.384us 100.00% 21.856us 7.285us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 16.384us 100.00% 16.384us 5.461us 3 - Activity Buffer Request 87.95% 1.447ms 87.95% 1.447ms 1.447ms 5.472us 33.40% 5.472us 5.472us 1 - aten::empty 1.77% 29.121us 1.77% 29.121us 3.236us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 1.71% 28.080us 1.71% 28.080us 9.360us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.24% 4.030us 0.24% 4.030us 0.672us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.33% 5.460us 0.33% 5.460us 5.460us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.645ms -Self CUDA time total: 16.384us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S128_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 118.239us 440.39% 118.239us 118.239us 1 - torch_layer_norm 5.44% 79.142us 99.61% 1.449ms 1.449ms 0.000us 0.00% 35.810us 35.810us 1 - aten::layer_norm 0.75% 10.900us 94.17% 1.370ms 456.578us 0.000us 0.00% 35.810us 11.937us 3 - aten::native_layer_norm 4.07% 59.211us 93.42% 1.359ms 452.944us 26.849us 100.00% 35.810us 11.937us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 26.849us 100.00% 26.849us 8.950us 3 - Activity Buffer Request 72.70% 1.057ms 72.70% 1.057ms 1.057ms 8.961us 33.38% 8.961us 8.961us 1 - aten::empty 2.44% 35.559us 2.44% 35.559us 3.951us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 13.86% 201.604us 13.86% 201.604us 67.201us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.34% 4.961us 0.34% 4.961us 0.827us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.39% 5.680us 0.39% 5.680us 5.680us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.455ms -Self CUDA time total: 26.849us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S512_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 95.007us 954.65% 95.007us 95.007us 1 - torch_layer_norm 4.08% 72.861us 99.69% 1.782ms 1.782ms 0.000us 0.00% 13.216us 13.216us 1 - aten::layer_norm 0.50% 9.010us 95.61% 1.709ms 569.593us 0.000us 0.00% 13.216us 4.405us 3 - aten::native_layer_norm 3.10% 55.433us 95.11% 1.700ms 566.590us 9.952us 100.00% 13.216us 4.405us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 9.952us 100.00% 9.952us 3.317us 3 - Activity Buffer Request 81.03% 1.448ms 81.03% 1.448ms 1.448ms 3.264us 32.80% 3.264us 3.264us 1 - aten::empty 1.69% 30.250us 1.69% 30.250us 3.361us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 9.05% 161.792us 9.05% 161.792us 53.931us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.100us 0.23% 4.100us 0.683us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.31% 5.520us 0.31% 5.520us 5.520us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.787ms -Self CUDA time total: 9.952us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S512_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 88.574us 668.68% 88.574us 88.574us 1 - torch_layer_norm 15.40% 66.901us 98.88% 429.607us 429.607us 0.000us 0.00% 17.629us 17.629us 1 - aten::layer_norm 2.14% 9.290us 83.48% 362.706us 120.902us 0.000us 0.00% 17.629us 5.876us 3 - aten::native_layer_norm 12.03% 52.280us 81.34% 353.416us 117.805us 13.246us 100.00% 17.629us 5.876us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 13.246us 100.00% 13.246us 4.415us 3 - Activity Buffer Request 26.09% 113.362us 26.09% 113.362us 113.362us 4.383us 33.09% 4.383us 4.383us 1 - aten::empty 6.80% 29.541us 6.80% 29.541us 3.282us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 35.53% 154.353us 35.53% 154.353us 51.451us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.89% 3.880us 0.89% 3.880us 0.647us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.12% 4.880us 1.12% 4.880us 4.880us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 434.487us -Self CUDA time total: 13.246us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S512_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 96.609us 488.49% 96.609us 96.609us 1 - torch_layer_norm 4.03% 71.860us 99.72% 1.776ms 1.776ms 0.000us 0.00% 26.305us 26.305us 1 - aten::layer_norm 0.54% 9.591us 95.68% 1.704ms 568.087us 0.000us 0.00% 26.305us 8.768us 3 - aten::native_layer_norm 2.97% 52.832us 95.14% 1.695ms 564.890us 19.777us 100.00% 26.305us 8.768us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 19.777us 100.00% 19.777us 6.592us 3 - Activity Buffer Request 81.50% 1.452ms 81.50% 1.452ms 1.452ms 6.528us 33.01% 6.528us 6.528us 1 - aten::empty 1.62% 28.940us 1.62% 28.940us 3.216us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.82% 157.073us 8.82% 157.073us 52.358us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.100us 0.23% 4.100us 0.683us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 5.050us 0.28% 5.050us 5.050us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.781ms -Self CUDA time total: 19.777us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S512_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 101.087us 312.17% 101.087us 101.087us 1 - torch_layer_norm 4.21% 75.141us 99.72% 1.779ms 1.779ms 0.000us 0.00% 43.134us 43.134us 1 - aten::layer_norm 0.50% 9.000us 95.50% 1.703ms 567.803us 0.000us 0.00% 43.134us 14.378us 3 - aten::native_layer_norm 3.03% 54.032us 95.00% 1.694ms 564.803us 32.382us 100.00% 43.134us 14.378us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 32.382us 100.00% 32.382us 10.794us 3 - Activity Buffer Request 81.39% 1.452ms 81.39% 1.452ms 1.452ms 10.752us 33.20% 10.752us 10.752us 1 - aten::empty 1.73% 30.799us 1.73% 30.799us 3.422us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.63% 153.894us 8.63% 153.894us 51.298us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 3.990us 0.22% 3.990us 0.665us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 5.050us 0.28% 5.050us 5.050us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.784ms -Self CUDA time total: 32.382us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 84.605us 738.59% 84.605us 84.605us 1 - torch_layer_norm 14.65% 66.062us 98.90% 446.008us 446.008us 0.000us 0.00% 15.231us 15.231us 1 - aten::layer_norm 1.88% 8.459us 84.25% 379.946us 126.649us 0.000us 0.00% 15.231us 5.077us 3 - aten::native_layer_norm 11.07% 49.901us 82.38% 371.487us 123.829us 11.455us 100.00% 15.231us 5.077us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 11.455us 100.00% 11.455us 3.818us 3 - Activity Buffer Request 30.37% 136.933us 30.37% 136.933us 136.933us 3.776us 32.96% 3.776us 3.776us 1 - aten::empty 6.35% 28.620us 6.35% 28.620us 3.180us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 33.76% 152.233us 33.76% 152.233us 50.744us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.84% 3.800us 0.84% 3.800us 0.633us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.10% 4.941us 1.10% 4.941us 4.941us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 450.949us -Self CUDA time total: 11.455us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 95.615us 580.22% 95.615us 95.615us 1 - torch_layer_norm 3.86% 68.250us 99.72% 1.762ms 1.762ms 0.000us 0.00% 21.951us 21.951us 1 - aten::layer_norm 0.50% 8.771us 95.86% 1.694ms 564.703us 0.000us 0.00% 21.951us 7.317us 3 - aten::native_layer_norm 3.18% 56.263us 95.36% 1.685ms 561.780us 16.479us 100.00% 21.951us 7.317us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 16.479us 100.00% 16.479us 5.493us 3 - Activity Buffer Request 81.70% 1.444ms 81.70% 1.444ms 1.444ms 5.472us 33.21% 5.472us 5.472us 1 - aten::empty 1.62% 28.639us 1.62% 28.639us 3.182us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.61% 152.252us 8.61% 152.252us 50.751us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.24% 4.230us 0.24% 4.230us 0.705us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 4.980us 0.28% 4.980us 4.980us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.767ms -Self CUDA time total: 16.479us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 88.894us 345.94% 88.894us 88.894us 1 - torch_layer_norm 15.31% 64.511us 98.72% 416.027us 416.027us 0.000us 0.00% 34.240us 34.240us 1 - aten::layer_norm 2.02% 8.530us 83.41% 351.516us 117.172us 0.000us 0.00% 34.240us 11.413us 3 - aten::native_layer_norm 12.31% 51.881us 81.39% 342.986us 114.329us 25.696us 100.00% 34.240us 11.413us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 25.696us 100.00% 25.696us 8.565us 3 - Activity Buffer Request 25.35% 106.822us 25.35% 106.822us 106.822us 8.544us 33.25% 8.544us 8.544us 1 - aten::empty 6.69% 28.191us 6.69% 28.191us 3.132us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 36.17% 152.423us 36.17% 152.423us 50.808us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.87% 3.669us 0.87% 3.669us 0.612us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.28% 5.400us 1.28% 5.400us 5.400us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 421.427us -Self CUDA time total: 25.696us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S1024_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.99% 70.451us 99.68% 1.760ms 1.760ms 0.000us 0.00% 110.273us 110.273us 1 - aten::layer_norm 0.54% 9.469us 95.69% 1.690ms 563.186us 0.000us 0.00% 110.273us 36.758us 3 - aten::native_layer_norm 2.91% 51.321us 95.15% 1.680ms 560.030us 70.464us 100.00% 110.273us 36.758us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 104.384us 148.14% 104.384us 104.384us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 70.464us 100.00% 70.464us 23.488us 3 - Activity Buffer Request 81.54% 1.440ms 81.54% 1.440ms 1.440ms 39.809us 56.50% 39.809us 39.809us 1 - aten::empty 1.69% 29.812us 1.69% 29.812us 3.312us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.79% 155.141us 8.79% 155.141us 51.714us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.141us 0.23% 4.141us 0.690us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.32% 5.631us 0.32% 5.631us 5.631us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.766ms -Self CUDA time total: 70.464us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 94.879us 526.67% 94.879us 94.879us 1 - torch_layer_norm 3.90% 69.211us 99.68% 1.768ms 1.768ms 0.000us 0.00% 23.935us 23.935us 1 - aten::layer_norm 0.53% 9.340us 95.78% 1.699ms 566.293us 0.000us 0.00% 23.935us 7.978us 3 - aten::native_layer_norm 2.96% 52.430us 95.26% 1.690ms 563.180us 18.015us 100.00% 23.935us 7.978us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 18.015us 100.00% 18.015us 6.005us 3 - Activity Buffer Request 81.67% 1.449ms 81.67% 1.449ms 1.449ms 5.920us 32.86% 5.920us 5.920us 1 - aten::empty 1.69% 29.991us 1.69% 29.991us 3.332us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.72% 154.594us 8.72% 154.594us 51.531us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 3.890us 0.22% 3.890us 0.648us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.32% 5.590us 0.32% 5.590us 5.590us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.774ms -Self CUDA time total: 18.015us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 92.671us 343.53% 92.671us 92.671us 1 - torch_layer_norm 14.22% 66.652us 98.98% 463.858us 463.858us 0.000us 0.00% 35.872us 35.872us 1 - aten::layer_norm 1.92% 9.009us 84.76% 397.206us 132.402us 0.000us 0.00% 35.872us 11.957us 3 - aten::native_layer_norm 11.29% 52.919us 82.83% 388.197us 129.399us 26.976us 100.00% 35.872us 11.957us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 26.976us 100.00% 26.976us 8.992us 3 - Activity Buffer Request 32.20% 150.883us 32.20% 150.883us 150.883us 8.896us 32.98% 8.896us 8.896us 1 - aten::empty 6.01% 28.182us 6.01% 28.182us 3.131us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 32.49% 152.273us 32.49% 152.273us 50.758us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.84% 3.940us 0.84% 3.940us 0.657us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.02% 4.791us 1.02% 4.791us 4.791us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 468.649us -Self CUDA time total: 26.976us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 133.341us 184.87% 133.341us 133.341us 1 - torch_layer_norm 3.93% 69.900us 99.72% 1.772ms 1.772ms 0.000us 0.00% 112.892us 112.892us 1 - aten::layer_norm 0.55% 9.790us 95.79% 1.702ms 567.350us 0.000us 0.00% 112.892us 37.631us 3 - aten::native_layer_norm 3.28% 58.200us 95.24% 1.692ms 564.087us 72.125us 100.00% 112.892us 37.631us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 72.125us 100.00% 72.125us 24.042us 3 - Activity Buffer Request 80.05% 1.422ms 80.05% 1.422ms 1.422ms 40.767us 56.52% 40.767us 40.767us 1 - aten::empty 1.64% 29.113us 1.64% 29.113us 3.235us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 10.01% 177.823us 10.01% 177.823us 59.274us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.27% 4.770us 0.27% 4.770us 0.795us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 4.900us 0.28% 4.900us 4.900us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.777ms -Self CUDA time total: 72.125us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B1_S2048_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 14.68% 65.741us 95.47% 427.658us 427.658us 0.000us 0.00% 230.621us 230.621us 1 - aten::layer_norm 2.04% 9.121us 80.79% 361.917us 120.639us 0.000us 0.00% 230.621us 76.874us 3 - aten::native_layer_norm 11.17% 50.059us 78.75% 352.796us 117.599us 144.510us 100.00% 230.621us 76.874us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 146.014us 101.04% 146.014us 146.014us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 144.510us 100.00% 144.510us 48.170us 3 - Activity Buffer Request 26.04% 116.642us 26.04% 116.642us 116.642us 86.111us 59.59% 86.111us 86.111us 1 - aten::empty 6.43% 28.811us 6.43% 28.811us 3.201us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 34.20% 153.184us 34.20% 153.184us 51.061us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.92% 4.100us 0.92% 4.100us 0.683us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 4.53% 20.311us 4.53% 20.311us 20.311us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 447.969us -Self CUDA time total: 144.510us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S128_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 92.096us 943.61% 92.096us 92.096us 1 - torch_layer_norm 3.85% 68.512us 99.73% 1.773ms 1.773ms 0.000us 0.00% 12.864us 12.864us 1 - aten::layer_norm 0.55% 9.759us 95.87% 1.705ms 568.216us 0.000us 0.00% 12.864us 4.288us 3 - aten::native_layer_norm 3.00% 53.309us 95.32% 1.695ms 564.963us 9.760us 100.00% 12.864us 4.288us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 9.760us 100.00% 9.760us 3.253us 3 - Activity Buffer Request 81.26% 1.445ms 81.26% 1.445ms 1.445ms 3.104us 31.80% 3.104us 3.104us 1 - aten::empty 1.70% 30.172us 1.70% 30.172us 3.352us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 9.14% 162.452us 9.14% 162.452us 54.151us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.24% 4.201us 0.24% 4.201us 0.700us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.27% 4.880us 0.27% 4.880us 4.880us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.778ms -Self CUDA time total: 9.760us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S128_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 91.521us 709.63% 91.521us 91.521us 1 - torch_layer_norm 4.32% 76.641us 99.71% 1.771ms 1.771ms 0.000us 0.00% 17.186us 17.186us 1 - aten::layer_norm 0.52% 9.251us 95.40% 1.694ms 564.620us 0.000us 0.00% 17.186us 5.729us 3 - aten::native_layer_norm 2.94% 52.208us 94.87% 1.685ms 561.536us 12.897us 100.00% 17.186us 5.729us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 12.897us 100.00% 12.897us 4.299us 3 - Activity Buffer Request 81.35% 1.444ms 81.35% 1.444ms 1.444ms 4.289us 33.26% 4.289us 4.289us 1 - aten::empty 1.65% 29.223us 1.65% 29.223us 3.247us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.72% 154.793us 8.72% 154.793us 51.598us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 3.890us 0.22% 3.890us 0.648us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.29% 5.110us 0.29% 5.110us 5.110us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.776ms -Self CUDA time total: 12.897us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S128_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 88.130us 448.50% 88.130us 88.130us 1 - torch_layer_norm 11.06% 64.130us 99.16% 575.190us 575.190us 0.000us 0.00% 26.147us 26.147us 1 - aten::layer_norm 1.59% 9.222us 88.10% 511.060us 170.353us 0.000us 0.00% 26.147us 8.716us 3 - aten::native_layer_norm 8.61% 49.940us 86.51% 501.838us 167.279us 19.650us 100.00% 26.147us 8.716us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 19.650us 100.00% 19.650us 6.550us 3 - Activity Buffer Request 45.46% 263.724us 45.46% 263.724us 263.724us 6.497us 33.06% 6.497us 6.497us 1 - aten::empty 4.97% 28.852us 4.97% 28.852us 3.206us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 26.69% 154.833us 26.69% 154.833us 51.611us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.77% 4.489us 0.77% 4.489us 0.748us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.84% 4.880us 0.84% 4.880us 4.880us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 580.070us -Self CUDA time total: 19.650us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S128_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 92.576us 290.74% 92.576us 92.576us 1 - torch_layer_norm 10.78% 63.911us 99.14% 587.520us 587.520us 0.000us 0.00% 42.562us 42.562us 1 - aten::layer_norm 1.44% 8.510us 88.35% 523.609us 174.536us 0.000us 0.00% 42.562us 14.187us 3 - aten::native_layer_norm 8.62% 51.095us 86.92% 515.099us 171.700us 31.841us 100.00% 42.562us 14.187us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 31.841us 100.00% 31.841us 10.614us 3 - Activity Buffer Request 46.87% 277.744us 46.87% 277.744us 277.744us 10.721us 33.67% 10.721us 10.721us 1 - aten::empty 4.75% 28.169us 4.75% 28.169us 3.130us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 25.92% 153.632us 25.92% 153.632us 51.211us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.75% 4.459us 0.75% 4.459us 0.743us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.86% 5.110us 0.86% 5.110us 5.110us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 592.630us -Self CUDA time total: 31.841us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S512_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 95.776us 539.28% 95.776us 95.776us 1 - torch_layer_norm 13.84% 112.583us 99.26% 807.595us 807.595us 0.000us 0.00% 23.680us 23.680us 1 - aten::layer_norm 1.40% 11.400us 85.42% 695.012us 231.671us 0.000us 0.00% 23.680us 7.893us 3 - aten::native_layer_norm 7.57% 61.601us 84.02% 683.612us 227.871us 17.760us 100.00% 23.680us 7.893us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 17.760us 100.00% 17.760us 5.920us 3 - Activity Buffer Request 33.76% 274.664us 33.76% 274.664us 274.664us 5.920us 33.33% 5.920us 5.920us 1 - aten::empty 3.69% 30.062us 3.69% 30.062us 3.340us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 38.34% 311.955us 38.34% 311.955us 103.985us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.66% 5.330us 0.66% 5.330us 0.888us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.74% 6.030us 0.74% 6.030us 6.030us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 813.625us -Self CUDA time total: 17.760us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S512_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 96.383us 353.93% 96.383us 96.383us 1 - torch_layer_norm 4.14% 80.990us 99.72% 1.949ms 1.949ms 0.000us 0.00% 36.288us 36.288us 1 - aten::layer_norm 0.49% 9.631us 95.58% 1.868ms 622.648us 0.000us 0.00% 36.288us 12.096us 3 - aten::native_layer_norm 2.77% 54.113us 95.09% 1.858ms 619.438us 27.232us 100.00% 36.288us 12.096us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 27.232us 100.00% 27.232us 9.077us 3 - Activity Buffer Request 75.84% 1.482ms 75.84% 1.482ms 1.482ms 9.056us 33.25% 9.056us 9.056us 1 - aten::empty 1.50% 29.320us 1.50% 29.320us 3.258us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 14.76% 288.535us 14.76% 288.535us 96.178us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 4.249us 0.22% 4.249us 0.708us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 5.411us 0.28% 5.411us 5.411us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.954ms -Self CUDA time total: 27.232us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S512_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.80% 69.480us 99.73% 1.822ms 1.822ms 0.000us 0.00% 112.641us 112.641us 1 - aten::layer_norm 0.50% 9.151us 95.93% 1.752ms 584.111us 0.000us 0.00% 112.641us 37.547us 3 - aten::native_layer_norm 2.81% 51.420us 95.43% 1.743ms 581.060us 72.033us 100.00% 112.641us 37.547us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 101.696us 141.18% 101.696us 101.696us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 72.033us 100.00% 72.033us 24.011us 3 - Activity Buffer Request 80.53% 1.471ms 80.53% 1.471ms 1.471ms 40.608us 56.37% 40.608us 40.608us 1 - aten::empty 1.60% 29.163us 1.60% 29.163us 3.240us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 10.27% 187.683us 10.27% 187.683us 62.561us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 3.950us 0.22% 3.950us 0.658us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.27% 4.880us 0.27% 4.880us 4.880us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.827ms -Self CUDA time total: 72.033us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S512_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.85% 68.680us 99.71% 1.780ms 1.780ms 0.000us 0.00% 229.955us 229.955us 1 - aten::layer_norm 0.61% 10.850us 95.86% 1.711ms 570.370us 0.000us 0.00% 229.955us 76.652us 3 - aten::native_layer_norm 3.11% 55.560us 95.26% 1.700ms 566.754us 144.066us 100.00% 229.955us 76.652us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 145.569us 101.04% 145.569us 145.569us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 144.066us 100.00% 144.066us 48.022us 3 - Activity Buffer Request 79.52% 1.419ms 79.52% 1.419ms 1.419ms 85.889us 59.62% 85.889us 85.889us 1 - aten::empty 1.71% 30.551us 1.71% 30.551us 3.395us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 10.67% 190.375us 10.67% 190.375us 63.458us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.24% 4.330us 0.24% 4.330us 0.722us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.29% 5.130us 0.29% 5.130us 5.130us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.785ms -Self CUDA time total: 144.066us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 115.904us 398.90% 115.904us 115.904us 1 - torch_layer_norm 4.36% 77.971us 99.69% 1.781ms 1.781ms 0.000us 0.00% 38.656us 38.656us 1 - aten::layer_norm 0.59% 10.570us 95.33% 1.703ms 567.730us 0.000us 0.00% 38.656us 12.885us 3 - aten::native_layer_norm 3.31% 59.081us 94.74% 1.693ms 564.207us 29.056us 100.00% 38.656us 12.885us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 29.056us 100.00% 29.056us 9.685us 3 - Activity Buffer Request 80.03% 1.430ms 80.03% 1.430ms 1.430ms 9.600us 33.04% 9.600us 9.600us 1 - aten::empty 1.84% 32.962us 1.84% 32.962us 3.662us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 9.29% 165.972us 9.29% 165.972us 55.324us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.27% 4.790us 0.27% 4.790us 0.798us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.31% 5.470us 0.31% 5.470us 5.470us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.787ms -Self CUDA time total: 29.056us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 14.07% 64.760us 98.95% 455.588us 455.588us 0.000us 0.00% 101.120us 101.120us 1 - aten::layer_norm 1.91% 8.791us 84.88% 390.828us 130.276us 0.000us 0.00% 101.120us 33.707us 3 - aten::native_layer_norm 11.79% 54.281us 82.97% 382.037us 127.346us 65.344us 100.00% 101.120us 33.707us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 96.510us 147.70% 96.510us 96.510us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 65.344us 100.00% 65.344us 21.781us 3 - Activity Buffer Request 29.77% 137.072us 29.77% 137.072us 137.072us 35.776us 54.75% 35.776us 35.776us 1 - aten::empty 6.60% 30.402us 6.60% 30.402us 3.378us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 33.93% 156.232us 33.93% 156.232us 52.077us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.88% 4.050us 0.88% 4.050us 0.675us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.05% 4.840us 1.05% 4.840us 4.840us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 460.428us -Self CUDA time total: 65.344us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.83% 67.811us 99.72% 1.767ms 1.767ms 0.000us 0.00% 207.840us 207.840us 1 - aten::layer_norm 0.55% 9.819us 95.89% 1.699ms 566.320us 0.000us 0.00% 207.840us 69.280us 3 - aten::native_layer_norm 3.03% 53.603us 95.34% 1.689ms 563.047us 129.312us 100.00% 207.840us 69.280us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 130.911us 101.24% 130.911us 130.911us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 129.312us 100.00% 129.312us 43.104us 3 - Activity Buffer Request 81.49% 1.444ms 81.49% 1.444ms 1.444ms 78.528us 60.73% 78.528us 78.528us 1 - aten::empty 1.74% 30.830us 1.74% 30.830us 3.426us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.86% 156.973us 8.86% 156.973us 52.324us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.020us 0.23% 4.020us 0.670us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 4.980us 0.28% 4.980us 4.980us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.772ms -Self CUDA time total: 129.312us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S1024_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.13% 68.611us 81.17% 1.779ms 1.779ms 0.000us 0.00% 737.526us 737.526us 1 - aten::layer_norm 0.41% 9.061us 78.04% 1.711ms 570.260us 0.000us 0.00% 737.526us 245.842us 3 - aten::native_layer_norm 2.43% 53.328us 77.62% 1.702ms 567.240us 547.705us 100.00% 737.526us 245.842us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 549.241us 100.28% 549.241us 549.241us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 547.705us 100.00% 547.705us 182.568us 3 - Activity Buffer Request 66.39% 1.455ms 66.39% 1.455ms 1.455ms 189.821us 34.66% 189.821us 189.821us 1 - aten::empty 1.36% 29.741us 1.36% 29.741us 3.305us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 7.27% 159.364us 7.27% 159.364us 53.121us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.18% 3.911us 0.18% 3.911us 0.652us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 18.83% 412.857us 18.83% 412.857us 412.857us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 2.192ms -Self CUDA time total: 547.705us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 13.81% 64.951us 98.91% 465.198us 465.198us 0.000us 0.00% 102.813us 102.813us 1 - aten::layer_norm 2.00% 9.429us 85.10% 400.247us 133.416us 0.000us 0.00% 102.813us 34.271us 3 - aten::native_layer_norm 10.88% 51.150us 83.10% 390.818us 130.273us 68.606us 100.00% 102.813us 34.271us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 100.893us 147.06% 100.893us 100.893us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 68.606us 100.00% 68.606us 22.869us 3 - Activity Buffer Request 31.07% 146.142us 31.07% 146.142us 146.142us 34.207us 49.86% 34.207us 34.207us 1 - aten::empty 6.17% 29.002us 6.17% 29.002us 3.222us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 34.16% 160.644us 34.16% 160.644us 53.548us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.82% 3.880us 0.82% 3.880us 0.647us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.09% 5.121us 1.09% 5.121us 5.121us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 470.319us -Self CUDA time total: 68.606us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.85% 67.820us 99.72% 1.755ms 1.755ms 0.000us 0.00% 204.288us 204.288us 1 - aten::layer_norm 0.52% 9.151us 95.86% 1.687ms 562.280us 0.000us 0.00% 204.288us 68.096us 3 - aten::native_layer_norm 2.95% 51.910us 95.34% 1.678ms 559.230us 129.120us 100.00% 204.288us 68.096us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 130.560us 101.12% 130.560us 130.560us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 129.120us 100.00% 129.120us 43.040us 3 - Activity Buffer Request 81.69% 1.437ms 81.69% 1.437ms 1.437ms 75.168us 58.22% 75.168us 75.168us 1 - aten::empty 1.73% 30.362us 1.73% 30.362us 3.374us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.76% 154.112us 8.76% 154.112us 51.371us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.22% 3.910us 0.22% 3.910us 0.652us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.28% 4.960us 0.28% 4.960us 4.960us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.760ms -Self CUDA time total: 129.120us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.24% 70.231us 80.97% 1.754ms 1.754ms 0.000us 0.00% 714.792us 714.792us 1 - aten::layer_norm 0.42% 9.200us 77.73% 1.684ms 561.233us 0.000us 0.00% 714.792us 238.264us 3 - aten::native_layer_norm 2.38% 51.610us 77.31% 1.674ms 558.166us 542.598us 100.00% 714.792us 238.264us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 544.071us 100.27% 544.071us 544.071us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 542.598us 100.00% 542.598us 180.866us 3 - Activity Buffer Request 66.26% 1.435ms 66.26% 1.435ms 1.435ms 172.194us 31.74% 172.194us 172.194us 1 - aten::empty 1.34% 28.942us 1.34% 28.942us 3.216us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 7.14% 154.623us 7.14% 154.623us 51.541us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.19% 4.030us 0.19% 4.030us 0.672us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 19.03% 412.116us 19.03% 412.116us 412.116us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 2.166ms -Self CUDA time total: 542.598us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B4_S2048_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 2.50% 69.210us 63.28% 1.753ms 1.753ms 0.000us 0.00% 1.482ms 1.482ms 1 - aten::layer_norm 0.34% 9.550us 60.78% 1.684ms 561.333us 0.000us 0.00% 1.482ms 494.135us 3 - aten::native_layer_norm 1.89% 52.442us 60.43% 1.674ms 558.150us 1.150ms 100.00% 1.482ms 494.135us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 1.151ms 100.12% 1.151ms 1.151ms 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 1.150ms 100.00% 1.150ms 383.212us 3 - Activity Buffer Request 51.68% 1.432ms 51.68% 1.432ms 1.432ms 332.769us 28.95% 332.769us 332.769us 1 - aten::empty 1.10% 30.460us 1.10% 30.460us 3.384us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 5.62% 155.772us 5.62% 155.772us 51.924us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.14% 3.891us 0.14% 3.891us 0.649us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 36.72% 1.018ms 36.72% 1.018ms 1.018ms 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 2.771ms -Self CUDA time total: 1.150ms - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S128_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 86.813us 481.04% 86.813us 86.813us 1 - torch_layer_norm 13.94% 63.610us 98.78% 450.788us 450.788us 0.000us 0.00% 23.966us 23.966us 1 - aten::layer_norm 1.92% 8.751us 84.84% 387.178us 129.059us 0.000us 0.00% 23.966us 7.989us 3 - aten::native_layer_norm 11.33% 51.701us 82.93% 378.427us 126.142us 18.047us 100.00% 23.966us 7.989us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 18.047us 100.00% 18.047us 6.016us 3 - Activity Buffer Request 30.87% 140.892us 30.87% 140.892us 140.892us 5.919us 32.80% 5.919us 5.919us 1 - aten::empty 6.07% 27.691us 6.07% 27.691us 3.077us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 33.75% 154.013us 33.75% 154.013us 51.338us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.91% 4.130us 0.91% 4.130us 0.688us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.22% 5.560us 1.22% 5.560us 5.560us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 456.348us -Self CUDA time total: 18.047us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S128_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 94.272us 347.01% 94.272us 94.272us 1 - torch_layer_norm 3.87% 67.581us 99.70% 1.743ms 1.743ms 0.000us 0.00% 36.063us 36.063us 1 - aten::layer_norm 0.54% 9.410us 95.84% 1.675ms 558.423us 0.000us 0.00% 36.063us 12.021us 3 - aten::native_layer_norm 3.00% 52.431us 95.30% 1.666ms 555.286us 27.167us 100.00% 36.063us 12.021us 3 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 27.167us 100.00% 27.167us 9.056us 3 - Activity Buffer Request 81.64% 1.427ms 81.64% 1.427ms 1.427ms 8.896us 32.75% 8.896us 8.896us 1 - aten::empty 1.64% 28.640us 1.64% 28.640us 3.182us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.79% 153.563us 8.79% 153.563us 51.188us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.090us 0.23% 4.090us 0.682us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.30% 5.160us 0.30% 5.160us 5.160us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.748ms -Self CUDA time total: 27.167us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S128_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 15.30% 64.290us 98.85% 415.327us 415.327us 0.000us 0.00% 113.182us 113.182us 1 - aten::layer_norm 1.89% 7.931us 83.55% 351.037us 117.012us 0.000us 0.00% 113.182us 37.727us 3 - aten::native_layer_norm 12.15% 51.059us 81.66% 343.106us 114.369us 72.639us 100.00% 113.182us 37.727us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 97.758us 134.58% 97.758us 97.758us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 72.639us 100.00% 72.639us 24.213us 3 - Activity Buffer Request 25.15% 105.652us 25.15% 105.652us 105.652us 40.543us 55.81% 40.543us 40.543us 1 - aten::empty 7.08% 29.763us 7.08% 29.763us 3.307us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 36.37% 152.792us 36.37% 152.792us 50.931us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.91% 3.840us 0.91% 3.840us 0.640us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 1.15% 4.831us 1.15% 4.831us 4.831us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 420.158us -Self CUDA time total: 72.639us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S128_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.89% 68.361us 99.32% 1.748ms 1.748ms 0.000us 0.00% 226.432us 226.432us 1 - aten::layer_norm 0.51% 8.970us 95.44% 1.679ms 559.750us 0.000us 0.00% 226.432us 75.477us 3 - aten::native_layer_norm 3.03% 53.343us 94.93% 1.670ms 556.760us 142.207us 100.00% 226.432us 75.477us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 143.552us 100.95% 143.552us 143.552us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 142.207us 100.00% 142.207us 47.402us 3 - Activity Buffer Request 81.27% 1.430ms 81.27% 1.430ms 1.430ms 84.225us 59.23% 84.225us 84.225us 1 - aten::empty 1.69% 29.760us 1.69% 29.760us 3.307us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.71% 153.172us 8.71% 153.172us 51.057us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.23% 4.080us 0.23% 4.080us 0.680us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.68% 11.911us 0.68% 11.911us 11.911us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.760ms -Self CUDA time total: 142.207us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S512_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 3.86% 67.581us 99.71% 1.745ms 1.745ms 0.000us 0.00% 103.967us 103.967us 1 - aten::layer_norm 0.51% 8.910us 95.84% 1.677ms 559.073us 0.000us 0.00% 103.967us 34.656us 3 - aten::native_layer_norm 3.07% 53.660us 95.33% 1.668ms 556.103us 69.343us 100.00% 103.967us 34.656us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 103.487us 149.24% 103.487us 103.487us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 69.343us 100.00% 69.343us 23.114us 3 - Activity Buffer Request 81.52% 1.427ms 81.52% 1.427ms 1.427ms 34.624us 49.93% 34.624us 34.624us 1 - aten::empty 1.61% 28.261us 1.61% 28.261us 3.140us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 8.90% 155.753us 8.90% 155.753us 51.918us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.24% 4.120us 0.24% 4.120us 0.687us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.29% 5.151us 0.29% 5.151us 5.151us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.750ms -Self CUDA time total: 69.343us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S512_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 11.35% 67.490us 99.15% 589.690us 589.690us 0.000us 0.00% 202.330us 202.330us 1 - aten::layer_norm 1.44% 8.590us 87.80% 522.200us 174.067us 0.000us 0.00% 202.330us 67.443us 3 - aten::native_layer_norm 8.41% 50.041us 86.35% 513.610us 171.203us 128.124us 100.00% 202.330us 67.443us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 129.692us 101.22% 129.692us 129.692us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 128.124us 100.00% 128.124us 42.708us 3 - Activity Buffer Request 46.63% 277.315us 46.63% 277.315us 277.315us 74.206us 57.92% 74.206us 74.206us 1 - aten::empty 4.68% 27.831us 4.68% 27.831us 3.092us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 25.89% 153.973us 25.89% 153.973us 51.324us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.75% 4.450us 0.75% 4.450us 0.742us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 0.85% 5.080us 0.85% 5.080us 5.080us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 594.770us -Self CUDA time total: 128.124us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S512_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 6.87% 68.511us 58.17% 579.770us 579.770us 0.000us 0.00% 720.407us 720.407us 1 - aten::layer_norm 0.88% 8.821us 51.29% 511.259us 170.420us 0.000us 0.00% 720.407us 240.136us 3 - aten::native_layer_norm 5.17% 51.521us 50.41% 502.438us 167.479us 546.073us 100.00% 720.407us 240.136us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 547.577us 100.28% 547.577us 547.577us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 546.073us 100.00% 546.073us 182.024us 3 - Activity Buffer Request 26.52% 264.294us 26.52% 264.294us 264.294us 174.334us 31.93% 174.334us 174.334us 1 - aten::empty 2.91% 29.030us 2.91% 29.030us 3.226us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 15.39% 153.384us 15.39% 153.384us 51.128us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.42% 4.209us 0.42% 4.209us 0.702us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 41.83% 416.987us 41.83% 416.987us 416.987us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 996.757us -Self CUDA time total: 546.073us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S512_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 4.10% 64.241us 34.57% 541.829us 541.829us 0.000us 0.00% 1.480ms 1.480ms 1 - aten::layer_norm 0.55% 8.560us 30.47% 477.588us 159.196us 0.000us 0.00% 1.480ms 493.436us 3 - aten::native_layer_norm 3.24% 50.830us 29.93% 469.028us 156.343us 1.149ms 100.00% 1.480ms 493.436us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 1.151ms 100.12% 1.151ms 1.151ms 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 1.149ms 100.00% 1.149ms 383.133us 3 - Activity Buffer Request 14.86% 232.814us 14.86% 232.814us 232.814us 330.909us 28.79% 330.909us 330.909us 1 - aten::empty 1.86% 29.081us 1.86% 29.081us 3.231us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 9.70% 152.022us 9.70% 152.022us 50.674us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.27% 4.281us 0.27% 4.281us 0.713us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 65.43% 1.025ms 65.43% 1.025ms 1.025ms 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.567ms -Self CUDA time total: 1.149ms - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D1024 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 10.87% 65.290us 97.50% 585.660us 585.660us 0.000us 0.00% 211.160us 211.160us 1 - aten::layer_norm 1.49% 8.961us 86.63% 520.370us 173.457us 0.000us 0.00% 211.160us 70.387us 3 - aten::native_layer_norm 8.59% 51.600us 85.14% 511.409us 170.470us 139.579us 100.00% 211.160us 70.387us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 140.987us 101.01% 140.987us 140.987us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 139.579us 100.00% 139.579us 46.526us 3 - Activity Buffer Request 45.81% 275.144us 45.81% 275.144us 275.144us 71.581us 51.28% 71.581us 71.581us 1 - aten::empty 4.65% 27.942us 4.65% 27.942us 3.105us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 25.42% 152.693us 25.42% 152.693us 50.898us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.67% 4.030us 0.67% 4.030us 0.672us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 2.50% 14.990us 2.50% 14.990us 14.990us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 600.650us -Self CUDA time total: 139.579us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D2048 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 6.53% 63.420us 56.04% 544.209us 544.209us 0.000us 0.00% 725.021us 725.021us 1 - aten::layer_norm 0.90% 8.770us 49.51% 480.789us 160.263us 0.000us 0.00% 725.021us 241.674us 3 - aten::native_layer_norm 5.25% 50.982us 48.61% 472.019us 157.340us 551.902us 100.00% 725.021us 241.674us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 553.342us 100.26% 553.342us 553.342us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 551.902us 100.00% 551.902us 183.967us 3 - Activity Buffer Request 24.17% 234.744us 24.17% 234.744us 234.744us 173.119us 31.37% 173.119us 173.119us 1 - aten::empty 3.03% 29.450us 3.03% 29.450us 3.272us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 15.70% 152.482us 15.70% 152.482us 50.827us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.45% 4.361us 0.45% 4.361us 0.727us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 43.96% 426.887us 43.96% 426.887us 426.887us 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 971.096us -Self CUDA time total: 551.902us - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D4096 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 4.07% 66.881us 38.72% 635.751us 635.751us 0.000us 0.00% 1.469ms 1.469ms 1 - aten::layer_norm 0.55% 9.009us 34.64% 568.870us 189.623us 0.000us 0.00% 1.469ms 489.666us 3 - aten::native_layer_norm 3.27% 53.630us 34.10% 559.861us 186.620us 1.138ms 100.00% 1.469ms 489.666us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 1.139ms 100.13% 1.139ms 1.139ms 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 1.138ms 100.00% 1.138ms 379.279us 3 - Activity Buffer Request 19.12% 313.985us 19.12% 313.985us 313.985us 331.162us 29.10% 331.162us 331.162us 1 - aten::empty 1.88% 30.903us 1.88% 30.903us 3.434us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 9.57% 157.133us 9.57% 157.133us 52.378us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.26% 4.210us 0.26% 4.210us 0.702us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 61.28% 1.006ms 61.28% 1.006ms 1.006ms 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.642ms -Self CUDA time total: 1.138ms - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S1024_D8192 -====================================================================== -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 2.42% 65.690us 15.85% 430.707us 430.707us 0.000us 0.00% 3.155ms 3.155ms 1 - aten::layer_norm 0.35% 9.490us 13.44% 365.017us 121.672us 0.000us 0.00% 3.155ms 1.052ms 3 - aten::native_layer_norm 1.79% 48.727us 13.09% 355.527us 118.509us 2.409ms 100.00% 3.155ms 1.052ms 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 2.410ms 100.06% 2.410ms 2.410ms 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 2.409ms 100.00% 2.409ms 802.859us 3 - Activity Buffer Request 4.38% 118.922us 4.38% 118.922us 118.922us 746.656us 31.00% 746.656us 746.656us 1 - aten::empty 1.13% 30.624us 1.13% 30.624us 3.403us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 5.65% 153.412us 5.65% 153.412us 51.137us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.14% 3.842us 0.14% 3.842us 0.640us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 84.15% 2.286ms 84.15% 2.286ms 2.286ms 0.000us 0.00% 0.000us 0.000us 1 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 2.717ms -Self CUDA time total: 2.409ms - - - -====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D1024 +PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096 ====================================================================== ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 6.72% 66.011us 55.62% 546.350us 546.350us 0.000us 0.00% 735.937us 735.937us 1 - aten::layer_norm 0.92% 8.990us 48.90% 480.339us 160.113us 0.000us 0.00% 735.937us 245.312us 3 - aten::native_layer_norm 5.16% 50.724us 47.98% 471.349us 157.116us 560.097us 100.00% 735.937us 245.312us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 561.633us 100.27% 561.633us 561.633us 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 560.097us 100.00% 560.097us 186.699us 3 - Activity Buffer Request 23.82% 234.014us 23.82% 234.014us 234.014us 175.840us 31.39% 175.840us 175.840us 1 - aten::empty 2.88% 28.270us 2.88% 28.270us 3.141us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 15.72% 154.402us 15.72% 154.402us 51.467us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.40% 3.939us 0.40% 3.939us 0.656us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 44.38% 435.997us 44.38% 435.997us 435.997us 0.000us 0.00% 0.000us 0.000us 1 + torch_layer_norm 3.94% 153.126us 46.06% 1.791ms 1.791ms 0.000us 0.00% 3.027ms 3.027ms 1 + aten::layer_norm 0.44% 17.151us 42.12% 1.638ms 545.972us 0.000us 0.00% 3.027ms 1.009ms 3 + aten::native_layer_norm 1.99% 77.265us 41.68% 1.621ms 540.255us 2.317ms 100.00% 3.027ms 1.009ms 3 + torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 2.318ms 100.06% 2.318ms 2.318ms 1 +void at::native::(anonymous namespace)::vectorized_l... 0.00% 0.000us 0.00% 0.000us 0.000us 2.317ms 100.00% 2.317ms 772.230us 3 + Activity Buffer Request 37.14% 1.444ms 37.14% 1.444ms 1.444ms 709.980us 30.65% 709.980us 709.980us 1 + aten::empty 1.21% 46.960us 1.21% 46.960us 5.218us 0.000us 0.00% 0.000us 0.000us 9 + cudaLaunchKernel 1.16% 45.271us 1.16% 45.271us 15.090us 0.000us 0.00% 0.000us 0.000us 3 + aten::view 0.18% 7.130us 0.18% 7.130us 1.188us 0.000us 0.00% 0.000us 0.000us 6 + cudaDeviceSynchronize 53.94% 2.098ms 53.94% 2.098ms 2.098ms 0.000us 0.00% 0.000us 0.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 982.347us -Self CUDA time total: 560.097us +Self CPU time total: 3.889ms +Self CUDA time total: 2.317ms ====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D2048 +PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192 
====================================================================== ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 4.56% 64.832us 29.06% 412.897us 412.897us 0.000us 0.00% 1.469ms 1.469ms 1 - aten::layer_norm 0.65% 9.228us 24.50% 348.065us 116.022us 0.000us 0.00% 1.469ms 489.663us 3 - aten::native_layer_norm 3.69% 52.410us 23.85% 338.837us 112.946us 1.133ms 100.00% 1.469ms 489.663us 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 1.135ms 100.12% 1.135ms 1.135ms 1 -void at::native::(anonymous namespace)::vectorized_l... 0.00% 0.000us 0.00% 0.000us 0.000us 1.133ms 100.00% 1.133ms 377.716us 3 - Activity Buffer Request 7.07% 100.442us 7.07% 100.442us 100.442us 335.839us 29.64% 335.839us 335.839us 1 - aten::empty 2.06% 29.311us 2.06% 29.311us 3.257us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 10.76% 152.823us 10.76% 152.823us 50.941us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.27% 3.851us 0.27% 3.851us 0.642us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 70.94% 1.008ms 70.94% 1.008ms 1.008ms 0.000us 0.00% 0.000us 0.000us 1 + torch_layer_norm 1.11% 71.092us 25.40% 1.622ms 1.622ms 0.000us 0.00% 6.494ms 6.494ms 1 + aten::layer_norm 0.16% 10.119us 24.29% 1.551ms 517.038us 0.000us 0.00% 6.494ms 2.165ms 3 + aten::native_layer_norm 0.82% 52.103us 24.13% 1.541ms 513.665us 4.898ms 100.00% 6.494ms 2.165ms 3 + torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 4.899ms 100.03% 4.899ms 4.899ms 1 +void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 4.898ms 100.00% 4.898ms 1.633ms 3 + Activity Buffer Request 22.36% 1.428ms 22.36% 1.428ms 1.428ms 1.596ms 32.59% 1.596ms 1.596ms 1 + aten::empty 0.49% 31.052us 0.49% 31.052us 3.450us 0.000us 0.00% 0.000us 0.000us 9 + cudaLaunchKernel 0.41% 26.160us 0.41% 26.160us 8.720us 0.000us 0.00% 0.000us 0.000us 3 + aten::view 0.06% 3.830us 0.06% 3.830us 0.638us 0.000us 0.00% 0.000us 0.000us 6 + cudaDeviceSynchronize 74.60% 4.764ms 74.60% 4.764ms 4.764ms 0.000us 0.00% 0.000us 0.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 1.421ms -Self CUDA time total: 1.133ms +Self CPU time total: 6.386ms +Self CUDA time total: 4.898ms ====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D4096 +PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D4096 ====================================================================== ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 2.43% 67.770us 21.38% 597.070us 597.070us 0.000us 0.00% 3.032ms 3.032ms 1 - aten::layer_norm 0.34% 9.401us 18.95% 529.300us 176.433us 0.000us 0.00% 3.032ms 1.011ms 3 - aten::native_layer_norm 1.84% 51.400us 18.61% 519.899us 173.300us 2.325ms 100.00% 3.032ms 1.011ms 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 2.327ms 100.06% 2.327ms 2.327ms 1 -void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 2.325ms 100.00% 2.325ms 775.112us 3 - Activity Buffer Request 9.90% 276.585us 9.90% 276.585us 276.585us 706.558us 30.39% 706.558us 706.558us 1 - aten::empty 1.09% 30.392us 1.09% 30.392us 3.377us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 5.64% 157.652us 5.64% 157.652us 52.551us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.14% 3.870us 0.14% 3.870us 0.645us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 78.62% 2.196ms 78.62% 2.196ms 2.196ms 0.000us 0.00% 0.000us 0.000us 1 + torch_layer_norm 1.17% 72.893us 26.00% 1.616ms 1.616ms 0.000us 0.00% 6.248ms 6.248ms 1 + aten::layer_norm 0.15% 9.290us 24.82% 1.543ms 514.468us 0.000us 0.00% 6.248ms 2.083ms 3 + aten::native_layer_norm 0.84% 52.403us 24.67% 1.534ms 511.371us 4.735ms 100.00% 6.248ms 2.083ms 3 + torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 4.736ms 100.03% 4.736ms 4.736ms 1 +void at::native::(anonymous namespace)::vectorized_l... 0.00% 0.000us 0.00% 0.000us 0.000us 4.735ms 100.00% 4.735ms 1.578ms 3 + Activity Buffer Request 22.86% 1.421ms 22.86% 1.421ms 1.421ms 1.513ms 31.96% 1.513ms 1.513ms 1 + aten::empty 0.47% 29.320us 0.47% 29.320us 3.258us 0.000us 0.00% 0.000us 0.000us 9 + cudaLaunchKernel 0.43% 26.781us 0.43% 26.781us 8.927us 0.000us 0.00% 0.000us 0.000us 3 + aten::view 0.07% 4.140us 0.07% 4.140us 0.690us 0.000us 0.00% 0.000us 0.000us 6 + cudaDeviceSynchronize 74.00% 4.601ms 74.00% 4.601ms 4.601ms 0.000us 0.00% 0.000us 0.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 2.793ms -Self CUDA time total: 2.325ms +Self CPU time total: 6.218ms +Self CUDA time total: 4.735ms ====================================================================== -PROFILE TRACE: torch_layer_norm | LN_B16_S2048_D8192 +PROFILE TRACE: torch_layer_norm | LN_B16_S4096_D8192 
====================================================================== ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - torch_layer_norm 1.28% 68.262us 10.71% 572.390us 572.390us 0.000us 0.00% 6.493ms 6.493ms 1 - aten::layer_norm 0.16% 8.770us 9.43% 504.128us 168.043us 0.000us 0.00% 6.493ms 2.164ms 3 - aten::native_layer_norm 0.96% 51.508us 9.27% 495.358us 165.119us 4.900ms 100.00% 6.493ms 2.164ms 3 - torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 4.901ms 100.03% 4.901ms 4.901ms 1 -void at::native::(anonymous namespace)::vectorized_l... 0.00% 0.000us 0.00% 0.000us 0.000us 4.900ms 100.00% 4.900ms 1.633ms 3 - Activity Buffer Request 4.74% 253.634us 4.74% 253.634us 253.634us 1.594ms 32.53% 1.594ms 1.594ms 1 - aten::empty 0.56% 29.682us 0.56% 29.682us 3.298us 0.000us 0.00% 0.000us 0.000us 9 - cudaLaunchKernel 2.93% 156.523us 2.93% 156.523us 52.174us 0.000us 0.00% 0.000us 0.000us 3 - aten::view 0.08% 4.011us 0.08% 4.011us 0.669us 0.000us 0.00% 0.000us 0.000us 6 - cudaDeviceSynchronize 89.29% 4.774ms 89.29% 4.774ms 4.774ms 0.000us 0.00% 0.000us 0.000us 1 + torch_layer_norm 0.66% 74.633us 14.54% 1.650ms 1.650ms 0.000us 0.00% 13.090ms 13.090ms 1 + aten::layer_norm 0.09% 9.800us 13.88% 1.575ms 525.028us 0.000us 0.00% 13.090ms 4.363ms 3 + aten::native_layer_norm 0.45% 51.390us 13.79% 1.565ms 521.762us 9.838ms 100.00% 13.090ms 4.363ms 3 + torch_layer_norm 0.00% 0.000us 0.00% 0.000us 0.000us 9.839ms 100.01% 9.839ms 9.839ms 1 +void at::native::(anonymous namespace)::vectorized_l... 
0.00% 0.000us 0.00% 0.000us 0.000us 9.838ms 100.00% 9.838ms 3.279ms 3 + Activity Buffer Request 11.36% 1.289ms 11.36% 1.289ms 1.289ms 3.253ms 33.06% 3.253ms 3.253ms 1 + aten::empty 0.28% 31.381us 0.28% 31.381us 3.487us 0.000us 0.00% 0.000us 0.000us 9 + cudaLaunchKernel 1.67% 189.088us 1.67% 189.088us 63.029us 0.000us 0.00% 0.000us 0.000us 3 + aten::view 0.04% 4.121us 0.04% 4.121us 0.687us 0.000us 0.00% 0.000us 0.000us 6 + cudaDeviceSynchronize 85.46% 9.697ms 85.46% 9.697ms 9.697ms 0.000us 0.00% 0.000us 0.000us 1 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 5.346ms -Self CUDA time total: 4.900ms +Self CPU time total: 11.347ms +Self CUDA time total: 9.838ms impl wl p50(ms) ok -torch_layer_norm LN_B16_S1024_D1024 0.05 False -torch_layer_norm LN_B16_S1024_D2048 0.21 False -torch_layer_norm LN_B16_S1024_D4096 0.42 False -torch_layer_norm LN_B16_S1024_D8192 0.85 False -torch_layer_norm LN_B16_S128_D1024 0.03 False -torch_layer_norm LN_B16_S128_D2048 0.03 False -torch_layer_norm LN_B16_S128_D4096 0.04 False -torch_layer_norm LN_B16_S128_D8192 0.05 False -torch_layer_norm LN_B16_S2048_D1024 0.21 False -torch_layer_norm LN_B16_S2048_D2048 0.42 False -torch_layer_norm LN_B16_S2048_D4096 0.82 False -torch_layer_norm LN_B16_S2048_D8192 1.68 False -torch_layer_norm LN_B16_S512_D1024 0.04 False -torch_layer_norm LN_B16_S512_D2048 0.05 False -torch_layer_norm LN_B16_S512_D4096 0.21 False -torch_layer_norm LN_B16_S512_D8192 0.43 False -torch_layer_norm LN_B1_S1024_D1024 0.03 False -torch_layer_norm LN_B1_S1024_D2048 0.03 False -torch_layer_norm LN_B1_S1024_D4096 0.03 False -torch_layer_norm LN_B1_S1024_D8192 0.04 False -torch_layer_norm LN_B1_S128_D1024 0.02 False -torch_layer_norm LN_B1_S128_D2048 0.03 False -torch_layer_norm LN_B1_S128_D4096 0.03 False -torch_layer_norm LN_B1_S128_D8192 0.03 False -torch_layer_norm 
LN_B1_S2048_D1024 0.03 False
-torch_layer_norm LN_B1_S2048_D2048 0.03 False
-torch_layer_norm LN_B1_S2048_D4096 0.04 False
-torch_layer_norm LN_B1_S2048_D8192 0.05 False
-torch_layer_norm LN_B1_S512_D1024 0.03 False
-torch_layer_norm LN_B1_S512_D2048 0.03 False
-torch_layer_norm LN_B1_S512_D4096 0.03 False
-torch_layer_norm LN_B1_S512_D8192 0.03 False
-torch_layer_norm LN_B4_S1024_D1024 0.03 False
-torch_layer_norm LN_B4_S1024_D2048 0.04 False
-torch_layer_norm LN_B4_S1024_D4096 0.05 False
-torch_layer_norm LN_B4_S1024_D8192 0.20 False
-torch_layer_norm LN_B4_S128_D1024 0.03 False
-torch_layer_norm LN_B4_S128_D2048 0.03 False
-torch_layer_norm LN_B4_S128_D4096 0.03 False
-torch_layer_norm LN_B4_S128_D8192 0.03 False
-torch_layer_norm LN_B4_S2048_D1024 0.04 False
-torch_layer_norm LN_B4_S2048_D2048 0.05 False
-torch_layer_norm LN_B4_S2048_D4096 0.21 False
-torch_layer_norm LN_B4_S2048_D8192 0.44 False
-torch_layer_norm LN_B4_S512_D1024 0.03 False
-torch_layer_norm LN_B4_S512_D2048 0.03 False
-torch_layer_norm LN_B4_S512_D4096 0.04 False
-torch_layer_norm LN_B4_S512_D8192 0.05 False
+torch_layer_norm LN_B16_S2048_D4096 0.82 True
+torch_layer_norm LN_B16_S2048_D8192 1.68 True
+torch_layer_norm LN_B16_S4096_D4096 1.61 True
+torch_layer_norm LN_B16_S4096_D8192 3.33 True