
HF Kernels - Causal Conv1D


GPU Info

Cell: nv | 0.21s

import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

Tue Oct 28 14:08:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   28C    P0             80W /  350W |       0MiB /  46068MiB |     19%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Causal Conv1D Benchmark

Cell: benchmark | 9.91s

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the causal conv1d kernel
causal_conv1d = get_kernel("kernels-community/causal-conv1d")


# Thin wrapper that forwards (input, weight, bias) to the loaded kernel
def hf_kernels_causal_conv1d(input_tensor, weight, bias):
    return causal_conv1d.causal_conv1d_fn(input_tensor, weight, bias)


run_benchmark(
    kernel_type=KernelTypeEnum.CAUSAL_CONV1D,
    impl_name="hf_kernels_causal_conv1d",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_causal_conv1d,
)
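
For orientation, the operation being benchmarked can be sketched in plain Python. This is a minimal reference under stated assumptions, not the Hub kernel's implementation: it assumes a depthwise (per-channel) convolution whose input is implicitly left-padded with `width - 1` zeros, so that output position `t` depends only on inputs at positions `<= t`. The name `causal_conv1d_reference` and the nested-list layout `[batch][dim][seqlen]` are illustrative choices, and any activation the real kernel may apply is left out.

```python
def causal_conv1d_reference(x, weight, bias=None):
    """Depthwise causal 1D convolution on nested lists (illustrative only).

    x:      [batch][dim][seqlen]
    weight: [dim][width]
    bias:   [dim] or None
    """
    out = []
    for sample in x:
        sample_out = []
        for d, channel in enumerate(sample):
            taps = weight[d]
            width = len(taps)
            b = bias[d] if bias is not None else 0.0
            row = []
            for t in range(len(channel)):
                acc = b
                for j, w in enumerate(taps):
                    src = t - (width - 1) + j  # never reaches past position t
                    if src >= 0:               # positions before the start act as zeros
                        acc += w * channel[src]
                row.append(acc)
            sample_out.append(row)
        out.append(sample_out)
    return out


# One sample, one channel, width 2: each output mixes the previous and current input.
x = [[[1.0, 2.0, 3.0, 4.0]]]
weight = [[10.0, 1.0]]
print(causal_conv1d_reference(x, weight))
# [[[1.0, 12.0, 23.0, 34.0]]]
```

The first output has no predecessor, so only the last tap contributes; that is the "causal" part of the name.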
Running causal_conv1d benchmark on cuda with 24 workloads.
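
The workload names in the traces that follow use a pattern such as `cuda_B2_D64_S128_W2`. A small helper can split such a name into its parts. The reading of the fields (B as batch size, D as channel dimension, S as sequence length, W as convolution width) is an assumption based on the names alone, and `parse_workload` is a hypothetical helper, not part of `kernels_benchmark_tools`.

```python
import re

# Assumed field meanings: B=batch, D=channels, S=sequence length, W=kernel width.
WORKLOAD_RE = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<batch>\d+)_D(?P<dim>\d+)_S(?P<seqlen>\d+)_W(?P<width>\d+)$"
)


def parse_workload(name):
    """Split a workload name like 'cuda_B2_D64_S128_W2' into a dict."""
    match = WORKLOAD_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized workload name: {name!r}")
    fields = match.groupdict()
    device = fields.pop("device")
    # Everything except the device is numeric.
    return {"device": device, **{key: int(value) for key, value in fields.items()}}


print(parse_workload("cuda_B2_D2048_S512_W4"))
# {'device': 'cuda', 'batch': 2, 'dim': 2048, 'seqlen': 512, 'width': 4}
```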
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     153.312us      3772.44%     153.312us     153.312us             1  
+                               hf_kernels_causal_conv1d         8.26%     153.696us        99.59%       1.854ms       1.854ms       0.000us         0.00%       5.504us       5.504us             1  
+                                         CausalConv1dFn         6.06%     112.844us        91.33%       1.700ms     566.616us       0.000us         0.00%       5.504us       1.835us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.41%      26.281us        81.37%       1.514ms     504.821us       4.064us       100.00%       5.504us       1.835us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.064us       100.00%       4.064us       1.355us             3  
+                                Activity Buffer Request        77.27%       1.438ms        77.27%       1.438ms       1.438ms       1.440us        35.43%       1.440us       1.440us             1  
+                                       aten::empty_like         1.15%      21.339us         3.90%      72.543us      24.181us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         2.75%      51.204us         2.75%      51.204us      17.068us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         2.69%      50.001us         2.69%      50.001us      16.667us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.41%       7.700us         0.41%       7.700us       7.700us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.861ms
+Self CUDA time total: 4.064us
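
Some percentages in the trace above look odd at first, for example 3772.44% in the Self CUDA % column. They appear to be computed relative to the `Self CUDA time total` printed under the table, which here equals the kernel time alone. The arithmetic below re-derives a few cells of the `cuda_B2_D64_S128_W2` trace from the values shown. The variable names are illustrative, and the reading of the columns is an inference from these numbers rather than a statement about the profiler's internals.

```python
# Values copied from the cuda_B2_D64_S128_W2 trace, in microseconds.
self_cuda_total = 4.064   # "Self CUDA time total" under the table
annotation_span = 153.312  # first hf_kernels_causal_conv1d row, Self CUDA column
buffer_request = 1.440     # "Activity Buffer Request", Self CUDA column
kernel_calls = 3           # "# of Calls" for causal_conv1d_fwd_kernel


def pct(part, whole):
    """Percentage of `whole`, rounded the way the table prints it."""
    return round(100.0 * part / whole, 2)


annotation_pct = pct(annotation_span, self_cuda_total)        # 3772.44
buffer_pct = pct(buffer_request, self_cuda_total)             # 35.43
avg_kernel_us = round(self_cuda_total / kernel_calls, 3)      # 1.355
cuda_total_us = round(self_cuda_total + buffer_request, 3)    # 5.504

print(annotation_pct, buffer_pct, avg_kernel_us, cuda_total_us)
```

Under this reading, the 5.504us shown as CUDA total for `causal_conv1d_fwd` is the kernel time plus the Activity Buffer Request, and the per-call kernel average is about 1.355us.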
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S128_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     128.895us      3412.63%     128.895us     128.895us             1  
+                               hf_kernels_causal_conv1d         5.00%      84.832us        99.68%       1.692ms       1.692ms       0.000us         0.00%       5.026us       5.026us             1  
+                                         CausalConv1dFn         4.43%      75.123us        94.68%       1.607ms     535.685us       0.000us         0.00%       5.026us       1.675us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.59%      27.059us        88.41%       1.501ms     500.224us       3.777us       100.00%       5.026us       1.675us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.777us       100.00%       3.777us       1.259us             3  
+                                Activity Buffer Request        84.88%       1.441ms        84.88%       1.441ms       1.441ms       1.249us        33.07%       1.249us       1.249us             1  
+                                       aten::empty_like         0.54%       9.230us         1.84%      31.262us      10.421us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.30%      22.032us         1.30%      22.032us       7.344us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         1.94%      32.892us         1.94%      32.892us      10.964us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.32%       5.440us         0.32%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.697ms
+Self CUDA time total: 3.777us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.670us      3273.90%     124.670us     124.670us             1  
+                               hf_kernels_causal_conv1d         4.86%      81.824us        99.65%       1.679ms       1.679ms       0.000us         0.00%       5.056us       5.056us             1  
+                                         CausalConv1dFn         4.28%      72.081us        94.80%       1.598ms     532.512us       0.000us         0.00%       5.056us       1.685us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.53%      25.732us        88.63%       1.494ms     497.871us       3.808us       100.00%       5.056us       1.685us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.808us       100.00%       3.808us       1.269us             3  
+                                Activity Buffer Request        85.15%       1.435ms        85.15%       1.435ms       1.435ms       1.248us        32.77%       1.248us       1.248us             1  
+                                       aten::empty_like         0.59%       9.910us         1.89%      31.841us      10.614us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.30%      21.931us         1.30%      21.931us       7.310us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         1.96%      32.960us         1.96%      32.960us      10.987us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.35%       5.830us         0.35%       5.830us       5.830us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.685ms
+Self CUDA time total: 3.808us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S512_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     131.358us      3479.68%     131.358us     131.358us             1  
+                               hf_kernels_causal_conv1d         4.44%      83.422us        99.71%       1.875ms       1.875ms       0.000us         0.00%       5.054us       5.054us             1  
+                                         CausalConv1dFn         4.02%      75.643us        95.28%       1.792ms     597.348us       0.000us         0.00%       5.054us       1.685us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.36%      25.501us        89.54%       1.684ms     561.363us       3.775us       100.00%       5.054us       1.685us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.775us       100.00%       3.775us       1.258us             3  
+                                Activity Buffer Request        75.66%       1.423ms        75.66%       1.423ms       1.423ms       1.279us        33.88%       1.279us       1.279us             1  
+                                       aten::empty_like         0.55%      10.279us         1.72%      32.311us      10.770us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.17%      22.032us         1.17%      22.032us       7.344us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        12.52%     235.449us        12.52%     235.449us      78.483us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.29%       5.400us         0.29%       5.400us       5.400us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.881ms
+Self CUDA time total: 3.775us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.694us      2701.96%     129.694us     129.694us             1  
+                               hf_kernels_causal_conv1d         4.57%      82.923us        99.70%       1.809ms       1.809ms       0.000us         0.00%       6.432us       6.432us             1  
+                                         CausalConv1dFn         4.25%      77.065us        95.13%       1.727ms     575.517us       0.000us         0.00%       6.432us       2.144us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.43%      25.889us        89.13%       1.618ms     539.172us       4.800us       100.00%       6.432us       2.144us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.800us       100.00%       4.800us       1.600us             3  
+                                Activity Buffer Request        78.67%       1.428ms        78.67%       1.428ms       1.428ms       1.632us        34.00%       1.632us       1.632us             1  
+                                       aten::empty_like         0.53%       9.690us         1.76%      31.970us      10.657us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.23%      22.280us         1.23%      22.280us       7.427us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         9.03%     163.837us         9.03%     163.837us      54.612us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.30%       5.391us         0.30%       5.391us       5.391us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.815ms
+Self CUDA time total: 4.800us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D64_S2048_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.655us      2439.95%     118.655us     118.655us             1  
+                               hf_kernels_causal_conv1d        15.62%      77.102us        98.87%     488.177us     488.177us       0.000us         0.00%       6.495us       6.495us             1  
+                                         CausalConv1dFn        14.62%      72.193us        83.25%     411.075us     137.025us       0.000us         0.00%       6.495us       2.165us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.27%      26.040us        62.53%     308.751us     102.917us       4.863us       100.00%       6.495us       2.165us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.863us       100.00%       4.863us       1.621us             3  
+                                Activity Buffer Request        25.28%     124.815us        25.28%     124.815us     124.815us       1.632us        33.56%       1.632us       1.632us             1  
+                                       aten::empty_like         1.61%       7.949us         6.10%      30.131us      10.044us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.49%      22.182us         4.49%      22.182us       7.394us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        31.98%     157.896us        31.98%     157.896us      52.632us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.13%       5.580us         1.13%       5.580us       5.580us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 493.757us
+Self CUDA time total: 4.863us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     126.463us      1179.69%     126.463us     126.463us             1  
+                               hf_kernels_causal_conv1d         4.44%      79.793us        99.69%       1.793ms       1.793ms       0.000us         0.00%      14.304us      14.304us             1  
+                                         CausalConv1dFn         3.96%      71.252us        95.25%       1.713ms     571.037us       0.000us         0.00%      14.304us       4.768us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.37%      24.661us        89.51%       1.610ms     536.652us      10.720us       100.00%      14.304us       4.768us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.720us       100.00%      10.720us       3.573us             3  
+                                Activity Buffer Request        79.30%       1.426ms        79.30%       1.426ms       1.426ms       3.584us        33.43%       3.584us       3.584us             1  
+                                       aten::empty_like         0.54%       9.750us         1.77%      31.901us      10.634us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.23%      22.151us         1.23%      22.151us       7.384us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         8.84%     159.036us         8.84%     159.036us      53.012us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.31%       5.660us         0.31%       5.660us       5.660us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.799ms
+Self CUDA time total: 10.720us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S128_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.490us      1115.98%     122.490us     122.490us             1  
+                               hf_kernels_causal_conv1d        17.58%      82.141us        98.94%     462.145us     462.145us       0.000us         0.00%      14.656us      14.656us             1  
+                                         CausalConv1dFn        15.46%      72.195us        81.35%     380.004us     126.668us       0.000us         0.00%      14.656us       4.885us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.51%      25.720us        59.56%     278.229us      92.743us      10.976us       100.00%      14.656us       4.885us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      10.976us       100.00%      10.976us       3.659us             3  
+                                Activity Buffer Request        20.67%      96.553us        20.67%      96.553us      96.553us       3.680us        33.53%       3.680us       3.680us             1  
+                                       aten::empty_like         1.79%       8.340us         6.33%      29.580us       9.860us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.55%      21.240us         4.55%      21.240us       7.080us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        33.39%     155.956us        33.39%     155.956us      51.985us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.06%       4.970us         1.06%       4.970us       4.970us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 467.115us
+Self CUDA time total: 10.976us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     128.671us      1165.50%     128.671us     128.671us             1  
+                               hf_kernels_causal_conv1d         4.51%      81.351us        99.72%       1.798ms       1.798ms       0.000us         0.00%      14.784us      14.784us             1  
+                                         CausalConv1dFn         4.05%      73.093us        95.21%       1.717ms     572.174us       0.000us         0.00%      14.784us       4.928us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.34%      24.081us        89.39%       1.612ms     537.183us      11.040us       100.00%      14.784us       4.928us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.040us       100.00%      11.040us       3.680us             3  
+                                Activity Buffer Request        79.34%       1.430ms        79.34%       1.430ms       1.430ms       3.744us        33.91%       3.744us       3.744us             1  
+                                       aten::empty_like         0.49%       8.921us         1.77%      31.881us      10.627us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.27%      22.960us         1.27%      22.960us       7.653us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         8.72%     157.177us         8.72%     157.177us      52.392us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.28%       4.970us         0.28%       4.970us       4.970us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.803ms
+Self CUDA time total: 11.040us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S512_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.762us      1085.65%     125.762us     125.762us             1  
+                               hf_kernels_causal_conv1d        16.83%      79.002us        98.82%     463.887us     463.887us       0.000us         0.00%      15.360us      15.360us             1  
+                                         CausalConv1dFn        15.62%      73.323us        81.99%     384.885us     128.295us       0.000us         0.00%      15.360us       5.120us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.37%      25.230us        59.95%     281.430us      93.810us      11.584us       100.00%      15.360us       5.120us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      11.584us       100.00%      11.584us       3.861us             3  
+                                Activity Buffer Request        20.79%      97.593us        20.79%      97.593us      97.593us       3.776us        32.60%       3.776us       3.776us             1  
+                                       aten::empty_like         1.82%       8.531us         6.42%      30.132us      10.044us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.60%      21.601us         4.60%      21.601us       7.200us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        33.79%     158.607us        33.79%     158.607us      52.869us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.18%       5.530us         1.18%       5.530us       5.530us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 469.417us
+Self CUDA time total: 11.584us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     134.046us       264.80%     134.046us     134.046us             1  
+                               hf_kernels_causal_conv1d         4.19%      76.942us        99.71%       1.832ms       1.832ms       0.000us         0.00%      84.285us      84.285us             1  
+                                         CausalConv1dFn         4.10%      75.381us        95.52%       1.755ms     585.044us       0.000us         0.00%      84.285us      28.095us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.30%      23.952us        89.70%       1.648ms     549.413us      50.622us       100.00%      84.285us      28.095us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      50.622us       100.00%      50.622us      16.874us             3  
+                                Activity Buffer Request        78.71%       1.446ms        78.71%       1.446ms       1.446ms      33.663us        66.50%      33.663us      33.663us             1  
+                                       aten::empty_like         0.54%       9.991us         1.71%      31.512us      10.504us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.17%      21.521us         1.17%      21.521us       7.174us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         9.69%     177.966us         9.69%     177.966us      59.322us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.29%       5.380us         0.29%       5.380us       5.380us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.837ms
+Self CUDA time total: 50.622us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B2_D2048_S2048_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     124.639us       241.17%     124.639us     124.639us             1  
+                               hf_kernels_causal_conv1d        12.15%      73.652us        99.08%     600.632us     600.632us       0.000us         0.00%      86.272us      86.272us             1  
+                                         CausalConv1dFn        11.76%      71.283us        86.93%     526.980us     175.660us       0.000us         0.00%      86.272us      28.757us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.05%      24.580us        70.27%     425.965us     141.988us      51.680us       100.00%      86.272us      28.757us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      51.680us       100.00%      51.680us      17.227us             3  
+                                Activity Buffer Request        38.62%     234.139us        38.62%     234.139us     234.139us      34.592us        66.93%      34.592us      34.592us             1  
+                                       aten::empty_like         1.31%       7.952us         4.90%      29.732us       9.911us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         3.59%      21.780us         3.59%      21.780us       7.260us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        27.59%     167.246us        27.59%     167.246us      55.749us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.92%       5.560us         0.92%       5.560us       5.560us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 606.192us
+Self CUDA time total: 51.680us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.184us      3001.64%     117.184us     117.184us             1  
+                               hf_kernels_causal_conv1d        11.99%      71.634us        99.07%     591.661us     591.661us       0.000us         0.00%       5.152us       5.152us             1  
+                                         CausalConv1dFn        11.65%      69.552us        87.08%     520.027us     173.342us       0.000us         0.00%       5.152us       1.717us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.09%      24.400us        70.30%     419.834us     139.945us       3.904us       100.00%       5.152us       1.717us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.904us       100.00%       3.904us       1.301us             3  
+                                Activity Buffer Request        39.52%     236.029us        39.52%     236.029us     236.029us       1.248us        31.97%       1.248us       1.248us             1  
+                                       aten::empty_like         1.39%       8.281us         5.13%      30.641us      10.214us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         3.74%      22.360us         3.74%      22.360us       7.453us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        26.69%     159.405us        26.69%     159.405us      53.135us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.93%       5.550us         0.93%       5.550us       5.550us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 597.211us
+Self CUDA time total: 3.904us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S128_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     129.214us      3308.94%     129.214us     129.214us             1  
+                               hf_kernels_causal_conv1d        14.44%      74.841us        98.93%     512.678us     512.678us       0.000us         0.00%       5.154us       5.154us             1  
+                                         CausalConv1dFn        14.14%      73.283us        84.49%     437.837us     145.946us       0.000us         0.00%       5.154us       1.718us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         6.57%      34.031us        64.55%     334.472us     111.491us       3.905us       100.00%       5.154us       1.718us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       3.905us       100.00%       3.905us       1.302us             3  
+                                Activity Buffer Request        27.83%     144.225us        27.83%     144.225us     144.225us       1.249us        31.98%       1.249us       1.249us             1  
+                                       aten::empty_like         1.69%       8.750us         5.81%      30.082us      10.027us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.12%      21.332us         4.12%      21.332us       7.111us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        30.15%     156.216us        30.15%     156.216us      52.072us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.07%       5.520us         1.07%       5.520us       5.520us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 518.198us
+Self CUDA time total: 3.905us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     118.525us      2939.61%     118.525us     118.525us             1  
+                               hf_kernels_causal_conv1d        13.97%      75.404us        99.13%     534.960us     534.960us       0.000us         0.00%       5.376us       5.376us             1  
+                                         CausalConv1dFn        13.10%      70.683us        85.16%     459.556us     153.185us       0.000us         0.00%       5.376us       1.792us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.73%      25.549us        66.42%     358.442us     119.481us       4.032us       100.00%       5.376us       1.792us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
+                                Activity Buffer Request        32.81%     177.046us        32.81%     177.046us     177.046us       1.344us        33.33%       1.344us       1.344us             1  
+                                       aten::empty_like         1.62%       8.721us         5.64%      30.431us      10.144us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.02%      21.710us         4.02%      21.710us       7.237us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        28.88%     155.847us        28.88%     155.847us      51.949us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.87%       4.710us         0.87%       4.710us       4.710us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 539.670us
+Self CUDA time total: 4.032us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S512_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     115.905us      2852.70%     115.905us     115.905us             1  
+                               hf_kernels_causal_conv1d        16.16%      74.143us        98.83%     453.315us     453.315us       0.000us         0.00%       5.407us       5.407us             1  
+                                         CausalConv1dFn        14.93%      68.471us        82.67%     379.172us     126.391us       0.000us         0.00%       5.407us       1.802us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.63%      25.811us        61.32%     281.280us      93.760us       4.063us       100.00%       5.407us       1.802us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       4.063us       100.00%       4.063us       1.354us             3  
+                                Activity Buffer Request        21.83%     100.113us        21.83%     100.113us     100.113us       1.344us        33.08%       1.344us       1.344us             1  
+                                       aten::empty_like         1.88%       8.641us         6.41%      29.421us       9.807us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.53%      20.780us         4.53%      20.780us       6.927us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        33.87%     155.356us        33.87%     155.356us      51.785us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.17%       5.370us         1.17%       5.370us       5.370us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 458.685us
+Self CUDA time total: 4.063us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.141us      2271.97%     122.141us     122.141us             1  
+                               hf_kernels_causal_conv1d        11.82%      75.911us        99.15%     636.712us     636.712us       0.000us         0.00%       7.200us       7.200us             1  
+                                         CausalConv1dFn        11.01%      70.722us        87.33%     560.801us     186.934us       0.000us         0.00%       7.200us       2.400us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.24%      27.210us        71.66%     460.136us     153.379us       5.376us       100.00%       7.200us       2.400us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.376us       100.00%       5.376us       1.792us             3  
+                                Activity Buffer Request        43.06%     276.540us        43.06%     276.540us     276.540us       1.824us        33.93%       1.824us       1.824us             1  
+                                       aten::empty_like         1.25%       8.002us         4.66%      29.943us       9.981us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         3.42%      21.941us         3.42%      21.941us       7.314us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        24.35%     156.386us        24.35%     156.386us      52.129us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.85%       5.440us         0.85%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 642.152us
+Self CUDA time total: 5.376us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D64_S2048_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     117.822us      2140.66%     117.822us     117.822us             1  
+                               hf_kernels_causal_conv1d        16.30%      72.964us        98.80%     442.326us     442.326us       0.000us         0.00%       7.392us       7.392us             1  
+                                         CausalConv1dFn        16.19%      72.472us        82.50%     369.362us     123.121us       0.000us         0.00%       7.392us       2.464us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.63%      25.211us        59.71%     267.319us      89.106us       5.504us       100.00%       7.392us       2.464us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us       5.504us       100.00%       5.504us       1.835us             3  
+                                Activity Buffer Request        19.35%      86.632us        19.35%      86.632us      86.632us       1.888us        34.30%       1.888us       1.888us             1  
+                                       aten::empty_like         1.85%       8.281us         6.60%      29.571us       9.857us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.76%      21.290us         4.76%      21.290us       7.097us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        34.73%     155.476us        34.73%     155.476us      51.825us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.20%       5.391us         1.20%       5.391us       5.391us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 447.717us
+Self CUDA time total: 5.504us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     125.728us       716.97%     125.728us     125.728us             1  
+                               hf_kernels_causal_conv1d        11.80%      75.821us        99.14%     637.002us     637.002us       0.000us         0.00%      23.392us      23.392us             1  
+                                         CausalConv1dFn        11.24%      72.243us        87.34%     561.181us     187.060us       0.000us         0.00%      23.392us       7.797us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         4.08%      26.210us        71.24%     457.746us     152.582us      17.536us       100.00%      23.392us       7.797us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.536us       100.00%      17.536us       5.845us             3  
+                                Activity Buffer Request        42.92%     275.770us        42.92%     275.770us     275.770us       5.856us        33.39%       5.856us       5.856us             1  
+                                       aten::empty_like         1.45%       9.311us         4.85%      31.192us      10.397us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         3.41%      21.881us         3.41%      21.881us       7.294us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        24.24%     155.766us        24.24%     155.766us      51.922us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.86%       5.550us         0.86%       5.550us       5.550us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 642.552us
+Self CUDA time total: 17.536us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S128_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     123.901us       690.22%     123.901us     123.901us             1  
+                               hf_kernels_causal_conv1d        16.99%      75.711us        98.78%     440.245us     440.245us       0.000us         0.00%      23.967us      23.967us             1  
+                                         CausalConv1dFn        15.81%      70.471us        81.79%     364.534us     121.511us       0.000us         0.00%      23.967us       7.989us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.65%      25.192us        59.40%     264.751us      88.250us      17.951us       100.00%      23.967us       7.989us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      17.951us       100.00%      17.951us       5.984us             3  
+                                Activity Buffer Request        18.53%      82.593us        18.53%      82.593us      82.593us       6.016us        33.51%       6.016us       6.016us             1  
+                                       aten::empty_like         1.75%       7.802us         6.58%      29.312us       9.771us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.83%      21.510us         4.83%      21.510us       7.170us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        35.22%     156.966us        35.22%     156.966us      52.322us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.22%       5.440us         1.22%       5.440us       5.440us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 445.685us
+Self CUDA time total: 17.951us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     131.804us       730.34%     131.804us     131.804us             1  
+                               hf_kernels_causal_conv1d        11.57%      77.592us        99.18%     665.133us     665.133us       0.000us         0.00%      24.094us      24.094us             1  
+                                         CausalConv1dFn        10.93%      73.321us        87.61%     587.541us     195.847us       0.000us         0.00%      24.094us       8.031us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         3.40%      22.811us        71.94%     482.478us     160.826us      18.047us       100.00%      24.094us       8.031us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      18.047us       100.00%      18.047us       6.016us             3  
+                                Activity Buffer Request        44.54%     298.731us        44.54%     298.731us     298.731us       6.047us        33.51%       6.047us       6.047us             1  
+                                       aten::empty_like         1.35%       9.049us         4.73%      31.742us      10.581us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         3.38%      22.693us         3.38%      22.693us       7.564us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        24.00%     160.936us        24.00%     160.936us      53.645us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.82%       5.510us         0.82%       5.510us       5.510us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 670.643us
+Self CUDA time total: 18.047us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S512_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     122.267us       637.87%     122.267us     122.267us             1  
+                               hf_kernels_causal_conv1d        16.94%      75.003us        98.82%     437.665us     437.665us       0.000us         0.00%      25.632us      25.632us             1  
+                                         CausalConv1dFn        15.90%      70.409us        81.89%     362.662us     120.887us       0.000us         0.00%      25.632us       8.544us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.97%      26.462us        59.15%     261.981us      87.327us      19.168us       100.00%      25.632us       8.544us             3  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      19.168us       100.00%      19.168us       6.389us             3  
+                                Activity Buffer Request        18.04%      79.883us        18.04%      79.883us      79.883us       6.464us        33.72%       6.464us       6.464us             1  
+                                       aten::empty_like         2.06%       9.102us         6.84%      30.272us      10.091us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.78%      21.170us         4.78%      21.170us       7.057us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        35.14%     155.636us        35.14%     155.636us      51.879us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.18%       5.220us         1.18%       5.220us       5.220us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 442.885us
+Self CUDA time total: 19.168us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W2
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d         4.25%      77.621us        99.69%       1.822ms       1.822ms       0.000us         0.00%     163.007us     163.007us             1  
+                                         CausalConv1dFn         4.18%      76.374us        95.44%       1.744ms     581.328us       0.000us         0.00%     163.007us      54.336us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         1.34%      24.550us        89.50%       1.636ms     545.169us      97.983us       100.00%     163.007us      54.336us             3  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     142.719us       145.66%     142.719us     142.719us             1  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      97.983us       100.00%      97.983us      32.661us             3  
+                                Activity Buffer Request        79.33%       1.450ms        79.33%       1.450ms       1.450ms      65.024us        66.36%      65.024us      65.024us             1  
+                                       aten::empty_like         0.51%       9.271us         1.76%      32.102us      10.701us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         1.25%      22.831us         1.25%      22.831us       7.610us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         8.83%     161.275us         8.83%     161.275us      53.758us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.31%       5.740us         0.31%       5.740us       5.740us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.827ms
+Self CUDA time total: 97.983us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_causal_conv1d | cuda_B4_D2048_S2048_W4
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                               hf_kernels_causal_conv1d        17.00%      78.131us        98.89%     454.476us     454.476us       0.000us         0.00%     164.440us     164.440us             1  
+                                         CausalConv1dFn        15.89%      73.024us        81.89%     376.345us     125.448us       0.000us         0.00%     164.440us      54.813us             3  
+              _causal_conv1d_90f5a60::causal_conv1d_fwd         5.76%      26.451us        59.63%     274.060us      91.353us      98.939us       100.00%     164.440us      54.813us             3  
+                               hf_kernels_causal_conv1d         0.00%       0.000us         0.00%       0.000us       0.000us     139.130us       140.62%     139.130us     139.130us             1  
+void causal_conv1d_fwd_kernel<Causal_conv1d_fwd_kern...         0.00%       0.000us         0.00%       0.000us       0.000us      98.939us       100.00%      98.939us      32.980us             3  
+                                Activity Buffer Request        18.20%      83.643us        18.20%      83.643us      83.643us      65.501us        66.20%      65.501us      65.501us             1  
+                                       aten::empty_like         1.75%       8.030us         6.37%      29.261us       9.754us       0.000us         0.00%       0.000us       0.000us             3  
+                                    aten::empty_strided         4.62%      21.231us         4.62%      21.231us       7.077us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        35.68%     163.966us        35.68%     163.966us      54.655us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.11%       5.111us         1.11%       5.111us       5.111us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 459.587us
+Self CUDA time total: 98.939us
+
+
+impl                      wl                      p50(ms)  ok
+hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W2      0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D2048_S128_W4      0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W2     0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D2048_S2048_W4     0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W2      0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D2048_S512_W4      0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S128_W2        0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S128_W4        0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W2       0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S2048_W4       0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S512_W2        0.05  True
+hf_kernels_causal_conv1d  cuda_B2_D64_S512_W4        0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W2      0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S128_W4      0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W2     0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S2048_W4     0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W2      0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D2048_S512_W4      0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S128_W2        0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S128_W4        0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W2       0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S2048_W4       0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S512_W2        0.05  True
+hf_kernels_causal_conv1d  cuda_B4_D64_S512_W4        0.05  True
+
+
+

Artifacts:

+causal_conv1d.jsonl +
+
+
+
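The workload labels in the summary table (for example `cuda_B4_D2048_S512_W2`) follow a regular pattern, so they can be split into numeric fields for sorting or plotting. The sketch below is illustrative only: the helper names `Workload`, `parse_workload`, and `parse_summary_row` are hypothetical, and reading `B`, `D`, `S`, `W` as batch size, channel dimension, sequence length, and convolution width is an assumption that the page itself does not state.

```python
import re
from typing import NamedTuple


class Workload(NamedTuple):
    # Field meanings are assumed from the label letters, not confirmed by the source.
    device: str
    batch: int    # "B" (assumed: batch size)
    dim: int      # "D" (assumed: channel dimension)
    seqlen: int   # "S" (assumed: sequence length)
    width: int    # "W" (assumed: convolution kernel width)


_WORKLOAD_RE = re.compile(
    r"^(?P<device>[a-z]+)_B(?P<b>\d+)_D(?P<d>\d+)_S(?P<s>\d+)_W(?P<w>\d+)$"
)


def parse_workload(name: str) -> Workload:
    """Split a label such as 'cuda_B4_D2048_S512_W2' into its parts."""
    match = _WORKLOAD_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized workload label: {name!r}")
    return Workload(
        match["device"],
        int(match["b"]),
        int(match["d"]),
        int(match["s"]),
        int(match["w"]),
    )


def parse_summary_row(line: str) -> tuple[str, Workload, float, bool]:
    """Parse one 'impl wl p50(ms) ok' row; a leading '+' is tolerated."""
    impl, wl, p50, ok = line.lstrip("+").split()
    return impl, parse_workload(wl), float(p50), ok == "True"


if __name__ == "__main__":
    row = "hf_kernels_causal_conv1d cuda_B4_D2048_S512_W2     0.05  True"
    print(parse_summary_row(row))
```

Parsing the labels numerically also avoids the lexicographic ordering seen in the table, where `S2048` sorts before `S512`.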