
HF Kernels - Deformable DETR


GPU Info

Cell: nv | 0.23s

import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

Fri Oct 31 20:13:34 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   43C    P0             83W /  350W |       0MiB /  46068MiB |     60%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Deformable DETR Multi-Scale Deformable Attention Benchmark

Cell: benchmark | 8.30s

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
from kernels import get_kernel
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark

# Load the deformable DETR kernel from the Hugging Face Hub
deformable_detr = get_kernel("kernels-community/deformable-detr")


def hf_kernels_deformable_detr(
    value, spatial_shapes, level_start_index, sampling_locations, attention_weights, im2col_step=64
):
    """HuggingFace Kernels Deformable DETR Multi-Scale Deformable Attention"""
    # The kernel names two arguments differently from the benchmark harness:
    # sampling_locations -> sampling_loc, attention_weights -> attn_weight.
    return deformable_detr.ms_deform_attn_forward(
        value=value,
        spatial_shapes=spatial_shapes,
        level_start_index=level_start_index,
        sampling_loc=sampling_locations,
        attn_weight=attention_weights,
        im2col_step=im2col_step,
    )


# Register the wrapper with the benchmark harness and run it in float32
run_benchmark(
    kernel_type=KernelTypeEnum.DEFORMABLE_DETR,
    impl_name="hf_kernels_deformable_detr",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_deformable_detr,
    dtype="float32",
)

Running deformable_detr benchmark on cuda with 4 workloads.

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     195.201us       770.15%     195.201us     195.201us             1
                             hf_kernels_deformable_detr         7.43%     141.524us        99.61%       1.898ms       1.898ms       0.000us         0.00%      26.403us      26.403us             1
       _deformable_detr_57c3d32::ms_deform_attn_forward         3.93%      74.960us        92.19%       1.756ms     585.455us      22.464us        88.63%      26.403us       8.801us             3
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.464us        88.63%      22.464us       7.488us             3
                                            aten::zeros         1.20%      22.800us        85.08%       1.621ms     540.337us       0.000us         0.00%       3.939us       1.313us             3
                                            aten::zero_         0.89%      16.910us        82.13%       1.565ms     521.590us       0.000us         0.00%       3.939us       1.313us             3
                                            aten::fill_         1.72%      32.820us        81.24%       1.548ms     515.953us       2.882us        11.37%       3.939us       1.313us             3
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.882us        11.37%       2.882us       0.961us             3
                                Activity Buffer Request        77.24%       1.472ms        77.24%       1.472ms       1.472ms       1.057us         4.17%       1.057us       1.057us             1
                                            aten::empty         1.76%      33.441us         1.76%      33.441us      11.147us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         3.19%      60.842us         3.19%      60.842us      10.140us       0.000us         0.00%       0.000us       0.000us             6
                                             aten::view         0.89%      16.922us         0.89%      16.922us       2.820us       0.000us         0.00%       0.000us       0.000us             6
                                           aten::select         1.13%      21.591us         1.37%      26.081us       8.694us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::as_strided         0.24%       4.490us         0.24%       4.490us       1.497us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaDeviceSynchronize         0.39%       7.340us         0.39%       7.340us       7.340us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.905ms
Self CUDA time total: 25.346us

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B1_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     144.191us       546.22%     144.191us     144.191us             1
                             hf_kernels_deformable_detr         4.39%      75.912us        99.67%       1.722ms       1.722ms       0.000us         0.00%      27.358us      27.358us             1
       _deformable_detr_57c3d32::ms_deform_attn_forward         2.01%      34.700us        95.28%       1.646ms     548.647us      23.550us        89.21%      27.358us       9.119us             3
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      23.550us        89.21%      23.550us       7.850us             3
                                            aten::zeros         0.49%       8.451us        91.07%       1.573ms     524.424us       0.000us         0.00%       3.808us       1.269us             3
                                            aten::zero_         0.50%       8.669us        89.54%       1.547ms     515.616us       0.000us         0.00%       3.808us       1.269us             3
                                            aten::fill_         1.60%      27.701us        89.04%       1.538ms     512.727us       2.848us        10.79%       3.808us       1.269us             3
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.848us        10.79%       2.848us       0.949us             3
                                Activity Buffer Request        85.90%       1.484ms        85.90%       1.484ms       1.484ms       0.960us         3.64%       0.960us       0.960us             1
                                            aten::empty         1.04%      17.971us         1.04%      17.971us       5.990us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.40%      41.442us         2.40%      41.442us       6.907us       0.000us         0.00%       0.000us       0.000us             6
                                             aten::view         0.54%       9.400us         0.54%       9.400us       1.567us       0.000us         0.00%       0.000us       0.000us             6
                                           aten::select         0.66%      11.329us         0.79%      13.720us       4.573us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::as_strided         0.14%       2.391us         0.14%       2.391us       0.797us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaDeviceSynchronize         0.33%       5.680us         0.33%       5.680us       5.680us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.728ms
Self CUDA time total: 26.398us

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q100_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     140.288us       549.37%     140.288us     140.288us             1
                             hf_kernels_deformable_detr         4.34%      74.492us        99.67%       1.709ms       1.709ms       0.000us         0.00%      26.464us      26.464us             1
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.96%      33.680us        95.32%       1.635ms     544.984us      22.752us        89.10%      26.464us       8.821us             3
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      22.752us        89.10%      22.752us       7.584us             3
                                            aten::zeros         0.50%       8.650us        91.19%       1.564ms     521.367us       0.000us         0.00%       3.712us       1.237us             3
                                            aten::zero_         0.47%       8.130us        89.69%       1.538ms     512.773us       0.000us         0.00%       3.712us       1.237us             3
                                            aten::fill_         1.63%      27.881us        89.21%       1.530ms     510.063us       2.784us        10.90%       3.712us       1.237us             3
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.784us        10.90%       2.784us       0.928us             3
                                Activity Buffer Request        86.04%       1.476ms        86.04%       1.476ms       1.476ms       0.928us         3.63%       0.928us       0.928us             1
                                            aten::empty         1.00%      17.131us         1.00%      17.131us       5.710us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel         2.42%      41.510us         2.42%      41.510us       6.918us       0.000us         0.00%       0.000us       0.000us             6
                                             aten::view         0.52%       8.991us         0.52%       8.991us       1.498us       0.000us         0.00%       0.000us       0.000us             6
                                           aten::select         0.62%      10.681us         0.77%      13.291us       4.430us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::as_strided         0.15%       2.610us         0.15%       2.610us       0.870us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaDeviceSynchronize         0.33%       5.730us         0.33%       5.730us       5.730us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.715ms
Self CUDA time total: 25.536us

======================================================================
PROFILE TRACE: hf_kernels_deformable_detr | cuda_B2_Q300_H8_E256_L4_P4
======================================================================
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                             hf_kernels_deformable_detr         0.00%       0.000us         0.00%       0.000us       0.000us     151.934us       322.76%     151.934us     151.934us             1
                             hf_kernels_deformable_detr         3.86%      74.313us        99.75%       1.919ms       1.919ms       0.000us         0.00%      48.129us      48.129us             1
       _deformable_detr_57c3d32::ms_deform_attn_forward         1.79%      34.420us        95.88%       1.844ms     614.769us      43.968us        93.40%      48.129us      16.043us             3
void ms_deformable_im2col_gpu_kernel<float>(int, flo...         0.00%       0.000us         0.00%       0.000us       0.000us      43.968us        93.40%      43.968us      14.656us             3
                                            aten::zeros         0.45%       8.600us        92.03%       1.770ms     590.092us       0.000us         0.00%       4.161us       1.387us             3
                                            aten::zero_         0.45%       8.690us        90.72%       1.745ms     581.642us       0.000us         0.00%       4.161us       1.387us             3
                                            aten::fill_         1.44%      27.641us        90.26%       1.736ms     578.745us       3.105us         6.60%       4.161us       1.387us             3
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       3.105us         6.60%       3.105us       1.035us             3
                                Activity Buffer Request        76.84%       1.478ms        76.84%       1.478ms       1.478ms       1.056us         2.24%       1.056us       1.056us             1
                                            aten::empty         0.87%      16.750us         0.87%      16.750us       5.583us       0.000us         0.00%       0.000us       0.000us             3
                                       cudaLaunchKernel        12.74%     245.037us        12.74%     245.037us      40.839us       0.000us         0.00%       0.000us       0.000us             6
                                             aten::view         0.49%       9.420us         0.49%       9.420us       1.570us       0.000us         0.00%       0.000us       0.000us             6
                                           aten::select         0.66%      12.781us         0.82%      15.781us       5.260us       0.000us         0.00%       0.000us       0.000us             3
                                       aten::as_strided         0.16%       3.000us         0.16%       3.000us       1.000us       0.000us         0.00%       0.000us       0.000us             3
                                  cudaDeviceSynchronize         0.25%       4.890us         0.25%       4.890us       4.890us       0.000us         0.00%       0.000us       0.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.924ms
Self CUDA time total: 47.073us

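Two details in the traces are easy to misread: the first row of each table reports a Self CUDA % far above 100%, and the kernel rows report totals over three calls, not a single launch. The sketch below reproduces the arithmetic with numbers copied from the cuda_B1_Q100_H8_E256_L4_P4 trace. Reading the first row as the GPU-side span of the whole profiled region (idle gaps between kernels included) is an assumption about how the profiler annotates ranges; the log itself does not say so.

```python
# Numbers copied from the cuda_B1_Q100_H8_E256_L4_P4 trace, in microseconds.
self_cuda_total_us = 25.346  # "Self CUDA time total"
range_span_us = 195.201      # first hf_kernels_deformable_detr row
im2col_total_us = 22.464     # ms_deformable_im2col_gpu_kernel, summed over 3 calls
fill_total_us = 2.882        # vectorized_elementwise_kernel (zero fill), 3 calls
calls = 3


def pct(part, whole):
    """Percentage of `whole`, rounded to two decimals like the table."""
    return round(100.0 * part / whole, 2)


span_pct = pct(range_span_us, self_cuda_total_us)
im2col_pct = pct(im2col_total_us, self_cuda_total_us)
fill_pct = pct(fill_total_us, self_cuda_total_us)
per_call_us = round(im2col_total_us / calls, 3)

print(span_pct, im2col_pct, fill_pct, per_call_us)
# 770.15 88.63 11.37 7.488
```

The two kernel rows add up to the reported Self CUDA total (22.464 + 2.882 = 25.346 us), so the percentage denominator covers kernel time only, while the 195.201 us row is measured against that same denominator and therefore exceeds 100%.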
impl                        wl                          p50(ms)  ok
hf_kernels_deformable_detr  cuda_B1_Q100_H8_E256_L4_P4     0.04  True
hf_kernels_deformable_detr  cuda_B1_Q300_H8_E256_L4_P4     0.05  True
hf_kernels_deformable_detr  cuda_B2_Q100_H8_E256_L4_P4     0.05  True
hf_kernels_deformable_detr  cuda_B2_Q300_H8_E256_L4_P4     0.05  True
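
The workload labels pack the problem size into one string. The sketch below splits a label into named fields, assuming the letters stand for the usual Deformable DETR hyperparameters (B batch size, Q queries, H heads, E embedding dimension, L feature levels, P sampling points per level). The output does not define them, so the field names and the helper functions `parse_workload` and `sample_count` are illustrative only.

```python
import re

# Assumed meaning of each letter; the benchmark output does not define them.
FIELDS = {
    "B": "batch",
    "Q": "queries",
    "H": "heads",
    "E": "embed_dim",
    "L": "levels",
    "P": "points",
}


def parse_workload(name: str) -> dict:
    """Split a label such as 'cuda_B1_Q100_H8_E256_L4_P4' into named fields."""
    device, *parts = name.split("_")
    parsed = {"device": device}
    for part in parts:
        match = re.fullmatch(r"([A-Z])(\d+)", part)
        if match is None or match.group(1) not in FIELDS:
            raise ValueError(f"unrecognized workload component: {part!r}")
        parsed[FIELDS[match.group(1)]] = int(match.group(2))
    return parsed


def sample_count(wl: dict) -> int:
    """Bilinear samples per forward pass under the assumed field meanings."""
    return wl["batch"] * wl["queries"] * wl["heads"] * wl["levels"] * wl["points"]


small = parse_workload("cuda_B1_Q100_H8_E256_L4_P4")
large = parse_workload("cuda_B2_Q300_H8_E256_L4_P4")
print(sample_count(small), sample_count(large))
# 12800 76800
```

Under that reading, the largest workload takes six times as many samples as the smallest, while the average im2col kernel time in the traces only about doubles (7.488 us to 14.656 us), which suggests the smaller workloads are dominated by fixed launch and latency costs.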
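
The ok column presumably records a correctness check against some reference, although the log does not show which one. As a point of comparison, here is a minimal pure-PyTorch sketch of multi-scale deformable attention built on `torch.nn.functional.grid_sample`, a common way to express the operation. It is not the kernel's code or the benchmark's reference: the function name, the tensor layouts in the docstring, and the feature-map sizes in the demo are assumptions.

```python
import torch
import torch.nn.functional as F


def ms_deform_attn_reference(value, spatial_shapes, sampling_locations, attention_weights):
    """Pure-PyTorch multi-scale deformable attention (reference sketch).

    value:              (B, sum(h*w), H, D)
    spatial_shapes:     L pairs (h, w), as a list or an (L, 2) tensor
    sampling_locations: (B, Q, H, L, P, 2), normalized (x, y) in [0, 1]
    attention_weights:  (B, Q, H, L, P)
    returns:            (B, Q, H*D)
    """
    shapes = [(int(h), int(w)) for h, w in spatial_shapes]
    B, _, H, D = value.shape
    _, Q, _, L, P, _ = sampling_locations.shape
    per_level = value.split([h * w for h, w in shapes], dim=1)
    grids = 2.0 * sampling_locations - 1.0  # grid_sample expects [-1, 1]
    sampled = []
    for lvl, (h, w) in enumerate(shapes):
        # (B, h*w, H, D) -> (B*H, D, h, w)
        v = per_level[lvl].flatten(2).transpose(1, 2).reshape(B * H, D, h, w)
        # (B, Q, H, P, 2) -> (B*H, Q, P, 2)
        g = grids[:, :, :, lvl].transpose(1, 2).flatten(0, 1)
        sampled.append(
            F.grid_sample(v, g, mode="bilinear", padding_mode="zeros", align_corners=False)
        )
    # (B, Q, H, L, P) -> (B*H, 1, Q, L*P)
    weights = attention_weights.transpose(1, 2).reshape(B * H, 1, Q, L * P)
    out = (torch.stack(sampled, dim=-2).flatten(-2) * weights).sum(-1)
    return out.view(B, H * D, Q).transpose(1, 2).contiguous()


# Demo on CPU. The sizes follow the B1_Q100_H8_E256_L4_P4 label if E = H * D;
# the feature-map shapes are hypothetical.
torch.manual_seed(0)
B, Q, H, D, L, P = 1, 100, 8, 32, 4, 4
shapes = [(32, 32), (16, 16), (8, 8), (4, 4)]
value = torch.ones(B, sum(h * w for h, w in shapes), H, D)
locations = 0.25 + 0.5 * torch.rand(B, Q, H, L, P, 2)  # keep samples in the interior
weights = torch.softmax(torch.rand(B, Q, H, L * P), dim=-1).view(B, Q, H, L, P)
out = ms_deform_attn_reference(value, shapes, locations, weights)
print(out.shape)
```

With a constant value tensor, interior sampling points, and weights that sum to one per query and head, every output element should come out as 1, which gives a quick sanity check of the tensor layout. Comparing against `hf_kernels_deformable_detr` would additionally need a CUDA device and is not shown here.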
UV Install Logs

Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 12.59it/s]

Artifacts:

deformable_detr.jsonl
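
The artifact is a JSON Lines file, one JSON document per line. Its schema is not shown on this page, so the sketch below only loads the records and makes no assumption about field names. The helper name `read_jsonl` is illustrative.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of `path`."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records


# Inspect the artifact only if it has been downloaded next to this script.
artifact = Path("deformable_detr.jsonl")
if artifact.exists():
    for record in read_jsonl(artifact):
        print(record)
```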