| --- |
| tags: |
| - kernel |
| --- |
| |
| # kernel_image_resize |
|
|
| A pure-Triton Hub kernel that fuses the **resize + rescale + normalize** preprocessing |
| pipeline run by ~150 `transformers` fast image processors (`TorchvisionBackend`: resize → |
| fold(rescale, normalize)) into a single GPU pass. It takes raw CHW `uint8` images and |
| returns the normalized `(N, C, out_h, out_w)` float tensor with no intermediate |
| full-resolution float buffer. |
|
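To make the `fold(rescale, normalize)` step concrete, here is a minimal pure-Python sketch of the arithmetic, not the kernel's code. It assumes the usual definitions, rescale as `x * rescale_factor` and normalize as `(x - mean) / std`; the helper name `fold_rescale_normalize` is hypothetical.

```python
def fold_rescale_normalize(rescale_factor, mean, std):
    """Collapse (x * rescale_factor - mean) / std into x * scale + bias, per channel."""
    scale = [rescale_factor / s for s in std]
    bias = [-m / s for m, s in zip(mean, std)]
    return scale, bias


# mean = std = 0.5 and rescale_factor = 1/255, the values used in the Usage section.
scale, bias = fold_rescale_normalize(1 / 255, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

for pixel in (0, 128, 255):
    two_step = (pixel * (1 / 255) - 0.5) / 0.5   # rescale, then normalize
    folded = pixel * scale[0] + bias[0]          # one multiply-add per pixel
    assert abs(two_step - folded) < 1e-12
```

Because the folded form is a single multiply-add per channel, it can be applied to the resampled value directly, which is what lets the whole pipeline skip a separate rescaled float image.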
|
| On a ragged SigLIP-so400m batch (A100, N=32, inputs 384–1024², out 384², bicubic+antialias) |
| the default backend runs in **1.29 ms/iter vs 3.90 ms for the fast processor (~3× faster)** |
| and 2.89 ms for torchvision's own per-image loop, at parity ≤1e-4 vs torchvision-float. |
|
|
| It ships as a `kernels` universal build variant (no compiled extension, just Triton), so it |
| loads on any CUDA PyTorch build via `get_kernel`. |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from kernels import get_kernel |
| |
| kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True) |
| |
| # a list of different-H×W uint8 CHW images (the ragged case torchvision loops over) |
| images = [torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device="cuda") |
|           for h, w in [(640, 480), (800, 600), (384, 1024)]] |
| |
| pixel_values = kir.resize_normalize( |
|     images, |
|     size=384,                # int (square), (H, W), or {"height", "width"} |
|     image_mean=[0.5, 0.5, 0.5], |
|     image_std=[0.5, 0.5, 0.5], |
|     rescale_factor=1 / 255, |
|     resample="bicubic",      # or "bilinear", or a PIL resample int |
|     antialias=True,          # match the ViT/CLIP/SigLIP default |
| ) |
| # -> (3, 3, 384, 384) float32, ready for the model |
| ``` |
|
|
| Requires `kernels >= 0.15`, since the repo is published with the `kernel` repo type. `trust_remote_code=True` is needed |
| because `Molbap` is a personal namespace, not the auto-trusted `kernels-community` org. |
|
|
| `resize_normalize` accepts a stacked `(N, C, H, W)` tensor or a ragged list of CHW |
| tensors. `resize_normalize_ragged` is the same kernel, list-only. |
|
|
| ## With a transformers processor |
|
|
| There is no `use_kernels=True` hook for image processors — that machinery swaps `nn.Module` |
| layer forwards inside the model, not processor code. Use the kernel directly with the |
| processor's config (see `example_transformers.py` for a runnable version): |
|
|
| ```python |
| from kernels import get_kernel |
| kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True) |
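| # PIL resample ints: 2 = BILINEAR, 3 = BICUBIC (0 is NEAREST, so it is not mapped); other filters raise KeyError |
| _PIL_RESAMPLE = {2: "bilinear", 3: "bicubic"} |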
| |
| def preprocess_with_kernel(processor, images): |
|     size = processor.size  # must be fixed {"height", "width"}; no crop/pad |
|     return kir.resize_normalize( |
|         images, (size["height"], size["width"]), |
|         processor.image_mean, processor.image_std, |
|         rescale_factor=float(processor.rescale_factor), |
|         resample=_PIL_RESAMPLE[int(processor.resample)], |
|         antialias=bool(getattr(processor, "antialias", True)), |
|     ) |
| ``` |
|
|
| ## Backends |
|
|
| - `backend="separable"` (default): two-pass `uint8` kernel doing `taps+taps` loads — |
| torchvision's own separable algorithm. Fastest (~3× the fast processor on the batch |
| above); parity ≤1e-4 vs torchvision-float. The float intermediate makes it more accurate |
| than, but not bit-identical to, torchvision's fixed-point `uint8` intermediate. |
| - `backend="fused"`: a single 2D launch, `taps×taps` loads per output pixel. Same parity, |
| kept as the reference path but ~9× slower than separable (the 2D float load count is the |
| reason a separable pass wins — see `benchmarks/benchmark.py`). |
|
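The gap between `taps+taps` and `taps×taps` can be estimated with a back-of-envelope count. The sketch below is an illustration, not the benchmark: the helper names are hypothetical, and the tap count per axis assumes a bicubic filter whose 4-wide support is widened by the downscale factor under antialiasing. It also ignores that the first separable pass writes an intermediate larger than the output when downscaling, so it should not be read as a prediction of the measured timings.

```python
import math


def taps_per_axis(in_size, out_size, interp_size=4):
    """Assumed antialiased tap count along one axis (interp_size=4 for bicubic)."""
    scale = in_size / out_size
    support = (interp_size / 2) * max(scale, 1.0)  # window widens only when downscaling
    return math.ceil(support) * 2 + 1


def loads_per_output_pixel(in_hw, out_hw):
    """Rough per-output-pixel load counts for the two backends."""
    taps_h = taps_per_axis(in_hw[0], out_hw[0])
    taps_w = taps_per_axis(in_hw[1], out_hw[1])
    return {"separable": taps_h + taps_w, "fused": taps_h * taps_w}


# 1024x1024 -> 384x384: 13 taps per axis, so 26 loads (separable) vs 169 (fused).
print(loads_per_output_pixel((1024, 1024), (384, 384)))
```

Under these assumptions the fused count grows quadratically with the downscale factor while the separable count grows linearly, which is the qualitative reason the separable pass wins on large inputs.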
|
| ## Parity notes |
|
|
| The resampling weights match PyTorch aten `UpSampleKernel`. Antialiased bicubic uses the |
| PIL cubic coefficient `a=-0.5`; non-antialiased bicubic uses Keys `a=-0.75`. The |
| antialias renormalize-truncate window applies on every axis, including upsampling dims. |
|
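For reference, the two coefficients plug into the same cubic convolution kernel. The sketch below is plain Python written from the standard Keys formula, not code taken from aten or from this kernel; `cubic_weight` and `bicubic_taps` are hypothetical names.

```python
def cubic_weight(x, a):
    """Keys cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def bicubic_taps(frac, a):
    """Weights of the 4 neighbours for a sample `frac` in [0, 1) past the left-centre tap."""
    return [cubic_weight(frac - k, a) for k in (-1, 0, 1, 2)]


for a in (-0.5, -0.75):  # antialiased vs non-antialiased coefficient
    taps = bicubic_taps(0.5, a)
    print(a, taps)
    assert abs(sum(taps) - 1.0) < 1e-12  # the four taps always sum to 1
```

At the half-pixel position the taps are `[-0.0625, 0.5625, 0.5625, -0.0625]` for `a=-0.5` and `[-0.09375, 0.59375, 0.59375, -0.09375]` for `a=-0.75`, so the non-antialiased coefficient has the stronger negative lobes.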
|
| ## Center crop / shortest-edge |
|
|
| Pass `crop_size` to resize then center-crop in one pass (the crop is folded into the |
| output-coordinate mapping, no extra buffer). `resize_mode="shortest_edge"` does |
| aspect-preserving resize (short side = `size`) then crop — the CLIP / DINOv2 pipeline. |
|
|
| ```python |
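| # CLIP/DINOv2-style: resize shortest edge to 256, center-crop 224 |
| # (`kir` and `images` as in the Usage section; ImageNet mean/std shown for illustration) |
| mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225] |
| pv = kir.resize_normalize(images, 256, mean, std, resample="bicubic", antialias=True, |
|                           crop_size=224, resize_mode="shortest_edge") |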
| ``` |
|
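One way to picture "folded into the output-coordinate mapping": each cropped output index is shifted by the crop offset before it is mapped back to a source coordinate, so the resized-but-uncropped image never has to exist. The sketch below is an illustration in plain Python under stated assumptions, not the kernel's code: the helper names are hypothetical, the long side is truncated to an integer, the crop offset uses floor division, and the mapping follows the half-pixel-centre convention (`align_corners=False`). The kernel's exact rounding may differ.

```python
def shortest_edge_size(h, w, size):
    """Aspect-preserving output size with the short side set to `size`."""
    if h <= w:
        return size, int(size * w / h)
    return int(size * h / w), size


def crop_to_source(o, crop, resized, src):
    """Map cropped output index `o` along one axis to a source coordinate."""
    offset = (resized - crop) // 2      # centre-crop start inside the resized axis
    scale = src / resized               # source pixels per resized pixel
    return (o + offset + 0.5) * scale - 0.5


# 640x480 image, shortest edge -> 256, centre crop 224
rh, rw = shortest_edge_size(640, 480, 256)         # (341, 256)
y0 = crop_to_source(0, 224, rh, 640)               # first cropped row, in source rows
x0 = crop_to_source(0, 224, rw, 480)               # first cropped column, in source columns
print((rh, rw), y0, x0)
```

The resampling taps are then gathered around `y0` and `x0` directly in the source image, so the crop costs only an index offset.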
|
| `example_transformers.py` derives all of this from a processor's config automatically. |
|
|
| ## Scope |
|
|
| Resize (+ optional center crop) + rescale + normalize. It does **not** pad — padding |
| processors (many detection models) run a different pipeline. The `fused` backend is |
| resize-only; crop is handled by the `separable` backend. |
|
|