Molbap HF Staff committed 11 days ago

Commit

e199518

verified ·

1 Parent(s): b26bdad

Upload folder using huggingface_hub


Files changed (18)

KERNEL_OPS.md +326 -0
README.md +108 -0
benchmarks/benchmark.py +293 -0
benchmarks/compat_check.py +127 -0
build.toml +7 -0
build/torch-universal/kernel_image_resize/__init__.py +113 -0
build/torch-universal/kernel_image_resize/_fused.py +134 -0
build/torch-universal/kernel_image_resize/_pack.py +62 -0
build/torch-universal/kernel_image_resize/_separable.py +280 -0
example.py +32 -0
example_transformers.py +85 -0
publish.sh +44 -0
resultcompat +16 -0
tests/test_resize_normalize.py +112 -0
torch-ext/kernel_image_resize/__init__.py +113 -0
torch-ext/kernel_image_resize/_fused.py +134 -0
torch-ext/kernel_image_resize/_pack.py +62 -0
torch-ext/kernel_image_resize/_separable.py +280 -0

KERNEL_OPS.md ADDED

	@@ -0,0 +1,326 @@

+# kernel_image_resize — how every op works (study notes)
+This explains the whole package end to end: the resampling math, the data layout, and
+every kernel op, plus the benchmark findings. It is meant for pen & paper — there is a
+fully worked numeric example you can reproduce by hand in the "Worked example" section.
+The package does one thing: **resize + rescale + normalize**, the op sequence a
+`transformers` fast image processor runs (`TorchvisionBackend`: resize, then
+`(x*rescale - mean)/std`), as a single GPU pipeline. Input: raw CHW `uint8` images
+(any size, ragged). Output: `(N, C, out_h, out_w)` normalized `float32`.
+---
+## 1. The one idea behind resizing
+Resizing does not "pick" pixels; each **output** pixel is a **weighted average of a small
+window of input pixels**. Two things define that average:
+1. **Where** in the input an output pixel lands (its center).
+2. **Which** input pixels are in its window, and with **what weights**.
+The weight of an input pixel falls off with distance from the center, following a filter
+curve:
+- **bilinear** → a triangle, width 1 on each side (so 2 input pixels per axis normally).
+- **bicubic** → a cubic bump, width 2 on each side (so 4 input pixels per axis normally).
+When you **shrink** an image, you must also **blur first** or you get aliasing. That is
+what `antialias=True` does: it widens the window so each output pixel averages more input
+pixels (a low-pass filter before throwing pixels away). Widening is proportional to the
+shrink factor, so shrinking 3× turns a 4-tap bicubic into ~13 taps.
+---
+## 2. The resampling-weight formula (the heart of everything)
+All kernels use the same formula, which matches PyTorch's aten `UpSampleKernel`
+(`align_corners=False`, "half-pixel" convention). For one axis:
+```
+scale       = in_size / out_size                       # > 1 means shrinking
+interp_half = 1 (bilinear)  or  2 (bicubic)            # half-width of the filter
+cubic_a     = -0.75 (no antialias)  or  -0.5 (antialias)   # the cubic curve's shape constant
+# antialias only widens the window, and only when shrinking:
+if antialias and scale > 1:
+    eff = scale                # window widens by the shrink factor
+else:
+    eff = 1                    # plain 2-tap / 4-tap interpolation
+support  = interp_half * eff   # half-window width, in INPUT pixels
+inv      = 1 / eff             # squashes the filter curve to match the widened window
+# for output index i:
+center     = scale * (i + 0.5)                 # input coordinate this output maps to
+first_tap  = floor(center - support + 0.5)     # leftmost input pixel in the window
+# for each tap t = 0, 1, 2, ... (up to MAX_TAPS):
+tap_pos    = first_tap + t                     # an input pixel index
+arg        = (tap_pos - center + 0.5) * inv    # distance from center, squashed
+weight     = filter(arg)                       # triangle or cubic, see below
+```
+The filter (`_resample_weight` in `_fused.py`), with `x = |arg|`:
+```
+bilinear:   max(1 - x, 0)                                            # triangle, zero past 1
+bicubic:    x <= 1 :  (a+2)x^3 - (a+3)x^2 + 1
+            1<x<2 :   a x^3 - 5a x^2 + 8a x - 4a
+            else  :   0
+```
+Two edge rules (both kernels do this identically):
+- **non-antialias**: clamp the tap index into `[0, in_size-1]` → replicates the border
+  pixel. The filter weights of a standard 2/4-tap interpolation already sum to 1.
+- **antialias**: instead set the weight to **0** for taps that fall off the image
+  (`tap_pos < 0` or `>= in_size`), then **renormalize** by dividing by the sum of the
+  realized weights. This keeps the average correct at the edges.
+That renormalization is why every kernel computes a `weight_sum` and divides by it. For
+the non-antialias case `weight_sum == 1`, so the division is a harmless no-op.
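The formula and the two edge rules above can be sketched in plain Python for one axis. This is a pen-and-paper aid, not the package's Triton code: the names `triangle`, `cubic`, and `taps_for_output` are hypothetical, and the tap count is assumed to be `ceil(support) * 2 + 1` as stated in section 4.

```python
import math

def triangle(x):
    """Bilinear filter: a triangle that is zero past |x| = 1."""
    return max(1.0 - abs(x), 0.0)

def cubic(x, a):
    """Cubic filter with shape constant a (-0.75 or -0.5 in section 2)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def taps_for_output(i, in_size, out_size, interp="bilinear", antialias=False):
    """Return [(input_index, normalized_weight), ...] for output index i."""
    scale = in_size / out_size
    interp_half = 1 if interp == "bilinear" else 2
    a = -0.5 if antialias else -0.75
    eff = scale if (antialias and scale > 1) else 1.0
    support = interp_half * eff
    inv = 1.0 / eff
    center = scale * (i + 0.5)
    first_tap = math.floor(center - support + 0.5)
    n_taps = math.ceil(support) * 2 + 1   # assumed fixed loop bound
    raw = []
    for t in range(n_taps):
        tap_pos = first_tap + t
        arg = (tap_pos - center + 0.5) * inv
        w = triangle(arg) if interp == "bilinear" else cubic(arg, a)
        if antialias:
            # antialias rule: taps off the image contribute nothing
            if tap_pos < 0 or tap_pos >= in_size:
                w = 0.0
            idx = tap_pos
        else:
            # non-antialias rule: clamp the index, replicating the border pixel
            idx = min(max(tap_pos, 0), in_size - 1)
        raw.append((idx, w))
    weight_sum = sum(w for _, w in raw)
    # drop zero-weight taps so the result shows only the realized window
    return [(idx, w / weight_sum) for idx, w in raw if w != 0.0]

print(taps_for_output(0, 4, 2))  # -> [(0, 0.5), (1, 0.5)]
print(taps_for_output(1, 4, 2))  # -> [(2, 0.5), (3, 0.5)]
```

With `antialias=True` and a 6 → 2 bilinear shrink, the first output pixel has one tap off the image (index -1); it is zeroed and the remaining weights are renormalized to sum to 1.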
+---
+## 3. Worked example (do this by hand)
+**bilinear, no antialias, one axis, in_size=4, out_size=2.**
+```
+scale = 4/2 = 2,  interp_half = 1,  eff = 1,  support = 1,  inv = 1
+```
+Output pixel `i = 0`:
+```
+center    = 2 * (0 + 0.5) = 1.0
+first_tap = floor(1.0 - 1 + 0.5) = floor(0.5) = 0
+t=0: tap_pos=0, arg=(0-1.0+0.5)= -0.5 -> weight = 1-0.5 = 0.5
+t=1: tap_pos=1, arg=(1-1.0+0.5)=  0.5 -> weight = 1-0.5 = 0.5
+t=2: tap_pos=2, arg=(2-1.0+0.5)=  1.5 -> weight = max(1-1.5,0) = 0
+weight_sum = 1.0
+output[0] = (0.5*in[0] + 0.5*in[1]) / 1.0       # halfway between in[0] and in[1]
+```
+Output pixel `i = 1`:
+```
+center    = 2 * 1.5 = 3.0
+first_tap = floor(3.0 - 1 + 0.5) = floor(2.5) = 2
+t=0: tap_pos=2, arg=-0.5 -> 0.5
+t=1: tap_pos=3, arg= 0.5 -> 0.5
+t=2: tap_pos=4, arg= 1.5 -> 0   (index 4 would clamp to 3, but weight is 0 anyway)
+output[1] = 0.5*in[2] + 0.5*in[3]
+```
+This 1-D operation is exactly one pass of the separable kernel. The 2-D result is the same
+formula applied on both axes (rows and columns).
+---
+## 4. Data layout (host side, `_pack.py`)
+Ragged images (different H×W) cannot be stacked into one tensor, so they are flattened and
+concatenated into one buffer, with side tables describing each image.
+`pack_images(images, dtype)` →
+```
+input_pixels : 1-D buffer, all images flattened (C,H,W row-major) and concatenated
+offsets[n]   : element index where image n starts
+heights[n], widths[n] : that image's H and W
+channels     : C (shared by all images)
+```
+Address of input pixel `(channel, row, col)` of image `n`:
+```
+input_pixels[ offsets[n] + channel*(H*W) + row*W + col ]
+```
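A toy version of this layout, using nested Python lists instead of tensors, may help when checking the address formula by hand. `pack_images_sketch` and `pixel_at` are hypothetical names for illustration; the real `pack_images` works on tensors and also takes a `dtype`.

```python
def pack_images_sketch(images):
    """images: list of C x H x W nested lists (ragged H and W, shared C)."""
    flat, offsets, heights, widths = [], [], [], []
    channels = len(images[0])
    for img in images:
        offsets.append(len(flat))        # element index where this image starts
        heights.append(len(img[0]))
        widths.append(len(img[0][0]))
        for plane in img:                # C, then H, then W: row-major
            for row in plane:
                flat.extend(row)
    return flat, offsets, heights, widths, channels

def pixel_at(flat, offsets, heights, widths, n, channel, row, col):
    """Address formula from the text, applied to image n."""
    H, W = heights[n], widths[n]
    return flat[offsets[n] + channel * (H * W) + row * W + col]

img0 = [[[1, 2]], [[3, 4]]]              # C=2, H=1, W=2
img1 = [[[5], [6]], [[7], [8]]]          # C=2, H=2, W=1
flat, offsets, heights, widths, channels = pack_images_sketch([img0, img1])
print(offsets)                                              # -> [0, 4]
print(pixel_at(flat, offsets, heights, widths, 1, 1, 0, 0)) # -> 7
```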
+The separable path packs as `uint8` (1 byte/pixel, a quarter of the memory traffic of float32).
+`fold_mean_std(mean, std, rescale)` → folds the rescale factor into the normalization
+constants so the kernel does a single `(x - m)/s`:
+```
+m = mean / rescale       s = std / rescale
+(x - m)/s  ==  (x*rescale - mean)/std      # identical to the processor's fused normalize
+```
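The identity can be checked numerically over the whole `uint8` range. This is a minimal sketch: `fold_mean_std_sketch` is a hypothetical stand-in for the package's `fold_mean_std`, and the mean/std values are the CLIP constants used as defaults in `benchmarks/benchmark.py`.

```python
def fold_mean_std_sketch(mean, std, rescale):
    """Fold the rescale factor into per-channel mean and std."""
    m = [mu / rescale for mu in mean]
    s = [sd / rescale for sd in std]
    return m, s

mean = [0.48145466, 0.4578275, 0.40821073]
std = [0.26862954, 0.26130258, 0.27577711]
rescale = 1.0 / 255.0
m, s = fold_mean_std_sketch(mean, std, rescale)

# Largest difference between the folded and unfolded forms over all byte values.
max_diff = max(
    abs((x - m[c]) / s[c] - (x * rescale - mean[c]) / std[c])
    for c in range(3)
    for x in range(256)
)
print(max_diff < 1e-9)  # -> True
```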
+`max_taps(images, out_size, axis, interp, antialias)` → the **widest** window in the batch
+= `ceil(support) * 2 + 1`. A Triton loop bound must be a compile-time constant, so every
+program loops this fixed count; taps beyond a given pixel's real window get ~0 weight.
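As a rough illustration of that bound, the sketch below applies `ceil(support) * 2 + 1` to a list of input sizes along one axis. `max_taps_sketch` is a hypothetical helper that takes plain integers; the real `max_taps` takes the images and an axis.

```python
import math

def max_taps_sketch(in_sizes, out_size, interp="bicubic", antialias=True):
    """Widest per-axis window over a batch, using the section 2 definitions."""
    interp_half = 1 if interp == "bilinear" else 2
    widest = 0
    for in_size in in_sizes:
        scale = in_size / out_size
        eff = scale if (antialias and scale > 1) else 1.0
        support = interp_half * eff
        widest = max(widest, math.ceil(support) * 2 + 1)
    return widest

# Benchmark-style batch: inputs between 384 and 1024, output 384, bicubic + antialias.
print(max_taps_sketch([384, 640, 1024], 384))                   # -> 13
# The worked example of section 3: bilinear, no antialias, 4 -> 2.
print(max_taps_sketch([4], 2, "bilinear", antialias=False))     # -> 3
```

The second result matches the three taps (`t=0..2`) walked through in the worked example, where the last one always carries zero weight.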
+`as_image_list` → accepts a stacked `(N,C,H,W)` tensor or a list, always returns a list.
+---
+## 5. Fused kernel (`_fused.py`, `backend="fused"`)
+One launch. **One program = one image + a BLOCK of its output pixels.** Each output pixel
+reads the **full 2-D window** directly: `MAX_TAPS_H × MAX_TAPS_W` input pixels.
+```
+grid = (num_images, ceil(out_h*out_w / BLOCK))
+per lane (one output pixel):
+  oy, ox            = (flat_index // out_w, flat_index % out_w)
+  center_y, center_x, first_tap_y, first_tap_x       # section 2, both axes
+  # weight_sum factorizes across axes (separable math, even though the LOADS are 2-D):
+  sum_wy = Σ_ty filter_y      ;  sum_wx = Σ_tx filter_x      ;  denom = sum_wy * sum_wx
+  for channel:
+    acc = 0
+    for ty in 0..MAX_TAPS_H:                 #  <-- the 2-D window: TAPS_H * TAPS_W loads
+      for tx in 0..MAX_TAPS_W:
+        weight = filter_y(ty) * filter_x(tx)
+        pixel  = input_pixels[channel, clamp(tap_y), clamp(tap_x)]
+        acc   += weight * pixel
+    acc = acc / denom
+    out[image, channel, oy, ox] = (acc - mean[channel]) / std[channel]
+```
+Cost per output pixel: `TAPS_H * TAPS_W` loads (e.g. 13×13 = **169**). Correct and simple,
+but the 2-D load count is what makes it slow — hence the separable version.
+---
+## 6. Separable kernel (`_separable.py`, `backend="separable"`, the default)
+Same math, but the 2-D window is done as **two 1-D passes**, with a float intermediate
+buffer in between. Loads per output pixel: `TAPS_W + TAPS_H` (e.g. 13+13 = **26**).
+```
+input_pixels (uint8, C×H×W)  --pass1-->  intermediate (float, C×H×out_w)  --pass2-->  output (float, C×out_h×out_w)
+                              resize W                               resize H + normalize
+```
+The **intermediate** is the key object: same **height** as the input, but already the
+**final width**. ("Tall and narrow.") It is also ragged in height, so it gets its own
+offset table (built in `separable_resize_normalize`, same scheme as `pack_images`).
+### Pass 1 — `_horizontal_resize_kernel` (resize width only)
+```
+grid = (num_images, ceil(H*out_w / BLOCK))     # work = every input row × every output col
+per lane:
+  input_row = flat_index // out_w     # row index, UNCHANGED by this pass
+  out_col   = flat_index %  out_w     # output column being computed
+  center_x, first_tap_x, col_weight_sum     # section 2, COLUMN axis only
+  for channel:
+    acc = 0
+    for tap in 0..MAX_TAPS_COL:                          #  <-- 1-D: only TAPS_W loads
+      weight = filter_x(tap)
+      pixel  = input_pixels[channel, input_row, clamp(tap_col)]   # uint8 -> float
+      acc   += weight * pixel
+    acc = acc / col_weight_sum
+    intermediate[channel, input_row, out_col] = acc      # NO normalize yet
+```
+Reads original `uint8` bytes; writes `float32`. No normalization here.
+### Pass 2 — `_vertical_resize_normalize_kernel` (resize height, then normalize)
+```
+grid = (num_images, ceil(out_h*out_w / BLOCK))     # work = every output pixel
+per lane:
+  out_row = flat_index // out_w
+  out_col = flat_index %  out_w
+  center_y, first_tap_y, row_weight_sum     # section 2, ROW axis only
+  for channel:
+    acc = 0
+    for tap in 0..MAX_TAPS_ROW:                          #  <-- 1-D: only TAPS_H loads
+      weight = filter_y(tap)
+      pixel  = intermediate[channel, clamp(tap_row), out_col]      # float
+      acc   += weight * pixel
+    acc = acc / row_weight_sum
+    out[image, channel, out_row, out_col] = (acc - mean[channel]) / std[channel]   # normalize here
+```
+Two launches (an implicit sync between them), so pass 2 always sees pass 1's finished
+output.
+### Why separable wins
+`TAPS_W + TAPS_H` loads instead of `TAPS_W * TAPS_H`. For a 13×13 window that is 26 vs 169.
+This is the same separable scheme PIL and torchvision use. The catch: an extra float
+intermediate buffer of shape C×H×out_w per image (more memory traffic), but the read-count
+reduction dominates.
+Parity note: the intermediate here is **float32**; torchvision keeps a **fixed-point
+uint8** intermediate. So the separable output is parity-*close* to torchvision, not
+bit-identical — and the float version is actually the more accurate one.
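That the two passes reproduce the full 2-D window can be checked on a tiny single-channel image. The sketch below is plain Python on nested lists, with taps given as `(index, weight)` pairs already normalized; `direct_2d` and `separable_2d` are hypothetical names, not the Triton kernels.

```python
def direct_2d(img, taps_y, taps_x):
    """Each output pixel reads its full 2-D window (the fused kernel's approach)."""
    return [[sum(wy * wx * img[y][x] for y, wy in ty for x, wx in tx)
             for tx in taps_x]
            for ty in taps_y]

def separable_2d(img, taps_y, taps_x):
    """Two 1-D passes with an H x out_w intermediate in between."""
    # Pass 1: resize width only; the row index is unchanged.
    inter = [[sum(wx * row[x] for x, wx in tx) for tx in taps_x] for row in img]
    # Pass 2: resize height on the intermediate.
    return [[sum(wy * inter[y][c] for y, wy in ty) for c in range(len(taps_x))]
            for ty in taps_y]

# 4x4 image with values 0..15, resized to 2x2 with the bilinear taps of section 3.
img = [[4 * r + c for c in range(4)] for r in range(4)]
taps = [[(0, 0.5), (1, 0.5)], [(2, 0.5), (3, 0.5)]]

print(direct_2d(img, taps, taps))     # -> [[2.5, 4.5], [10.5, 12.5]]
print(separable_2d(img, taps, taps))  # -> [[2.5, 4.5], [10.5, 12.5]]
```

The results agree because the 2-D weight is the product `filter_y * filter_x`, so the double sum factorizes into a sum over columns followed by a sum over rows.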
+---
+## 7. Public API (`__init__.py`)
+```
+resize_normalize(images, size, image_mean, image_std,
+                 rescale_factor=1/255, resample="bilinear", antialias=False,
+                 backend="separable", block=256)
+```
+- `images`: stacked `(N,C,H,W)` tensor or a list of CHW tensors.
+- `size`: int (square), `(H,W)`, or `{"height","width"}`.
+- `resample`: `"bilinear"`/`"bicubic"`, or a PIL resample int (0/2→bilinear, 3→bicubic; PIL's 0 is NEAREST, which is mapped to bilinear here).
+- `backend`: `"separable"` (default, fastest) or `"fused"` (2-D reference).
+- `resize_normalize_ragged`: same kernels, list-only.
+---
+## 8. Benchmark findings (A100, CUDA_VISIBLE_DEVICES=1)
+### Standalone resize+normalize — SigLIP-so400m config, N=32 ragged 384–1024², out 384², bicubic+AA
+```
+torchvision eager loop  :   2.91 ms   (per-image float loop)
+torchvision compiled    :   5.70 ms   (torch.compile dynamic, per-image; slower than eager)
+torchvision compiled pkt:   2.55 ms   (one graph over a padded stack; timing only)
+fused triton (2D)       :  11.49 ms   (taps*taps; the slow reference)
+separable triton (uint8):   1.29 ms   (taps+taps)   <-- fastest
+real processor          :   3.92 ms
+```
+**Separable is ~3× faster than the real processor** (1.29 ms vs 3.92 ms), with parity ≤1e-4
+vs torchvision-float. The fused 2-D kernel loses for the algorithmic reason above (169 vs 26
+loads). `torch.compile` does not help: per-image it is *slower* than eager (5.70 ms vs
+2.91 ms; dispatch overhead over 32 ragged shapes), and even as one packed graph it is only
+slightly ahead of the eager loop (2.55 ms vs 2.91 ms), because inductor's interpolate is no
+faster than aten resize.
+### End-to-end inference — Siglip2-base-patch16-224, **bf16** forward
+```
+                preprocess   forward(fixed input)   preprocess+forward
+processor          3.99           12.86                  14.44
+separable          0.93           13.02                  13.76     <-- ~5% faster e2e
+fused              2.00           13.01                  14.79
+compiled           6.14           12.89                  14.00
+feature parity (separable/fused/compiled vs processor): 9.38e-2 = 1.2% of feature max
+```
+- `forward(fixed input)` is the same within noise (12.86 to 13.02 ms) for all → **no inference
+  regression**; the model does not care which preprocessor made the tensor.
+- The 1.2% feature drift is the float-vs-uint8 resize difference, identical across all
+  float backends → not a bug. The float path is the more accurate one.
+- End-to-end win is ~5% with a bf16 forward (was ~0.5% with fp32, where the forward was
+  ~80 ms). **The win scales with how preprocessing-bound you are.**
+### Data path from JPEG bytes — 552 KB/img
+```
+CPU decode + torchvision resize :  177.5 ms   (status quo)
+CPU decode + separable kernel   :  176.4 ms   (kernel saves ~1 ms; decode dominates)
+GPU decode (nvJPEG) + kernel    :   14.8 ms   (fully on-GPU)
+```
+- ~175 ms of the 177 ms is **CPU JPEG decode + host→device copy**. Resize/normalize is ~1%.
+- The 12× win (177→15) is **GPU decode (nvJPEG)**, i.e. `torchvision.io.decode_jpeg(device="cuda")`
+  — *not* the kernel. The kernel is the resize/normalize component of that GPU pipeline.
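The ratios quoted in this section follow directly from the tables. The snippet below only redoes that arithmetic on the numbers printed above; the dictionary keys are hypothetical labels, and the last ratio mixes the standalone and decode tables, so it is indicative only.

```python
# ms/iter values copied from the three tables in this section.
standalone = {"processor": 3.92, "separable": 1.29, "fused": 11.49}
e2e = {"processor": 14.44, "separable": 13.76}
decode = {"cpu_tv": 177.5, "cpu_kernel": 176.4, "gpu_kernel": 14.8}

stage_speedup = standalone["processor"] / standalone["separable"]  # resize/normalize stage
fused_slowdown = standalone["fused"] / standalone["separable"]     # 2-D vs separable
e2e_gain = 1 - e2e["separable"] / e2e["processor"]                 # bf16 end-to-end
decode_speedup = decode["cpu_tv"] / decode["gpu_kernel"]           # CPU decode vs nvJPEG
kernel_share = standalone["separable"] / decode["gpu_kernel"]      # kernel share of GPU pipeline

print(f"stage speedup   : {stage_speedup:.2f}x")
print(f"fused slowdown  : {fused_slowdown:.1f}x")
print(f"e2e gain        : {e2e_gain:.1%}")
print(f"decode speedup  : {decode_speedup:.1f}x")
print(f"kernel share    : {kernel_share:.1%}")
```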
+---
+## 9. What is true / what to claim
+- The kernel is **correct** (≤1e-4 vs torchvision-float, more accurate than the processor's
+  uint8 path) and feeds the model with **no inference regression**.
+- It is **~3× the real processor at the resize/normalize stage** — a real, parity-clean win.
+- It does **not** speed up preprocessing 12×. Decode dominates the data path; the GPU-decode
+  lever is nvJPEG, a torchvision feature, not this kernel.
+- The kernel matters end-to-end only once you are **not decode-bound**: in a GPU-decode
+  pipeline it keeps resize/normalize minimal (~10% of that pipeline), and its standalone
+  preprocess win shows up when the forward is small (bf16, small model, large batch).
+- Honest one-liner: *"GPU-native resize+normalize, 3× the fast processor at that stage,
+  drop-in for a GPU-decode pipeline."*

README.md ADDED

	@@ -0,0 +1,108 @@

+---
+tags:
+  - kernel
+---
+# kernel_image_resize
+A pure-Triton Hub kernel that fuses the **resize + rescale + normalize** preprocessing
+pipeline run by ~150 `transformers` fast image processors (`TorchvisionBackend`: resize →
+fold(rescale, normalize)) into a single GPU pipeline (one launch for the `fused` backend, two
+for the default `separable` backend). It takes raw CHW `uint8` images and returns the
+normalized `(N, C, out_h, out_w)` float tensor with no intermediate full-resolution float
+buffer.
+On a ragged SigLIP-so400m batch (A100, N=32, inputs 384–1024², out 384², bicubic+antialias)
+the default backend runs in **1.29 ms/iter vs 3.90 ms for the fast processor (~3× faster)**
+and 2.89 ms for torchvision's own per-image loop, at parity ≤1e-4 vs torchvision-float.
+It ships as a `kernels` universal build variant (no compiled extension, just Triton), so it
+loads on any CUDA PyTorch build via `get_kernel`.
+## Usage
+```python
+import torch
+from kernels import get_kernel
+kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)
+# a list of different-H×W uint8 CHW images (the ragged case torchvision loops over)
+images = [torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device="cuda")
+          for h, w in [(640, 480), (800, 600), (384, 1024)]]
+pixel_values = kir.resize_normalize(
+    images,
+    size=384,                      # int (square), (H, W), or {"height", "width"}
+    image_mean=[0.5, 0.5, 0.5],
+    image_std=[0.5, 0.5, 0.5],
+    rescale_factor=1 / 255,
+    resample="bicubic",            # or "bilinear", or a PIL resample int
+    antialias=True,                # match the ViT/CLIP/SigLIP default
+)
+# -> (3, 3, 384, 384) float32, ready for the model
+```
+`trust_remote_code=True` is required because this is a personal namespace (not the trusted
+`kernels-community` org). `revision="main"` loads the current code; tag a `v1.0.0` release if
+you want `version=1` loading instead.
+`resize_normalize` accepts a stacked `(N, C, H, W)` tensor or a ragged list of CHW
+tensors. `resize_normalize_ragged` is the same kernel, list-only.
+## With a transformers processor
+There is no `use_kernels=True` hook for image processors — that machinery swaps `nn.Module`
+layer forwards inside the model, not processor code. Use the kernel directly with the
+processor's config (see `example_transformers.py` for a runnable version):
+```python
+from kernels import get_kernel
+kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)
+_PIL_RESAMPLE = {0: "bilinear", 2: "bilinear", 3: "bicubic"}
+def preprocess_with_kernel(processor, images):
+    size = processor.size  # must be fixed {"height", "width"}; no crop/pad
+    return kir.resize_normalize(
+        images, (size["height"], size["width"]),
+        processor.image_mean, processor.image_std,
+        rescale_factor=float(processor.rescale_factor),
+        resample=_PIL_RESAMPLE[int(processor.resample)],
+        antialias=bool(getattr(processor, "antialias", True)),
+    )
+```
+## Backends
+- `backend="separable"` (default): two-pass `uint8` kernel doing `taps+taps` loads —
+  torchvision's own separable algorithm. Fastest (~3× the fast processor on the batch
+  above); parity ≤1e-4 vs torchvision-float. The float intermediate makes it more accurate
+  than, but not bit-identical to, torchvision's fixed-point `uint8` intermediate.
+- `backend="fused"`: a single 2D launch, `taps×taps` loads per output pixel. Same parity,
+  kept as the reference path but ~9× slower than separable (the 2D float load count is the
+  reason a separable pass wins — see `benchmarks/benchmark.py`).
+## Parity notes
+The resampling weights match PyTorch aten `UpSampleKernel`. Antialiased bicubic uses the
+PIL cubic coefficient `a=-0.5`; non-antialiased bicubic uses Keys `a=-0.75`. The
+antialias renormalize-truncate window applies on every axis, including upsampling dims.
+## Center crop / shortest-edge
+Pass `crop_size` to resize then center-crop in one pass (the crop is folded into the
+output-coordinate mapping, no extra buffer). `resize_mode="shortest_edge"` does
+aspect-preserving resize (short side = `size`) then crop — the CLIP / DINOv2 pipeline.
+```python
+# CLIP/DINOv2-style: resize shortest edge to 256, center-crop 224
+pv = kir.resize_normalize(images, 256, mean, std, resample="bicubic", antialias=True,
+                          crop_size=224, resize_mode="shortest_edge")
+```
+`example_transformers.py` derives all of this from a processor's config automatically.
+## Scope
+Resize (+ optional center crop) + rescale + normalize. It does **not** pad — padding
+processors (many detection models) run a different pipeline. The `fused` backend is
+resize-only; crop is handled by the `separable` backend.

benchmarks/benchmark.py ADDED

	@@ -0,0 +1,293 @@

+"""Benchmark resize+normalize: separable / fused triton vs torchvision vs the real processor.
+    PYTHONPATH=../torch-ext python benchmark.py --processor google/siglip-so400m-patch14-384
+    PYTHONPATH=../torch-ext python benchmark.py --n 32 --out 384 384 --interp bicubic --antialias
+Prints parity (vs torchvision-float) per backend, then ms/iter for each path. Needs CUDA.
+"""
+import argparse
+import sys
+import time
+# Hide `kernels` from transformers: this worktree builds kernels.LayerRepository without a version,
+# which newer `kernels` rejects at import. Preprocessing needs no hub layer kernels.
+sys.modules["kernels"] = None
+import torch
+import torchvision.transforms.v2.functional as tvF
+from torchvision.io import ImageReadMode, decode_jpeg, encode_jpeg
+from torchvision.transforms import InterpolationMode
+from kernel_image_resize import resize_normalize
+from kernel_image_resize._pack import PIL_RESAMPLE_TO_INTERP, max_taps
+_TV_INTERP = {"bilinear": InterpolationMode.BILINEAR, "bicubic": InterpolationMode.BICUBIC}
+def make_ragged_images(n, device, min_res, max_res, seed=0):
+    g = torch.Generator(device="cpu").manual_seed(seed)
+    images = []
+    for _ in range(n):
+        h = int(torch.randint(min_res, max_res + 1, (1,), generator=g).item())
+        w = int(torch.randint(min_res, max_res + 1, (1,), generator=g).item())
+        images.append(torch.randint(0, 256, (3, h, w), generator=g, dtype=torch.uint8).to(device))
+    return images
+def torchvision_reference(images, out_h, out_w, mean, std, rescale, interp, antialias):
+    mode = _TV_INTERP[interp]
+    mean_t = torch.tensor(mean, device=images[0].device).view(3, 1, 1)
+    std_t = torch.tensor(std, device=images[0].device).view(3, 1, 1)
+    outs = []
+    for img in images:
+        r = tvF.resize(img.float(), [out_h, out_w], interpolation=mode, antialias=antialias)
+        outs.append((r * rescale - mean_t) / std_t)
+    return torch.stack(outs)
+def build_compiled_reference(out_h, out_w, mean, std, rescale, interp, antialias, device):
+    """torch.compile(dynamic=True) of a per-image float resize+normalize."""
+    import torch.nn.functional as F
+    mean_t = torch.tensor(mean, device=device).view(3, 1, 1)
+    std_t = torch.tensor(std, device=device).view(3, 1, 1)
+    mode = "bicubic" if interp == "bicubic" else "bilinear"
+    def _one(img):
+        r = F.interpolate(img.unsqueeze(0).float(), size=(out_h, out_w), mode=mode, antialias=antialias, align_corners=False)
+        return (r.squeeze(0) * rescale - mean_t) / std_t
+    compiled = torch.compile(_one, dynamic=True)
+    def run(images):
+        return torch.stack([compiled(img) for img in images])
+    return run
+def pad_stack(images):
+    """Pad ragged CHW images to the batch-max H/W and stack into (N, C, Hmax, Wmax)."""
+    c = images[0].shape[0]
+    max_h = max(img.shape[1] for img in images)
+    max_w = max(img.shape[2] for img in images)
+    out = torch.zeros(len(images), c, max_h, max_w, dtype=images[0].dtype, device=images[0].device)
+    for i, img in enumerate(images):
+        out[i, :, : img.shape[1], : img.shape[2]] = img
+    return out
+def build_packed_compiled_reference(out_h, out_w, mean, std, rescale, interp, antialias, device):
+    """torch.compile of a single batched resize+normalize over a stacked (N, C, H, W) tensor."""
+    import torch.nn.functional as F
+    mean_t = torch.tensor(mean, device=device).view(1, 3, 1, 1)
+    std_t = torch.tensor(std, device=device).view(1, 3, 1, 1)
+    mode = "bicubic" if interp == "bicubic" else "bilinear"
+    def _batch(stacked):
+        r = F.interpolate(stacked.float(), size=(out_h, out_w), mode=mode, antialias=antialias, align_corners=False)
+        return (r * rescale - mean_t) / std_t
+    return torch.compile(_batch, dynamic=True)
+def run_inference(model_id, images, block, iters, device):
+    """End-to-end: preprocess (processor / separable / fused / compiled) -> vision features (bf16 forward).
+    Checks each kernel feeds the model with no feature drift and times the full pipeline."""
+    from transformers import AutoModel
+    proc, (out_h, out_w, mean, std, rescale, interp, antialias) = load_processor_config(model_id)
+    model = AutoModel.from_pretrained(model_id).to(device=device, dtype=torch.bfloat16).eval()
+    vision = model.vision_model
+    kk = dict(size=(out_h, out_w), image_mean=mean, image_std=std, rescale_factor=rescale,
+              resample=interp, antialias=antialias, block=block)
+    @torch.no_grad()
+    def features(pixel_values):
+        out = vision(pixel_values=pixel_values.to(model.dtype))
+        pooled = getattr(out, "pooler_output", None)
+        return pooled if pooled is not None else out.last_hidden_state
+    compiled_one = build_compiled_reference(out_h, out_w, mean, std, rescale, interp, antialias, device)
+    methods = {
+        "processor": lambda: proc(images, return_tensors="pt", device=device)["pixel_values"],
+        "separable": lambda: resize_normalize(images, backend="separable", **kk),
+        "fused": lambda: resize_normalize(images, backend="fused", **kk),
+        "compiled": lambda: compiled_one(images),
+    }
+    methods["compiled"]()  # warmup the compiled artifact
+    methods["compiled"]()
+    torch.cuda.synchronize()
+    print(f"\n[infer] {model_id}  out={out_h}x{out_w}  forward dtype=bfloat16")
+    base = features(methods["processor"]())
+    base_scale = base.abs().max().item()
+    for name in ("separable", "fused", "compiled"):
+        d = (features(methods[name]()) - base).abs().max().item()
+        print(f"[infer parity] features {name} vs processor: max|Δ| = {d:.2e}  ({d / base_scale:.1%} of feature max)")
+    # forward is timed on a FIXED precomputed tensor, so it is method-independent by construction;
+    # if it varies across rows, the preprocessor's output (dtype/contiguity) is hurting the model.
+    print("[infer] ms/iter:    preprocess   forward(fixed input)   preprocess+forward")
+    for name, preprocess in methods.items():
+        pixel_values = preprocess()
+        pre = _time(preprocess, iters, device)
+        fwd = _time(lambda pixel_values=pixel_values: features(pixel_values), iters, device)
+        e2e = _time(lambda preprocess=preprocess: features(preprocess()), iters, device)
+        print(f"  {name:10s}    {pre:8.3f}     {fwd:8.3f}              {e2e:8.3f}")
+def run_decode(images_cpu, out_h, out_w, mean, std, rescale, interp, antialias, block, iters, device):
+    """Data-path table from JPEG bytes: CPU decode (libjpeg) vs GPU decode (nvJPEG) + the kernel.
+    decoders differ at the pixel level (nvJPEG vs libjpeg), so this measures wall-clock, not parity.
+    """
+    jpeg = [encode_jpeg(img, quality=95) for img in images_cpu]
+    avg_kb = sum(b.numel() for b in jpeg) / len(jpeg) / 1024
+    kk = dict(size=(out_h, out_w), image_mean=mean, image_std=std, rescale_factor=rescale,
+              resample=interp, antialias=antialias, block=block)
+    def cpu_decode_kernel():
+        imgs = [decode_jpeg(b, mode=ImageReadMode.RGB).to(device) for b in jpeg]
+        return resize_normalize(imgs, backend="separable", **kk)
+    def gpu_decode_kernel():
+        imgs = decode_jpeg(jpeg, mode=ImageReadMode.RGB, device=device)
+        return resize_normalize(imgs, backend="separable", **kk)
+    def gpu_decode_torchvision():
+        imgs = decode_jpeg(jpeg, mode=ImageReadMode.RGB, device=device)
+        return torchvision_reference(imgs, out_h, out_w, mean, std, rescale, interp, antialias)
+    def cpu_decode_torchvision():
+        imgs = [decode_jpeg(b, mode=ImageReadMode.RGB).to(device) for b in jpeg]
+        return torchvision_reference(imgs, out_h, out_w, mean, std, rescale, interp, antialias)
+    print(f"\n[decode] N={len(jpeg)}  avg={avg_kb:.0f} KB/img  out={out_h}x{out_w}  (from JPEG bytes, ms/iter)")
+    print(f"  CPU decode + torchvision resize : {_time(cpu_decode_torchvision, iters, device):8.3f}   [status quo data path]")
+    print(f"  CPU decode + separable kernel   : {_time(cpu_decode_kernel, iters, device):8.3f}")
+    print(f"  GPU decode (nvJPEG) + tv resize : {_time(gpu_decode_torchvision, iters, device):8.3f}   [GPU pipeline, tv resize]")
+    print(f"  GPU decode (nvJPEG) + kernel    : {_time(gpu_decode_kernel, iters, device):8.3f}   [GPU pipeline, kernel resize]")
+def load_processor_config(name):
+    from transformers import AutoImageProcessor
+    proc = AutoImageProcessor.from_pretrained(name, backend="torchvision")
+    size = proc.size
+    if "height" not in size or "width" not in size:
+        raise ValueError(f"{name}: size={size} is not a fixed (height, width)")
+    out_h, out_w = size["height"], size["width"]
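+    interp = PIL_RESAMPLE_TO_INTERP.get(int(proc.resample))
+    if interp is None:
+        # fail early instead of passing interp=None to the resize paths
+        raise ValueError(f"{name}: unsupported resample={proc.resample}")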
+    rescale = float(proc.rescale_factor) if getattr(proc, "do_rescale", True) else 1.0
+    antialias = bool(getattr(proc, "antialias", True))
+    return proc, (out_h, out_w, list(proc.image_mean), list(proc.image_std), rescale, interp, antialias)
+def _time(fn, iters, device):
+    for _ in range(3):
+        fn()
+    if device.type == "cuda":
+        torch.cuda.synchronize()
+        start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
+        start.record()
+        for _ in range(iters):
+            fn()
+        end.record()
+        torch.cuda.synchronize()
+        return start.elapsed_time(end) / iters
+    t0 = time.perf_counter()
+    for _ in range(iters):
+        fn()
+    return (time.perf_counter() - t0) / iters * 1e3
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--processor", default=None)
+    parser.add_argument("--n", type=int, default=32)
+    parser.add_argument("--out", type=int, nargs=2, default=[384, 384], metavar=("H", "W"))
+    parser.add_argument("--interp", choices=["bilinear", "bicubic"], default="bicubic")
+    parser.add_argument("--antialias", action="store_true")
+    parser.add_argument("--min-res", type=int, default=384)
+    parser.add_argument("--max-res", type=int, default=1024)
+    parser.add_argument("--iters", type=int, default=50)
+    parser.add_argument("--block", type=int, default=256)
+    parser.add_argument("--tol", type=float, default=3e-3)
+    parser.add_argument("--infer", action="store_true", help="end-to-end Siglip2 inference comparison (bf16 forward)")
+    parser.add_argument("--model", default="google/siglip2-base-patch16-224", help="model for --infer")
+    parser.add_argument("--decode", action="store_true", help="JPEG-decode data-path table (CPU vs GPU/nvJPEG) and stop")
+    args = parser.parse_args()
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    if device.type != "cuda":
+        print("benchmark needs CUDA.")
+        return
+    proc = None
+    if args.processor:
+        proc, (out_h, out_w, mean, std, rescale, interp, antialias) = load_processor_config(args.processor)
+        print(f"processor={args.processor}  ->  out={out_h}x{out_w}  interp={interp}  antialias={antialias}")
+    else:
+        out_h, out_w = args.out
+        mean = [0.48145466, 0.4578275, 0.40821073]
+        std = [0.26862954, 0.26130258, 0.27577711]
+        rescale = 1.0 / 255.0
+        interp, antialias = args.interp, args.antialias
+    images = make_ragged_images(args.n, device, args.min_res, args.max_res)
+    taps = (max_taps(images, out_h, 1, interp, antialias), max_taps(images, out_w, 2, interp, antialias))
+    print(f"N={args.n}  in∈[{args.min_res},{args.max_res}]² ragged  out={out_h}x{out_w}  "
+          f"interp={interp}  antialias={antialias}  max_taps={taps}  iters={args.iters}\n")
+    if args.decode:
+        images_cpu = make_ragged_images(args.n, torch.device("cpu"), args.min_res, args.max_res)
+        run_decode(images_cpu, out_h, out_w, mean, std, rescale, interp, antialias, args.block, args.iters, device)
+        return
+    ref = torchvision_reference(images, out_h, out_w, mean, std, rescale, interp, antialias)
+    common = dict(size=(out_h, out_w), image_mean=mean, image_std=std, rescale_factor=rescale,
+                  resample=interp, antialias=antialias, block=args.block)
+    for backend in ("fused", "separable"):
+        got = resize_normalize(images, backend=backend, **common)
+        d = (got - ref).abs().max().item()
+        print(f"[parity] {backend:9s} vs torchvision(float): max|Δ| = {d:.2e}  "
+              f"({'PASS' if d < args.tol else 'FAIL'} @ tol={args.tol})")
+    print()
+    compiled_run = build_compiled_reference(out_h, out_w, mean, std, rescale, interp, antialias, device)
+    packed = pad_stack(images)
+    packed_compiled_run = build_packed_compiled_reference(out_h, out_w, mean, std, rescale, interp, antialias, device)
+    t0 = time.perf_counter()
+    compiled_run(images)
+    compiled_run(images)
+    packed_compiled_run(packed)
+    packed_compiled_run(packed)
+    torch.cuda.synchronize()
+    t_warmup = (time.perf_counter() - t0) * 1e3
+    t_eager = _time(lambda: torchvision_reference(images, out_h, out_w, mean, std, rescale, interp, antialias), args.iters, device)
+    t_comp = _time(lambda: compiled_run(images), args.iters, device)
+    t_comp_packed = _time(lambda: packed_compiled_run(packed), args.iters, device)
+    t_fused = _time(lambda: resize_normalize(images, backend="fused", **common), args.iters, device)
+    t_sep = _time(lambda: resize_normalize(images, backend="separable", **common), args.iters, device)
+    print("Resize+normalize only (no decode/H2D), ms/iter:")
+    print(f"  torchvision eager loop  : {t_eager:8.3f}   [per-image float loop]")
+    print(f"  torchvision compiled    : {t_comp:8.3f}   [torch.compile dynamic per-image; warmup {t_warmup:.0f} ms excluded]")
+    print(f"  torchvision compiled pkt: {t_comp_packed:8.3f}   [one graph over padded (N,C,Hmax,Wmax) stack; timing only, padding alters output]")
+    print(f"  fused triton (2D)       : {t_fused:8.3f}   [taps*taps]")
+    print(f"  separable triton (uint8): {t_sep:8.3f}   [taps+taps]")
+    if proc is not None:
+        t_pr = _time(lambda: proc(images, return_tensors="pt", device=device)["pixel_values"], args.iters, device)
+        print(f"\n  {args.processor} : {t_pr:8.3f} ms/iter")
+        print(f"  -> separable takes {t_sep / t_pr:.2f}x the real processor's time (lower is faster)")
+    if args.infer:
+        run_inference(args.model, images, args.block, args.iters, device)
+if __name__ == "__main__":
+    main()

benchmarks/compat_check.py ADDED

	@@ -0,0 +1,127 @@

+"""Compatibility + forward-parity sweep over the most-downloaded image models (one per architecture).
+For each model: load its processor + model, decide whether the kernel can stand in for the
+processor's resize(+crop)+normalize, and for supported ones run processor-vs-kernel pixel_values
+through the SAME vision tower and report pixel + feature parity. Unsupported models list the reason.
+Run on the DGX (CUDA + working transformers):
+    PYTHONPATH=../torch-ext python compat_check.py
+"""
+import sys
+# This transformers worktree constructs kernels.LayerRepository without a version/revision, which
+# newer `kernels` rejects at import. We do not need hub LAYER kernels for preprocessing, so hide
+# `kernels` from transformers — it falls back to its no-hub-kernels stub path and imports cleanly.
+sys.modules["kernels"] = None
+import torch
+from kernel_image_resize import resize_normalize  # local package, via PYTHONPATH=../torch-ext
+_PIL_RESAMPLE = {2: "bilinear", 3: "bicubic"}
+# Top image models by HF downloads (June 2026), deduplicated to one repo per architecture family.
+MODELS = [
+    ("openai/clip-vit-base-patch32", 20528683),                  # clip
+    ("google/vit-base-patch16-224", 4910416),                    # vit
+    ("apple/mobilevit-small", 3488074),                          # mobilevit
+    ("facebook/dinov2-small", 2602780),                          # dinov2
+    ("google/siglip-so400m-patch14-384", 1379598),               # siglip
+    ("facebook/dinov3-vitb16-pretrain-lvd1689m", 467337),        # dinov3
+    ("microsoft/swinv2-tiny-patch4-window16-256", 385713),       # swinv2
+    ("google/siglip2-base-patch16-224", 336824),                 # siglip2
+    ("microsoft/resnet-50", 307057),                             # resnet (convnext processor)
+    ("nvidia/segformer-b0-finetuned-ade-512-512", 262459),       # segformer
+    ("facebook/convnextv2-tiny-22k-384", 48614),                 # convnextv2
+    ("google/mobilenet_v2_1.0_224", 48342),                      # mobilenet
+    ("facebook/convnext-tiny-224", 16984),                       # convnext
+    ("google/efficientnet-b0", 8577),                            # efficientnet
+    ("microsoft/beit-base-patch16-224-pt22k-ft22k", 7529),       # beit
+]
+def unsupported_reason(p):
+    """Return None if the kernel can stand in for this processor, else a short reason."""
+    if not getattr(p, "do_resize", True):
+        return "no resize"
+    if not getattr(p, "do_normalize", True):
+        return "no normalize (rescale only)"
+    if getattr(p, "do_flip_channel_order", False):
+        return "channel flip (BGR)"
+    if getattr(p, "do_pad", False):
+        return "pad"
+    if int(getattr(p, "resample", 2)) not in _PIL_RESAMPLE:
+        return f"resample {p.resample}"
+    size = getattr(p, "size", {}) or {}
+    crop = p.crop_size if getattr(p, "do_center_crop", False) else None
+    if "shortest_edge" in size:
+        return None if crop else "shortest_edge without crop (variable output)"
+    if "height" in size and "width" in size:
+        return None
+    return f"size {size}"
+def preprocess_with_kernel(p, images):
+    size = p.size
+    resample = _PIL_RESAMPLE[int(p.resample)]
+    antialias = bool(getattr(p, "antialias", True))
+    rescale = float(p.rescale_factor) if getattr(p, "do_rescale", True) else 1.0
+    mean, std = p.image_mean, p.image_std
+    crop = p.crop_size if getattr(p, "do_center_crop", False) else None
+    common = dict(rescale_factor=rescale, resample=resample, antialias=antialias)
+    if "shortest_edge" in size:
+        return resize_normalize(
+            images, size["shortest_edge"], mean, std,
+            crop_size=(crop["height"], crop["width"]), resize_mode="shortest_edge", **common)
+    if crop is not None and (crop["height"] != size["height"] or crop["width"] != size["width"]):
+        return resize_normalize(
+            images, (size["height"], size["width"]), mean, std,
+            crop_size=(crop["height"], crop["width"]), resize_mode="square", **common)
+    return resize_normalize(images, (size["height"], size["width"]), mean, std, **common)
+def vision_features(model, pixel_values):
+    tower = getattr(model, "vision_model", model)
+    out = tower(pixel_values=pixel_values.to(model.dtype))
+    for attr in ("pooler_output", "last_hidden_state"):
+        value = getattr(out, attr, None)
+        if value is not None and torch.is_tensor(value):
+            return value
+    return out[0]
+def main():
+    from transformers import AutoImageProcessor, AutoModel  # lazy: avoids importing the kernels lib first
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    images = [
+        torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device=device)
+        for h, w in [(640, 480), (800, 600), (512, 512), (384, 1024)]
+    ]
+    print(f"{'model':46s} {'verdict':10s} pixel max|Δ|   feature max|Δ| (rel)")
+    for model_id, _ in MODELS:
+        try:
+            processor = AutoImageProcessor.from_pretrained(model_id)
+            reason = unsupported_reason(processor)
+            if reason is not None:
+                print(f"{model_id:46s} SKIP: {reason}")
+                continue
+            model = AutoModel.from_pretrained(model_id).to(device).eval()
+            reference_pv = processor(images, return_tensors="pt", device=device)["pixel_values"].to(device)
+            kernel_pv = preprocess_with_kernel(processor, images)
+            pixel_delta = (kernel_pv - reference_pv).abs().max().item()
+            with torch.no_grad():
+                base = vision_features(model, reference_pv)
+                feat_delta = (vision_features(model, kernel_pv) - base).abs().max().item()
+            rel = feat_delta / base.abs().max().item()
+            print(f"{model_id:46s} OK         {pixel_delta:.2e}      {feat_delta:.2e} ({rel:.1%})")
+            del model
+            torch.cuda.empty_cache()
+        except Exception as e:
+            print(f"{model_id:46s} ERROR: {type(e).__name__}: {str(e)[:55]}")
+if __name__ == "__main__":
+    main()
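
The `sys.modules["kernels"] = None` line near the top of `compat_check.py` relies on standard Python import behavior: a `None` entry in `sys.modules` makes a later import of that name fail with `ImportError`, and `importlib.util.find_spec` reports the module as missing. A minimal sketch of that mechanism follows. The module name `hypothetical_optional_dep` is made up for illustration, and the sketch does not show how `transformers` itself probes for `kernels`.

```python
import importlib.util
import sys

NAME = "hypothetical_optional_dep"  # made-up name, stands in for `kernels`

# A None entry tells the import system to treat the module as unavailable.
sys.modules[NAME] = None

# Style 1: try/except around the import.
try:
    importlib.import_module(NAME)
    import_ok = True
except ImportError:
    import_ok = False

# Style 2: spec lookup, which returns None for a None entry.
spec = importlib.util.find_spec(NAME)

# Clean up so the sketch leaves no trace in the interpreter.
del sys.modules[NAME]

print(import_ok, spec)
```

Both probing styles see the package as absent, which is why the assignment has to happen before the first `transformers` import.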

build.toml ADDED

	@@ -0,0 +1,7 @@

+[general]
+name = "kernel_image_resize"
+universal = true
+version = 1
+[general.hub]
+repo-id = "Molbap/kernel_image_resize"

build/torch-universal/kernel_image_resize/__init__.py ADDED

	@@ -0,0 +1,113 @@

+"""Resize + rescale + normalize for transformers fast image processors, as a Triton kernel.
+resize -> fold(rescale, normalize) in one GPU pipeline: CHW uint8 images in,
+(N, C, out_h, out_w) normalized float out, no full-resolution float intermediate.
+- resize_normalize        — stacked (N, C, H, W) tensor or a list of CHW images.
+- resize_normalize_ragged — same kernels; takes a list of different-H/W CHW tensors.
+backend="separable" (default): two-pass uint8, taps+taps. backend="fused": single 2D
+launch, taps*taps. Both match torchvision's float path to within 1e-4.
+    from kernels import get_kernel
+    kir = get_kernel("Molbap/kernel_image_resize")
+    pixel_values = kir.resize_normalize(
+        images, size=384, image_mean=[...], image_std=[...], resample="bicubic", antialias=True,
+    )
+"""
+from ._fused import fused_resize_normalize
+from ._pack import PIL_RESAMPLE_TO_INTERP, as_image_list
+from ._separable import separable_resize_crop_normalize, separable_resize_normalize
+def _normalize_size(size) -> tuple[int, int]:
+    if isinstance(size, int):
+        return size, size
+    if isinstance(size, dict):
+        if "height" in size and "width" in size:
+            return int(size["height"]), int(size["width"])
+        raise ValueError(f"size dict must hold 'height'/'width' for a fixed resize, got {size}")
+    out_h, out_w = size
+    return int(out_h), int(out_w)
+def _normalize_resample(resample) -> str:
+    if isinstance(resample, str):
+        if resample not in ("bilinear", "bicubic"):
+            raise ValueError(f"resample must be 'bilinear' or 'bicubic', got {resample!r}")
+        return resample
+    interp = PIL_RESAMPLE_TO_INTERP.get(int(resample))
+    if interp is None:
+        raise ValueError(f"unsupported PIL resample code {resample}")
+    return interp
+def resize_normalize(
+    images,
+    size,
+    image_mean,
+    image_std,
+    rescale_factor: float = 1.0 / 255.0,
+    resample="bilinear",
+    antialias: bool = False,
+    backend: str = "separable",
+    block: int = 256,
+    crop_size=None,
+    resize_mode: str = "square",
+):
+    """Resize, optionally center-crop, rescale and normalize — one GPU pipeline.
+    Args:
+        images: a stacked `(N, C, H, W)` uint8/float tensor, or a list of CHW tensors (ragged).
+        size: resize target. With no crop: int (square), `(height, width)`, or `{"height","width"}`.
+            With `resize_mode="shortest_edge"`: an int, the short side after aspect-preserving resize.
+        image_mean, image_std: per-channel normalization stats (length C).
+        rescale_factor: folded into mean/std so the kernel does `(x*rescale - mean)/std`.
+        resample: "bilinear" / "bicubic", or a PIL resample int (0/2 -> bilinear, 3 -> bicubic).
+            Note that PIL code 0 is NEAREST; it is mapped to bilinear here, not to nearest-neighbor.
+        antialias: match the ViT/CLIP/SigLIP default (`True` for those processors).
+        backend: "separable" (default) or "fused" (2D reference, no crop support).
+        crop_size: `None` (no crop), int (square), or `(crop_h, crop_w)`. Center crop after resize.
+        resize_mode: "square" (resize to `size`) or "shortest_edge" (aspect-preserving, needs a crop).
+    """
+    interp = _normalize_resample(resample)
+    image_list = as_image_list(images)
+    if crop_size is not None or resize_mode == "shortest_edge":
+        crop_h, crop_w = _normalize_size(crop_size if crop_size is not None else size)
+        resize_arg = int(size) if resize_mode == "shortest_edge" else _normalize_size(size)
+        return separable_resize_crop_normalize(
+            image_list, resize_arg, (crop_h, crop_w), image_mean, image_std, rescale_factor,
+            interp, antialias, resize_mode, block,
+        )
+    out_h, out_w = _normalize_size(size)
+    if backend == "fused":
+        return fused_resize_normalize(image_list, out_h, out_w, image_mean, image_std, rescale_factor, interp, antialias, block)
+    if backend == "separable":
+        return separable_resize_normalize(image_list, out_h, out_w, image_mean, image_std, rescale_factor, interp, antialias, block)
+    raise ValueError(f"backend must be 'fused' or 'separable', got {backend!r}")
+def resize_normalize_ragged(
+    images,
+    size,
+    image_mean,
+    image_std,
+    rescale_factor: float = 1.0 / 255.0,
+    resample="bilinear",
+    antialias: bool = False,
+    backend: str = "separable",
+    block: int = 256,
+):
+    """Variant taking a list of different-H/W CHW tensors. Same kernels as `resize_normalize`."""
+    if isinstance(images, list):
+        image_list = images
+    else:
+        raise ValueError("resize_normalize_ragged expects a list of CHW tensors; use resize_normalize for a stacked tensor")
+    return resize_normalize(
+        image_list, size, image_mean, image_std, rescale_factor, resample, antialias, backend, block
+    )
+__all__ = ["resize_normalize", "resize_normalize_ragged"]
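
The branching in `resize_normalize` has one consequence that is easy to miss: whenever `crop_size` is set or `resize_mode="shortest_edge"`, the call goes to the separable crop path and the `backend` argument is never consulted. The sketch below mirrors only that control flow. `dispatch_path` is a hypothetical helper written for illustration, not part of the package, and it returns the name of the function that would be called rather than calling it.

```python
def dispatch_path(backend="separable", crop_size=None, resize_mode="square"):
    """Return the name of the driver that resize_normalize would call."""
    # Crop or shortest-edge requests always take the separable crop path;
    # `backend` is not checked on this branch.
    if crop_size is not None or resize_mode == "shortest_edge":
        return "separable_resize_crop_normalize"
    if backend == "fused":
        return "fused_resize_normalize"
    if backend == "separable":
        return "separable_resize_normalize"
    raise ValueError(f"backend must be 'fused' or 'separable', got {backend!r}")


print(dispatch_path())                                # plain resize, default backend
print(dispatch_path(backend="fused"))                 # plain resize, 2D reference
print(dispatch_path(backend="fused", crop_size=224))  # crop wins over backend
```

So a caller who asks for `backend="fused"` together with a crop gets the separable kernels, which matches the docstring's "no crop support" note for the fused backend.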

build/torch-universal/kernel_image_resize/_fused.py ADDED

	@@ -0,0 +1,134 @@

+"""Fused 2D resize+rescale+normalize over a ragged batch, single launch.
+One program owns one image and a BLOCK of its output pixels, gathers a
+MAX_TAPS_H × MAX_TAPS_W window, applies the separable weights as a 2D product, then folds
+rescale+normalize. taps×taps loads per output pixel.
+Resampling-weight formula (PyTorch aten UpSampleKernel):
+    scale   = in / out
+    eff     = scale if (antialias and scale > 1) else 1   # filter stretch factor
+    support = interp_half * eff                           # interp_half: 1 linear, 2 cubic
+    center  = scale * (i + 0.5)
+    weight  = filter((tap - center + 0.5) / eff), renormalized over the realized window
+"""
+import triton
+import triton.language as tl
+from ._pack import fold_mean_std, max_taps, pack_images
+@triton.jit
+def _resample_weight(arg, cubic_a, CUBIC: tl.constexpr):
+    """Interpolation filter at `arg` (coordinate distance already divided by support)."""
+    ax = tl.abs(arg)
+    if CUBIC:  # Keys cubic convolution kernel, support 2
+        ax2 = ax * ax
+        ax3 = ax2 * ax
+        inner = (cubic_a + 2.0) * ax3 - (cubic_a + 3.0) * ax2 + 1.0  # |x| <= 1
+        outer = cubic_a * ax3 - 5.0 * cubic_a * ax2 + 8.0 * cubic_a * ax - 4.0 * cubic_a  # 1 < |x| < 2
+        return tl.where(ax <= 1.0, inner, tl.where(ax < 2.0, outer, 0.0))
+    return tl.maximum(1.0 - ax, 0.0)  # triangle (bilinear), support 1
+@triton.jit
+def _resize_normalize_kernel(
+    in_ptr, out_ptr, offsets_ptr, heights_ptr, widths_ptr, mean_ptr, std_ptr,
+    out_h, out_w, cubic_a,
+    C: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_H: tl.constexpr, MAX_TAPS_W: tl.constexpr,
+):
+    n = tl.program_id(0)
+    blk = tl.program_id(1)
+    H = tl.load(heights_ptr + n)
+    W = tl.load(widths_ptr + n)
+    off = tl.load(offsets_ptr + n)
+    Hf = H.to(tl.float32)
+    Wf = W.to(tl.float32)
+    npix = out_h * out_w
+    pos = blk * BLOCK + tl.arange(0, BLOCK)
+    mask = pos < npix
+    oy = pos // out_w
+    ox = pos % out_w
+    interp_half = 2.0 if CUBIC else 1.0
+    scale_h = Hf / out_h
+    scale_w = Wf / out_w
+    eff_h = tl.maximum(scale_h, 1.0) if ANTIALIAS else 1.0
+    eff_w = tl.maximum(scale_w, 1.0) if ANTIALIAS else 1.0
+    support_h = interp_half * eff_h
+    support_w = interp_half * eff_w
+    inv_h = 1.0 / eff_h
+    inv_w = 1.0 / eff_w
+    center_y = scale_h * (oy.to(tl.float32) + 0.5)
+    center_x = scale_w * (ox.to(tl.float32) + 0.5)
+    ystart = tl.floor(center_y - support_h + 0.5)
+    xstart = tl.floor(center_x - support_w + 0.5)
+    sum_wy = tl.zeros([BLOCK], dtype=tl.float32)
+    for ty in tl.static_range(MAX_TAPS_H):
+        yy = ystart + ty
+        wy = _resample_weight((yy - center_y + 0.5) * inv_h, cubic_a, CUBIC)
+        if ANTIALIAS:
+            wy = tl.where((yy >= 0.0) & (yy < Hf), wy, 0.0)
+        sum_wy += wy
+    sum_wx = tl.zeros([BLOCK], dtype=tl.float32)
+    for tx in tl.static_range(MAX_TAPS_W):
+        xx = xstart + tx
+        wx = _resample_weight((xx - center_x + 0.5) * inv_w, cubic_a, CUBIC)
+        if ANTIALIAS:
+            wx = tl.where((xx >= 0.0) & (xx < Wf), wx, 0.0)
+        sum_wx += wx
+    denom = sum_wy * sum_wx
+    plane = (H * W).to(tl.int64)
+    Wl = W.to(tl.int64)
+    for c in tl.static_range(C):
+        base = off + c * plane
+        acc = tl.zeros([BLOCK], dtype=tl.float32)
+        for ty in tl.static_range(MAX_TAPS_H):
+            yy = ystart + ty
+            wy = _resample_weight((yy - center_y + 0.5) * inv_h, cubic_a, CUBIC)
+            if ANTIALIAS:
+                wy = tl.where((yy >= 0.0) & (yy < Hf), wy, 0.0)
+            yidx = tl.minimum(tl.maximum(yy.to(tl.int32), 0), H - 1).to(tl.int64)
+            row = base + yidx * Wl
+            for tx in tl.static_range(MAX_TAPS_W):
+                xx = xstart + tx
+                wx = _resample_weight((xx - center_x + 0.5) * inv_w, cubic_a, CUBIC)
+                if ANTIALIAS:
+                    wx = tl.where((xx >= 0.0) & (xx < Wf), wx, 0.0)
+                xidx = tl.minimum(tl.maximum(xx.to(tl.int32), 0), W - 1).to(tl.int64)
+                pix = tl.load(in_ptr + row + xidx, mask=mask, other=0.0)
+                acc += wy * wx * pix
+        acc = acc / denom
+        m = tl.load(mean_ptr + c)
+        s = tl.load(std_ptr + c)
+        acc = (acc - m) / s
+        oidx = ((n * C + c) * out_h + oy) * out_w + ox
+        tl.store(out_ptr + oidx, acc, mask=mask)
+def fused_resize_normalize(images, out_h, out_w, mean, std, rescale, interp, antialias, block: int = 256):
+    """Single fused launch over a ragged packed buffer -> (N, C, out_h, out_w) normalized float."""
+    import torch
+    images = list(images)
+    device = images[0].device
+    n = len(images)
+    cubic_a = -0.5 if antialias else -0.75  # -0.5 (Keys / PIL) under antialias; -0.75 matches PyTorch's non-antialiased bicubic
+    max_taps_h = max_taps(images, out_h, 1, interp, antialias)
+    max_taps_w = max_taps(images, out_w, 2, interp, antialias)
+    mean_t, std_t = fold_mean_std(mean, std, rescale, device)
+    in_buf, offsets_t, heights_t, widths_t, c = pack_images(images)
+    out = torch.empty((n, c, out_h, out_w), device=device, dtype=torch.float32)
+    grid = (n, triton.cdiv(out_h * out_w, block))
+    _resize_normalize_kernel[grid](
+        in_buf, out, offsets_t, heights_t, widths_t, mean_t, std_t,
+        out_h, out_w, cubic_a, C=c, BLOCK=block,
+        CUBIC=(interp == "bicubic"), ANTIALIAS=antialias, MAX_TAPS_H=max_taps_h, MAX_TAPS_W=max_taps_w,
+    )
+    return out
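
The weight formula in the `_fused.py` docstring can be checked by hand with a few lines of plain Python. The sketch below recomputes the taps and renormalized weights for a single output index along one axis, following the same steps as the kernel: window start, filter evaluation, out-of-range masking under antialias, index clamping, renormalization. The names `filter_weight` and `axis_weights` are hypothetical helpers for illustration, not part of the package.

```python
import math


def filter_weight(x, cubic, a):
    """Triangle (bilinear) or cubic convolution filter evaluated at x."""
    ax = abs(x)
    if cubic:
        if ax <= 1.0:
            return (a + 2.0) * ax**3 - (a + 3.0) * ax**2 + 1.0
        if ax < 2.0:
            return a * ax**3 - 5.0 * a * ax**2 + 8.0 * a * ax - 4.0 * a
        return 0.0
    return max(1.0 - ax, 0.0)


def axis_weights(i, in_size, out_size, interp="bilinear", antialias=False):
    """Return [(input_index, weight), ...] for output index i on one axis."""
    cubic = interp == "bicubic"
    a = -0.5 if antialias else -0.75
    half = 2.0 if cubic else 1.0
    scale = in_size / out_size
    eff = max(scale, 1.0) if antialias else 1.0
    support = half * eff
    center = scale * (i + 0.5)
    start = math.floor(center - support + 0.5)
    taps = math.ceil(support) * 2 + 1  # same worst-case count as max_taps
    pairs = []
    for t in range(taps):
        tap = start + t
        w = filter_weight((tap - center + 0.5) / eff, cubic, a)
        if antialias and not (0 <= tap < in_size):
            w = 0.0  # out-of-range taps contribute nothing under antialias
        idx = min(max(tap, 0), in_size - 1)  # clamp the read index
        pairs.append((idx, w))
    total = sum(w for _, w in pairs)
    return [(idx, w / total) for idx, w in pairs]


# 2x antialiased bilinear downscale, output pixel 1 of 4 (input width 8).
print(axis_weights(1, 8, 4, "bilinear", antialias=True))
# -> [(1, 0.125), (2, 0.375), (3, 0.375), (4, 0.125), (5, 0.0)]
```

The trailing zero-weight tap shows why the kernels loop over a fixed `MAX_TAPS` and rely on the filter itself to zero out positions beyond the support.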

build/torch-universal/kernel_image_resize/_pack.py ADDED

	@@ -0,0 +1,62 @@

+"""Ragged packing + resampling helpers shared by the fused and separable backends."""
+import math
+import torch
+PIL_RESAMPLE_TO_INTERP = {0: "bilinear", 2: "bilinear", 3: "bicubic"}
+def pack_images(
+    images: list[torch.Tensor], dtype: torch.dtype = torch.float32
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, int]:
+    """Concatenate a ragged list of CHW images into one flat buffer of `dtype`.
+    Returns (in_buf, offsets, heights, widths, channels); offsets[n] is the element index
+    where image n starts.
+    """
+    device = images[0].device
+    channels = images[0].shape[0]
+    flats, offsets, heights, widths, cur = [], [], [], [], 0
+    for img in images:
+        ic, ih, iw = img.shape
+        if ic != channels:
+            raise ValueError(f"all images must share channel count {channels}, got {ic}")
+        flats.append(img.reshape(-1).to(dtype))
+        offsets.append(cur)
+        heights.append(ih)
+        widths.append(iw)
+        cur += ic * ih * iw
+    in_buf = torch.cat(flats)
+    offsets_t = torch.tensor(offsets, device=device, dtype=torch.int64)
+    heights_t = torch.tensor(heights, device=device, dtype=torch.int32)
+    widths_t = torch.tensor(widths, device=device, dtype=torch.int32)
+    return in_buf, offsets_t, heights_t, widths_t, channels
+def fold_mean_std(mean, std, rescale: float, device) -> tuple[torch.Tensor, torch.Tensor]:
+    """Fold rescale into mean/std so the kernel does (x - m)/s == (x*rescale - mean)/std."""
+    mean_t = (torch.tensor(mean, device=device, dtype=torch.float32) / rescale).contiguous()
+    std_t = (torch.tensor(std, device=device, dtype=torch.float32) / rescale).contiguous()
+    return mean_t, std_t
+def max_taps(images: list[torch.Tensor], out_size: int, axis_dim: int, interp: str, antialias: bool) -> int:
+    """Batch-wide worst-case tap count for one axis = ceil(support) * 2 + 1."""
+    interp_half = 2.0 if interp == "bicubic" else 1.0
+    worst = 0
+    for img in images:
+        scale = img.shape[axis_dim] / out_size
+        eff = max(scale, 1.0) if antialias else 1.0
+        worst = max(worst, math.ceil(interp_half * eff) * 2 + 1)
+    return worst
+def as_image_list(images) -> list[torch.Tensor]:
+    """Accept a stacked (N, C, H, W) tensor or a list of CHW tensors; always return a list."""
+    if isinstance(images, torch.Tensor):
+        if images.dim() != 4:
+            raise ValueError(f"stacked input must be (N, C, H, W), got shape {tuple(images.shape)}")
+        return list(images)
+    return list(images)
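
Two of the helpers in `_pack.py` reduce to small arithmetic identities that are easy to verify without a GPU. The sketch below uses plain Python lists instead of tensors: `fold` and `tap_count` are hypothetical single-image stand-ins for `fold_mean_std` and `max_taps`, written for illustration only.

```python
import math


def fold(mean, std, rescale):
    """Divide mean and std by rescale so (x - m) / s == (x * rescale - mean) / std."""
    return [m / rescale for m in mean], [s / rescale for s in std]


def tap_count(in_size, out_size, interp="bilinear", antialias=False):
    """Worst-case 1D window size for one image: ceil(support) * 2 + 1."""
    half = 2.0 if interp == "bicubic" else 1.0
    scale = in_size / out_size
    eff = max(scale, 1.0) if antialias else 1.0
    return math.ceil(half * eff) * 2 + 1


rescale = 1.0 / 255.0
mean, std = [0.5], [0.25]
m, s = fold(mean, std, rescale)

x = 200  # a raw uint8 pixel value
folded = (x - m[0]) / s[0]                  # what the kernel computes
direct = (x * rescale - mean[0]) / std[0]   # what the processor computes
print(abs(folded - direct) < 1e-9)

print(tap_count(384, 384))                                # no scaling, bilinear
print(tap_count(1024, 384, "bicubic", antialias=True))    # antialiased downscale
```

With a 1024-pixel input, a 384-pixel output and antialiased bicubic, the support is about 5.33, which gives 13 taps per axis. That is the window width used as the running example in the `_separable.py` docstring.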

build/torch-universal/kernel_image_resize/_separable.py ADDED

	@@ -0,0 +1,280 @@

+"""Separable resize + center-crop + normalize over a ragged uint8 batch.
+WHAT "RESIZE" DOES, CONCRETELY
+Every output pixel is a weighted average of a small window of input pixels. When you shrink
+an image a lot (with antialiasing) that window gets wide — e.g. 13 input pixels across and
+13 down, so 13x13 = 169 input pixels feed one output pixel.
+FUSED vs SEPARABLE (the two backends in this package)
+- FUSED (see _fused.py): for each output pixel, read the whole 2D window directly -> 169 reads.
+- SEPARABLE (this file): do the resize as two 1D steps instead of one 2D step:
+    step 1 (horizontal): resize only the WIDTH  -> an intermediate image
+    step 2 (vertical):   resize only the HEIGHT -> the final image
+  Each step's window is 1D, so 13 + 13 = 26 reads per output pixel instead of 169. Same math,
+  far fewer reads. This is what PIL and torchvision do.
+CENTER CROP (folded in, no extra pass)
+Processors like CLIP / DINOv2 resize to a "resize size" and then keep only the centered
+crop. We do not materialize the full resized image and slice it; instead each output pixel
+of the CROP maps to a resize-image coordinate by adding the crop offset, and that maps back
+to the input. So:
+    resize is described by  (resize_height, resize_width)   -- per image
+    crop   is described by  (crop_top, crop_left)           -- per image, the centered offset
+    output size is          (crop_height, crop_width)        -- the same for every image
+When there is no crop, resize size == crop size and the offsets are 0 (the plain resize).
+The resize SCALE uses the resize size; only the output coordinate is shifted by the crop.
+uint8 input + float intermediate; each 1D step renormalizes its own weights (matches
+torchvision). Output is parity-close to torchvision, not bit-identical (torchvision keeps a
+fixed-point uint8 intermediate; ours stays in float32, which is more accurate).
+"""
+import triton
+import triton.language as tl
+from ._fused import _resample_weight
+from ._pack import fold_mean_std, pack_images
+@triton.jit
+def _horizontal_resize_kernel(
+    input_pixels,          # flat uint8 buffer, all images packed back to back
+    intermediate,          # flat float32 output: width resized + col-cropped, height untouched
+    input_offsets,         # input_offsets[image] = where that image starts in input_pixels
+    intermediate_offsets,  # same idea for the intermediate buffer
+    heights, widths,       # per-image input height / width
+    resize_widths,         # per-image width to resize to (before cropping)
+    crop_lefts,            # per-image left offset of the centered crop
+    crop_w,                # output (crop) width, same for every image
+    cubic_coeff,
+    CHANNELS: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_COL: tl.constexpr,
+):
+    """Resize width to resize_width, keep only the cropped columns: uint8 (C,H,W) -> float (C,H,crop_w)."""
+    image_index = tl.program_id(0)
+    block_index = tl.program_id(1)
+    in_height = tl.load(heights + image_index)
+    in_width = tl.load(widths + image_index)
+    resize_width = tl.load(resize_widths + image_index)
+    crop_left = tl.load(crop_lefts + image_index)
+    input_start = tl.load(input_offsets + image_index)
+    intermediate_start = tl.load(intermediate_offsets + image_index)
+    in_width_f = in_width.to(tl.float32)
+    num_pixels = in_height * crop_w  # every input row x every cropped output column
+    flat_index = block_index * BLOCK + tl.arange(0, BLOCK)
+    active = flat_index < num_pixels
+    input_row = flat_index // crop_w
+    out_col = flat_index % crop_w
+    resize_col = out_col + crop_left  # column in the (uncropped) resized image
+    filter_half = 2.0 if CUBIC else 1.0
+    col_scale = in_width_f / resize_width.to(tl.float32)
+    col_filter_scale = tl.maximum(col_scale, 1.0) if ANTIALIAS else 1.0
+    col_support = filter_half * col_filter_scale
+    col_inv_scale = 1.0 / col_filter_scale
+    src_center_col = col_scale * (resize_col.to(tl.float32) + 0.5)
+    first_tap_col = tl.floor(src_center_col - col_support + 0.5)
+    col_weight_sum = tl.zeros([BLOCK], dtype=tl.float32)
+    for tap in tl.static_range(MAX_TAPS_COL):
+        tap_col = first_tap_col + tap
+        weight = _resample_weight((tap_col - src_center_col + 0.5) * col_inv_scale, cubic_coeff, CUBIC)
+        if ANTIALIAS:
+            weight = tl.where((tap_col >= 0.0) & (tap_col < in_width_f), weight, 0.0)
+        col_weight_sum += weight
+    input_plane = (in_height * in_width).to(tl.int64)
+    intermediate_plane = (in_height * crop_w).to(tl.int64)
+    in_width_i64 = in_width.to(tl.int64)
+    crop_w_i64 = crop_w.to(tl.int64)
+    input_row_i64 = input_row.to(tl.int64)
+    for channel in tl.static_range(CHANNELS):
+        input_row_base = input_start + channel * input_plane + input_row_i64 * in_width_i64
+        accumulator = tl.zeros([BLOCK], dtype=tl.float32)
+        for tap in tl.static_range(MAX_TAPS_COL):
+            tap_col = first_tap_col + tap
+            weight = _resample_weight((tap_col - src_center_col + 0.5) * col_inv_scale, cubic_coeff, CUBIC)
+            if ANTIALIAS:
+                weight = tl.where((tap_col >= 0.0) & (tap_col < in_width_f), weight, 0.0)
+            clamped_tap_col = tl.minimum(tl.maximum(tap_col.to(tl.int32), 0), in_width - 1).to(tl.int64)
+            pixel = tl.load(input_pixels + input_row_base + clamped_tap_col, mask=active, other=0).to(tl.float32)
+            accumulator += weight * pixel
+        accumulator = accumulator / col_weight_sum
+        write_index = intermediate_start + channel * intermediate_plane + input_row_i64 * crop_w_i64 + out_col
+        tl.store(intermediate + write_index, accumulator, mask=active)
+@triton.jit
+def _vertical_resize_normalize_kernel(
+    intermediate,          # float32 from the horizontal step: (C, H, crop_w) per image
+    output,                # final (N, C, crop_h, crop_w) float32
+    intermediate_offsets,
+    heights,               # per-image input height (the intermediate still has H rows)
+    resize_heights,        # per-image height to resize to (before cropping)
+    crop_tops,             # per-image top offset of the centered crop
+    means, stds,           # per-channel normalization, rescale already folded in
+    crop_h, crop_w,
+    cubic_coeff,
+    CHANNELS: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_ROW: tl.constexpr,
+):
+    """Resize height to resize_height, keep cropped rows, normalize: float (C,H,crop_w) -> (C,crop_h,crop_w)."""
+    image_index = tl.program_id(0)
+    block_index = tl.program_id(1)
+    in_height = tl.load(heights + image_index)
+    resize_height = tl.load(resize_heights + image_index)
+    crop_top = tl.load(crop_tops + image_index)
+    intermediate_start = tl.load(intermediate_offsets + image_index)
+    in_height_f = in_height.to(tl.float32)
+    num_pixels = crop_h * crop_w
+    flat_index = block_index * BLOCK + tl.arange(0, BLOCK)
+    active = flat_index < num_pixels
+    out_row = flat_index // crop_w
+    out_col = flat_index % crop_w
+    resize_row = out_row + crop_top  # row in the (uncropped) resized image
+    filter_half = 2.0 if CUBIC else 1.0
+    row_scale = in_height_f / resize_height.to(tl.float32)
+    row_filter_scale = tl.maximum(row_scale, 1.0) if ANTIALIAS else 1.0
+    row_support = filter_half * row_filter_scale
+    row_inv_scale = 1.0 / row_filter_scale
+    src_center_row = row_scale * (resize_row.to(tl.float32) + 0.5)
+    first_tap_row = tl.floor(src_center_row - row_support + 0.5)
+    row_weight_sum = tl.zeros([BLOCK], dtype=tl.float32)
+    for tap in tl.static_range(MAX_TAPS_ROW):
+        tap_row = first_tap_row + tap
+        weight = _resample_weight((tap_row - src_center_row + 0.5) * row_inv_scale, cubic_coeff, CUBIC)
+        if ANTIALIAS:
+            weight = tl.where((tap_row >= 0.0) & (tap_row < in_height_f), weight, 0.0)
+        row_weight_sum += weight
+    intermediate_plane = (in_height * crop_w).to(tl.int64)
+    crop_w_i64 = crop_w.to(tl.int64)
+    out_col_i64 = out_col.to(tl.int64)
+    for channel in tl.static_range(CHANNELS):
+        channel_base = intermediate_start + channel * intermediate_plane
+        accumulator = tl.zeros([BLOCK], dtype=tl.float32)
+        for tap in tl.static_range(MAX_TAPS_ROW):
+            tap_row = first_tap_row + tap
+            weight = _resample_weight((tap_row - src_center_row + 0.5) * row_inv_scale, cubic_coeff, CUBIC)
+            if ANTIALIAS:
+                weight = tl.where((tap_row >= 0.0) & (tap_row < in_height_f), weight, 0.0)
+            clamped_tap_row = tl.minimum(tl.maximum(tap_row.to(tl.int32), 0), in_height - 1).to(tl.int64)
+            pixel = tl.load(intermediate + channel_base + clamped_tap_row * crop_w_i64 + out_col_i64, mask=active, other=0.0)
+            accumulator += weight * pixel
+        accumulator = accumulator / row_weight_sum
+        mean = tl.load(means + channel)
+        std = tl.load(stds + channel)
+        accumulator = (accumulator - mean) / std
+        write_index = ((image_index * CHANNELS + channel) * crop_h + out_row) * crop_w + out_col
+        tl.store(output + write_index, accumulator, mask=active)
+def _axis_max_taps(in_sizes, resize_sizes, interp, antialias):
+    """Widest 1D window over the batch for one axis = ceil(support) * 2 + 1, support uses in/resize."""
+    import math
+    interp_half = 2.0 if interp == "bicubic" else 1.0
+    worst = 0
+    for in_size, resize_size in zip(in_sizes, resize_sizes):
+        scale = in_size / resize_size
+        eff = max(scale, 1.0) if antialias else 1.0
+        worst = max(worst, math.ceil(interp_half * eff) * 2 + 1)
+    return worst
+def _run_separable(images, resize_heights, resize_widths, crop_tops, crop_lefts, crop_h, crop_w,
+                   mean, std, rescale, interp, antialias, block):
+    """Core driver: resize each image to its (resize_h, resize_w), keep the centered crop, normalize."""
+    import torch
+    device = images[0].device
+    num_images = len(images)
+    cubic_coeff = -0.5 if antialias else -0.75
+    in_heights = [int(img.shape[1]) for img in images]
+    in_widths = [int(img.shape[2]) for img in images]
+    max_taps_row = _axis_max_taps(in_heights, resize_heights, interp, antialias)
+    max_taps_col = _axis_max_taps(in_widths, resize_widths, interp, antialias)
+    means, stds = fold_mean_std(mean, std, rescale, device)
+    input_pixels, input_offsets, heights, widths, channels = pack_images(images, dtype=torch.uint8)
+    intermediate_offsets_list, cursor, tallest = [], 0, 0
+    for height in in_heights:
+        intermediate_offsets_list.append(cursor)
+        cursor += channels * height * crop_w
+        tallest = max(tallest, height)
+    intermediate = torch.empty(cursor, device=device, dtype=torch.float32)
+    intermediate_offsets = torch.tensor(intermediate_offsets_list, device=device, dtype=torch.int64)
+    resize_heights_t = torch.tensor(resize_heights, device=device, dtype=torch.int32)
+    resize_widths_t = torch.tensor(resize_widths, device=device, dtype=torch.int32)
+    crop_tops_t = torch.tensor(crop_tops, device=device, dtype=torch.int32)
+    crop_lefts_t = torch.tensor(crop_lefts, device=device, dtype=torch.int32)
+    horizontal_grid = (num_images, triton.cdiv(tallest * crop_w, block))
+    _horizontal_resize_kernel[horizontal_grid](
+        input_pixels, intermediate, input_offsets, intermediate_offsets, heights, widths,
+        resize_widths_t, crop_lefts_t, crop_w, cubic_coeff,
+        CHANNELS=channels, BLOCK=block, CUBIC=(interp == "bicubic"), ANTIALIAS=antialias,
+        MAX_TAPS_COL=max_taps_col,
+    )
+    output = torch.empty((num_images, channels, crop_h, crop_w), device=device, dtype=torch.float32)
+    vertical_grid = (num_images, triton.cdiv(crop_h * crop_w, block))
+    _vertical_resize_normalize_kernel[vertical_grid](
+        intermediate, output, intermediate_offsets, heights, resize_heights_t, crop_tops_t, means, stds,
+        crop_h, crop_w, cubic_coeff,
+        CHANNELS=channels, BLOCK=block, CUBIC=(interp == "bicubic"), ANTIALIAS=antialias,
+        MAX_TAPS_ROW=max_taps_row,
+    )
+    return output
+def _aspect_preserving_size(in_h, in_w, shortest_edge):
+    """transformers shortest-edge rule: short side -> shortest_edge, long side truncated (int(), not round)."""
+    if in_h <= in_w:
+        return shortest_edge, int(in_w * shortest_edge / in_h)
+    return int(in_h * shortest_edge / in_w), shortest_edge
+def separable_resize_normalize(images, out_h, out_w, mean, std, rescale, interp, antialias, block: int = 256):
+    """Resize to (out_h, out_w) and normalize (no crop)."""
+    images = list(images)
+    n = len(images)
+    return _run_separable(images, [out_h] * n, [out_w] * n, [0] * n, [0] * n, out_h, out_w,
+                          mean, std, rescale, interp, antialias, block)
+def separable_resize_crop_normalize(images, resize_size, crop_size, mean, std, rescale, interp, antialias,
+                                    resize_mode="square", block: int = 256):
+    """Resize then center-crop then normalize.
+    resize_mode="square": resize_size is (resize_h, resize_w) applied to every image.
+    resize_mode="shortest_edge": resize_size is an int; each image is resized aspect-preserving
+    so its short side equals it, then center-cropped to crop_size.
+    """
+    images = list(images)
+    crop_h, crop_w = crop_size
+    resize_heights, resize_widths = [], []
+    for img in images:
+        in_h, in_w = int(img.shape[1]), int(img.shape[2])
+        if resize_mode == "shortest_edge":
+            rh, rw = _aspect_preserving_size(in_h, in_w, int(resize_size))
+        elif resize_mode == "square":
+            rh, rw = int(resize_size[0]), int(resize_size[1])
+        else:
+            raise ValueError(f"resize_mode must be 'square' or 'shortest_edge', got {resize_mode!r}")
+        if rh < crop_h or rw < crop_w:
+            raise ValueError(f"resize size ({rh},{rw}) smaller than crop ({crop_h},{crop_w})")
+        resize_heights.append(rh)
+        resize_widths.append(rw)
+    crop_tops = [(rh - crop_h) // 2 for rh in resize_heights]
+    crop_lefts = [(rw - crop_w) // 2 for rw in resize_widths]
+    return _run_separable(images, resize_heights, resize_widths, crop_tops, crop_lefts, crop_h, crop_w,
+                          mean, std, rescale, interp, antialias, block)
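
The size bookkeeping in `separable_resize_crop_normalize` can be checked on a CPU without Triton. The sketch below recomputes the per-image resize size and the centered crop offsets in plain Python. `resize_and_crop_geometry` is a hypothetical helper name, not part of the package; its arithmetic is assumed to mirror `_aspect_preserving_size` and the `crop_tops` / `crop_lefts` lines.

```python
def resize_and_crop_geometry(in_h, in_w, shortest_edge, crop_h, crop_w):
    """Hypothetical helper: return (resize_h, resize_w, crop_top, crop_left)."""
    # Short side becomes `shortest_edge`; the long side is truncated with int(), not rounded.
    if in_h <= in_w:
        resize_h, resize_w = shortest_edge, int(in_w * shortest_edge / in_h)
    else:
        resize_h, resize_w = int(in_h * shortest_edge / in_w), shortest_edge
    if resize_h < crop_h or resize_w < crop_w:
        raise ValueError("resize size smaller than crop")
    # Centered crop: split the leftover rows/columns evenly, flooring the offset.
    return resize_h, resize_w, (resize_h - crop_h) // 2, (resize_w - crop_w) // 2


portrait = resize_and_crop_geometry(640, 480, shortest_edge=256, crop_h=224, crop_w=224)
panorama = resize_and_crop_geometry(384, 1024, shortest_edge=256, crop_h=224, crop_w=224)
print(portrait, panorama)
```

For a 640x480 input the resize size is 341x256 with a crop offset of (58, 16); only the offsets differ per image, while the output stays 224x224.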

example.py ADDED

	@@ -0,0 +1,32 @@

+# /// script
+# requires-python = ">=3.10"
+# dependencies = [
+#     "torch",
+#     "triton",
+#     "kernels",
+# ]
+# ///
+"""Minimal smoke test of the published kernel via get_kernel (run on a CUDA box)."""
+import torch
+from kernels import get_kernel
+kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)
+device = "cuda" if torch.cuda.is_available() else "cpu"
+images = [
+    torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device=device)
+    for h, w in [(640, 480), (800, 600), (384, 1024)]
+]
+pixel_values = kir.resize_normalize(
+    images,
+    size=384,
+    image_mean=[0.5, 0.5, 0.5],
+    image_std=[0.5, 0.5, 0.5],
+    rescale_factor=1 / 255,
+    resample="bicubic",
+    antialias=True,
+)
+print(f"{len(images)} ragged images -> {tuple(pixel_values.shape)} {pixel_values.dtype}")

example_transformers.py ADDED

	@@ -0,0 +1,85 @@

+# /// script
+# requires-python = ">=3.10"
+# dependencies = ["torch", "triton", "kernels", "transformers", "torchvision"]
+# ///
+"""Drop-in: use the kernel as the resize+normalize stage of a transformers fast processor.
+There is no `use_kernels=True` hook for image processors (that machinery swaps nn.Module
+layer forwards inside the model, not processor code). So the usable path is to read the
+processor's config and call the kernel directly. `preprocess_with_kernel` below is the whole
+adapter — copy it into your code.
+Run on a CUDA box:
+    python example_transformers.py
+"""
+import torch
+from kernels import get_kernel
+from transformers import AutoImageProcessor, AutoModel
+kernel_image_resize = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)
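+# PIL resample codes: 0 = NEAREST, 2 = BILINEAR, 3 = BICUBIC. The kernel has no nearest
+# mode, so code 0 falls back to bilinear and will not match a nearest-neighbour processor.
+_PIL_RESAMPLE = {0: "bilinear", 2: "bilinear", 3: "bicubic"}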
+def preprocess_with_kernel(processor, images):
+    """Run the kernel using `processor`'s own config; returns pixel_values like processor(images).
+    Handles fixed-size resize, square-resize + center-crop, and shortest-edge resize + center-crop
+    (CLIP / DINOv2). Does not handle padding processors.
+    """
+    size = processor.size
+    if getattr(processor, "do_pad", False):
+        raise ValueError("kernel does not pad; this processor needs a pad step")
+    if not getattr(processor, "do_normalize", True):
+        raise ValueError("processor does not normalize (rescale only); kernel always normalizes")
+    if getattr(processor, "do_flip_channel_order", False):
+        raise ValueError("processor flips channels to BGR; kernel keeps RGB")
+    resample = _PIL_RESAMPLE[int(processor.resample)]
+    antialias = bool(getattr(processor, "antialias", True))
+    rescale = float(processor.rescale_factor) if getattr(processor, "do_rescale", True) else 1.0
+    mean, std = processor.image_mean, processor.image_std
+    crop = processor.crop_size if getattr(processor, "do_center_crop", False) else None
+    common = dict(rescale_factor=rescale, resample=resample, antialias=antialias)
+    if "shortest_edge" in size:
+        if crop is None:
+            raise ValueError("shortest-edge resize without a crop gives variable-size output")
+        return kernel_image_resize.resize_normalize(
+            images, size["shortest_edge"], mean, std,
+            crop_size=(crop["height"], crop["width"]), resize_mode="shortest_edge", **common,
+        )
+    if crop is not None and (crop["height"] != size["height"] or crop["width"] != size["width"]):
+        return kernel_image_resize.resize_normalize(
+            images, (size["height"], size["width"]), mean, std,
+            crop_size=(crop["height"], crop["width"]), resize_mode="square", **common,
+        )
+    return kernel_image_resize.resize_normalize(images, (size["height"], size["width"]), mean, std, **common)
+def main():
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model_id = "google/siglip2-base-patch16-224"
+    processor = AutoImageProcessor.from_pretrained(model_id, backend="torchvision")
+    model = AutoModel.from_pretrained(model_id).to(device).eval()
+    images = [
+        torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device=device)
+        for h, w in [(640, 480), (800, 600), (384, 1024)]
+    ]
+    pixel_values = preprocess_with_kernel(processor, images)
+    print(f"{len(images)} ragged images -> pixel_values {tuple(pixel_values.shape)} {pixel_values.dtype}")
+    with torch.no_grad():
+        features = model.vision_model(pixel_values=pixel_values.to(model.dtype)).pooler_output
+    print(f"vision features: {tuple(features.shape)}")
+    # parity vs the real processor (float-vs-uint8 resize -> small, expected gap)
+    reference = processor(images, return_tensors="pt", device=device)["pixel_values"].to(device)
+    print(f"max|Δ| pixel_values vs processor: {(pixel_values - reference).abs().max().item():.2e}")
+if __name__ == "__main__":
+    main()
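
The branching in `preprocess_with_kernel` depends only on the processor's `size` and `crop_size` dicts, so it can be traced without loading a model. The following is a minimal sketch of that decision; `pick_kernel_path` and the returned labels are hypothetical and exist only for illustration.

```python
def pick_kernel_path(size, crop):
    """Hypothetical helper: name the branch preprocess_with_kernel would take."""
    if "shortest_edge" in size:
        if crop is None:
            # Matches the ValueError raised by the adapter for crop-less shortest-edge configs.
            return "unsupported: variable-size output"
        return "shortest_edge resize + center crop"
    if crop is not None and (crop["height"] != size["height"] or crop["width"] != size["width"]):
        return "square resize + center crop"
    # Either no crop, or a crop identical to the resize size (a no-op).
    return "fixed-size resize"


clip_like = pick_kernel_path({"shortest_edge": 224}, {"height": 224, "width": 224})
vit_like = pick_kernel_path({"height": 224, "width": 224}, None)
print(clip_like, "|", vit_like)
```

The "unsupported" branch corresponds to the `SKIP: shortest_edge without crop` rows in `resultcompat`.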

publish.sh ADDED

	@@ -0,0 +1,44 @@

+#!/usr/bin/env bash
+# Publish the universal kernel to the Hub as a `kernel` repo type.
+#
+# IMPORTANT: kernel repos are repo_type="kernel" (served under /api/kernels/), NOT model repos.
+# get_kernel() queries repo_type="kernel", so uploading with --repo-type model gives a 404.
+#
+# A universal kernel needs no compilation: the build is just a copy of the source package into the
+# variant directory get_kernel resolves (build/torch-universal/<name>/).
+set -euo pipefail
+REPO_ID="${1:-Molbap/kernel_image_resize}"
+NAME="kernel_image_resize"
+HERE="$(cd "$(dirname "$0")" && pwd)"
+rm -rf "$HERE/build"
+mkdir -p "$HERE/build/torch-universal"
+cp -r "$HERE/torch-ext/$NAME" "$HERE/build/torch-universal/$NAME"
+find "$HERE/build" -name __pycache__ -type d -exec rm -rf {} + 2>/dev/null || true
+echo "built build/torch-universal/$NAME"
+# Create the kernel repo and upload. Uses the Python API because it reliably accepts
+# repo_type="kernel" (the hf CLI's repo-type choices can be stricter).
+python - "$REPO_ID" "$HERE" <<'PY'
+import sys
+# Some huggingface_hub versions know the `kernel` repo type (constant + create_repo) but still
+# validate upload_folder against the old REPO_TYPES list. Register it so the upload validator passes.
+import huggingface_hub.constants as hfc
+if "kernel" not in hfc.REPO_TYPES:
+    hfc.REPO_TYPES = list(hfc.REPO_TYPES) + ["kernel"]
+from huggingface_hub import create_repo, upload_folder
+repo_id, folder = sys.argv[1], sys.argv[2]
+create_repo(repo_id, repo_type="kernel", exist_ok=True)
+upload_folder(
+    repo_id=repo_id,
+    repo_type="kernel",
+    folder_path=folder,
+    ignore_patterns=["__pycache__/*", "*.pyc", "result", ".git/*", "build/torch-universal/*/__pycache__/*"],
+)
+print(f"uploaded {folder} -> {repo_id} (repo_type=kernel)")
+PY

resultcompat ADDED

	@@ -0,0 +1,16 @@

+model                                          verdict    pixel max|Δ|   feature max|Δ| (rel)
+openai/clip-vit-base-patch32                   OK         7.53e-03      1.53e-02 (0.2%)
+google/vit-base-patch16-224                    OK         3.93e-03      6.87e-03 (0.8%)
+apple/mobilevit-small                          SKIP: no normalize (rescale only)
+facebook/dinov2-small                          OK         1.41e-01      1.91e-02 (0.2%)
+google/siglip-so400m-patch14-384               OK         1.58e-01      9.28e-03 (0.1%)
+facebook/dinov3-vitb16-pretrain-lvd1689m       OK         4.99e-05      8.08e-05 (0.0%)
+microsoft/swinv2-tiny-patch4-window16-256      OK         8.75e-03      1.27e-02 (0.6%)
+google/siglip2-base-patch16-224                OK         3.93e-03      1.31e-02 (0.2%)
+microsoft/resnet-50                            SKIP: shortest_edge without crop (variable output)
+nvidia/segformer-b0-finetuned-ade-512-512      OK         8.75e-03      4.33e-02 (0.4%)
+facebook/convnextv2-tiny-22k-384               SKIP: shortest_edge without crop (variable output)
+google/mobilenet_v2_1.0_224                    OK         3.92e-03      3.90e-02 (0.7%)
+facebook/convnext-tiny-224                     SKIP: shortest_edge without crop (variable output)
+google/efficientnet-b0                         SKIP: resample 0
+microsoft/beit-base-patch16-224-pt22k-ft22k    OK         3.93e-03      3.19e-02 (0.9%)

tests/test_resize_normalize.py ADDED

	@@ -0,0 +1,112 @@

+"""Parity tests vs torchvision for both backends, all interp×antialias combos, ragged inputs.
+Run locally from the repo root with the package on the path:
+    PYTHONPATH=torch-ext pytest tests/ -q
+CUDA is required (Triton); tests skip on CPU.
+"""
+import pytest
+import torch
+import torchvision.transforms.v2.functional as tvF
+from torchvision.transforms import InterpolationMode
+from kernel_image_resize import resize_normalize
+_TV_INTERP = {"bilinear": InterpolationMode.BILINEAR, "bicubic": InterpolationMode.BICUBIC}
+MEAN = [0.48145466, 0.4578275, 0.40821073]
+STD = [0.26862954, 0.26130258, 0.27577711]
+RESCALE = 1.0 / 255.0
+def _ragged_images(n, device, min_res=384, max_res=1024, seed=0):
+    g = torch.Generator(device="cpu").manual_seed(seed)
+    images = []
+    for _ in range(n):
+        h = int(torch.randint(min_res, max_res + 1, (1,), generator=g).item())
+        w = int(torch.randint(min_res, max_res + 1, (1,), generator=g).item())
+        images.append(torch.randint(0, 256, (3, h, w), generator=g, dtype=torch.uint8).to(device))
+    return images
+def _torchvision_reference(images, out_h, out_w, interp, antialias):
+    mode = _TV_INTERP[interp]
+    mean = torch.tensor(MEAN, device=images[0].device).view(3, 1, 1)
+    std = torch.tensor(STD, device=images[0].device).view(3, 1, 1)
+    outs = []
+    for img in images:
+        r = tvF.resize(img.float(), [out_h, out_w], interpolation=mode, antialias=antialias)
+        outs.append((r * RESCALE - mean) / std)
+    return torch.stack(outs)
+@pytest.mark.kernels_ci
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="Triton kernel needs CUDA")
+@pytest.mark.parametrize("backend", ["fused", "separable"])
+@pytest.mark.parametrize("interp,antialias", [("bilinear", False), ("bilinear", True), ("bicubic", False), ("bicubic", True)])
+def test_parity_vs_torchvision(backend, interp, antialias):
+    device = torch.device("cuda")
+    images = _ragged_images(8, device)
+    out_h = out_w = 384
+    got = resize_normalize(
+        images, (out_h, out_w), MEAN, STD, RESCALE, resample=interp, antialias=antialias, backend=backend
+    )
+    ref = _torchvision_reference(images, out_h, out_w, interp, antialias)
+    max_abs = (got - ref).abs().max().item()
+    assert max_abs < 3e-3, f"{backend}/{interp}/aa={antialias}: max|Δ|={max_abs:.2e}"
+@pytest.mark.kernels_ci
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="Triton kernel needs CUDA")
+def test_stacked_tensor_input():
+    device = torch.device("cuda")
+    images = torch.randint(0, 256, (4, 3, 512, 512), dtype=torch.uint8, device=device)
+    got = resize_normalize(images, 224, MEAN, STD, RESCALE, resample="bicubic", antialias=True)
+    assert got.shape == (4, 3, 224, 224)
+def _shortest_edge_crop_reference(images, shortest_edge, crop, interp, antialias):
+    mode = _TV_INTERP[interp]
+    mean = torch.tensor(MEAN, device=images[0].device).view(3, 1, 1)
+    std = torch.tensor(STD, device=images[0].device).view(3, 1, 1)
+    outs = []
+    for img in images:
+        in_h, in_w = img.shape[1], img.shape[2]
+        if in_h <= in_w:
+            rh, rw = shortest_edge, int(in_w * shortest_edge / in_h)
+        else:
+            rh, rw = int(in_h * shortest_edge / in_w), shortest_edge
+        r = tvF.resize(img.float(), [rh, rw], interpolation=mode, antialias=antialias)
+        r = tvF.center_crop(r, [crop, crop])
+        outs.append((r * RESCALE - mean) / std)
+    return torch.stack(outs)
+@pytest.mark.kernels_ci
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="Triton kernel needs CUDA")
+@pytest.mark.parametrize("interp,antialias", [("bilinear", True), ("bicubic", True)])
+def test_shortest_edge_crop_parity(interp, antialias):
+    device = torch.device("cuda")
+    images = _ragged_images(8, device)
+    shortest_edge, crop = 256, 224
+    got = resize_normalize(
+        images, shortest_edge, MEAN, STD, RESCALE, resample=interp, antialias=antialias,
+        crop_size=(crop, crop), resize_mode="shortest_edge",
+    )
+    ref = _shortest_edge_crop_reference(images, shortest_edge, crop, interp, antialias)
+    assert got.shape == (8, 3, crop, crop)
+    max_abs = (got - ref).abs().max().item()
+    assert max_abs < 3e-3, f"shortest_edge+crop {interp}/aa={antialias}: max|Δ|={max_abs:.2e}"
+@pytest.mark.kernels_ci
+@pytest.mark.skipif(not torch.cuda.is_available(), reason="Triton kernel needs CUDA")
+def test_fused_matches_separable():
+    device = torch.device("cuda")
+    images = _ragged_images(6, device)
+    common = dict(size=(256, 256), image_mean=MEAN, image_std=STD, rescale_factor=RESCALE, resample="bicubic", antialias=True)
+    fused = resize_normalize(images, backend="fused", **common)
+    separable = resize_normalize(images, backend="separable", **common)
+    max_abs = (fused - separable).abs().max().item()
+    assert max_abs < 3e-3, f"fused vs separable: max|Δ|={max_abs:.2e}"
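
The tests above all need CUDA. Some properties of the filters can still be checked on a CPU with a pure-Python port. The sketch below is such a port of the cubic branch of `_resample_weight` (the function name `cubic_weight` is an assumption for illustration). It checks that the filter is 1 at the center, 0 at integer offsets, and that the four taps around any fractional position sum to 1 for both coefficients the package uses.

```python
def cubic_weight(x, a):
    """Pure-Python port of the cubic branch of _resample_weight (illustrative)."""
    ax = abs(x)
    if ax <= 1.0:
        return (a + 2.0) * ax**3 - (a + 3.0) * ax**2 + 1.0
    if ax < 2.0:
        return a * ax**3 - 5.0 * a * ax**2 + 8.0 * a * ax - 4.0 * a
    return 0.0


def four_tap_sum(frac, a):
    """Sum of the four cubic taps around a sample sitting `frac` past a pixel center."""
    return sum(cubic_weight(frac - k, a) for k in (-1, 0, 1, 2))


for a in (-0.5, -0.75):
    assert cubic_weight(0.0, a) == 1.0
    assert abs(cubic_weight(1.0, a)) < 1e-12
    for frac in (0.0, 0.25, 0.5, 0.9):
        assert abs(four_tap_sum(frac, a) - 1.0) < 1e-12
print("cubic filter checks passed")
```

Because the unscaled taps already sum to 1, the renormalization in the kernels only changes the result when the window is widened by antialiasing or clipped at an image border.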

torch-ext/kernel_image_resize/__init__.py ADDED

	@@ -0,0 +1,113 @@

+"""Resize + rescale + normalize for transformers fast image processors, as a Triton kernel.
+resize -> fold(rescale, normalize) in one GPU pipeline: CHW uint8 images in,
+(N, C, out_h, out_w) normalized float out, no full-resolution float intermediate.
+- resize_normalize        — stacked (N, C, H, W) tensor or a list of CHW images.
+- resize_normalize_ragged — same kernels; takes a list of different-H/W CHW tensors.
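+backend="separable" (default): two 1D passes over uint8 input, taps+taps reads per output pixel.
+backend="fused": single 2D launch, taps*taps reads. Both target parity <=1e-4 vs torchvision-float
+(the test suite asserts max|Δ| < 3e-3 on the normalized output).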
+    from kernels import get_kernel
+    kir = get_kernel("Molbap/kernel_image_resize")
+    pixel_values = kir.resize_normalize(
+        images, size=384, image_mean=[...], image_std=[...], resample="bicubic", antialias=True,
+    )
+"""
+from ._fused import fused_resize_normalize
+from ._pack import PIL_RESAMPLE_TO_INTERP, as_image_list
+from ._separable import separable_resize_crop_normalize, separable_resize_normalize
+def _normalize_size(size) -> tuple[int, int]:
+    if isinstance(size, int):
+        return size, size
+    if isinstance(size, dict):
+        if "height" in size and "width" in size:
+            return int(size["height"]), int(size["width"])
+        raise ValueError(f"size dict must hold 'height'/'width' for a fixed resize, got {size}")
+    out_h, out_w = size
+    return int(out_h), int(out_w)
+def _normalize_resample(resample) -> str:
+    if isinstance(resample, str):
+        if resample not in ("bilinear", "bicubic"):
+            raise ValueError(f"resample must be 'bilinear' or 'bicubic', got {resample!r}")
+        return resample
+    interp = PIL_RESAMPLE_TO_INTERP.get(int(resample))
+    if interp is None:
+        raise ValueError(f"unsupported PIL resample code {resample}")
+    return interp
+def resize_normalize(
+    images,
+    size,
+    image_mean,
+    image_std,
+    rescale_factor: float = 1.0 / 255.0,
+    resample="bilinear",
+    antialias: bool = False,
+    backend: str = "separable",
+    block: int = 256,
+    crop_size=None,
+    resize_mode: str = "square",
+):
+    """Resize, optionally center-crop, rescale and normalize — one GPU pipeline.
+    Args:
+        images: a stacked `(N, C, H, W)` uint8/float tensor, or a list of CHW tensors (ragged).
+        size: resize target. With no crop: int (square), `(height, width)`, or `{"height","width"}`.
+            With `resize_mode="shortest_edge"`: an int, the short side after aspect-preserving resize.
+        image_mean, image_std: per-channel normalization stats (length C).
+        rescale_factor: folded into mean/std so the kernel does `(x*rescale - mean)/std`.
+        resample: "bilinear" / "bicubic", or a PIL resample int (0/2 -> bilinear, 3 -> bicubic).
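+        antialias: widen the filter when downscaling. Defaults to `False`; pass `True` to match
+            the ViT/CLIP/SigLIP processors.
+        backend: "separable" (default) or "fused" (2D reference, no crop support). Ignored when
+            `crop_size` is set or `resize_mode="shortest_edge"`: those always use the separable path.
+        crop_size: `None` (no crop), int (square), or `(crop_h, crop_w)`. Center crop after resize.
+        resize_mode: "square" (resize to `size`) or "shortest_edge" (aspect-preserving, then center
+            crop to `crop_size`, or to a `size` x `size` square when `crop_size` is `None`).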
+    """
+    interp = _normalize_resample(resample)
+    image_list = as_image_list(images)
+    if crop_size is not None or resize_mode == "shortest_edge":
+        crop_h, crop_w = _normalize_size(crop_size if crop_size is not None else size)
+        resize_arg = int(size) if resize_mode == "shortest_edge" else _normalize_size(size)
+        return separable_resize_crop_normalize(
+            image_list, resize_arg, (crop_h, crop_w), image_mean, image_std, rescale_factor,
+            interp, antialias, resize_mode, block,
+        )
+    out_h, out_w = _normalize_size(size)
+    if backend == "fused":
+        return fused_resize_normalize(image_list, out_h, out_w, image_mean, image_std, rescale_factor, interp, antialias, block)
+    if backend == "separable":
+        return separable_resize_normalize(image_list, out_h, out_w, image_mean, image_std, rescale_factor, interp, antialias, block)
+    raise ValueError(f"backend must be 'fused' or 'separable', got {backend!r}")
+def resize_normalize_ragged(
+    images,
+    size,
+    image_mean,
+    image_std,
+    rescale_factor: float = 1.0 / 255.0,
+    resample="bilinear",
+    antialias: bool = False,
+    backend: str = "separable",
+    block: int = 256,
+):
+    """Variant taking a list of different-H/W CHW tensors. Same kernels as `resize_normalize`."""
+    if isinstance(images, list):
+        image_list = images
+    else:
+        raise ValueError("resize_normalize_ragged expects a list of CHW tensors; use resize_normalize for a stacked tensor")
+    return resize_normalize(
+        image_list, size, image_mean, image_std, rescale_factor, resample, antialias, backend, block
+    )
+__all__ = ["resize_normalize", "resize_normalize_ragged"]
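
The "folded into mean/std" wording in the `rescale_factor` docstring is a small piece of algebra: dividing both the mean and the std by the rescale factor lets the kernel skip the multiply. A minimal pure-Python sketch of that identity follows; `fold` is an illustrative stand-in for `fold_mean_std` that works on lists instead of tensors.

```python
def fold(mean, std, rescale):
    """Illustrative list-based stand-in for fold_mean_std."""
    return [m / rescale for m in mean], [s / rescale for s in std]


mean, std, rescale = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], 1 / 255
folded_mean, folded_std = fold(mean, std, rescale)

pixel = 200.0  # a raw uint8 value, as the kernel sees it
reference = (pixel * rescale - mean[0]) / std[0]   # what the processor computes
folded = (pixel - folded_mean[0]) / folded_std[0]  # what the kernel computes
print(reference, folded)
```

The two values agree up to floating-point rounding, since (x - mean/r) / (std/r) = (x*r - mean) / std.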

torch-ext/kernel_image_resize/_fused.py ADDED

	@@ -0,0 +1,134 @@

+"""Fused 2D resize+rescale+normalize over a ragged batch, single launch.
+One program owns one image and a BLOCK of its output pixels, gathers a
+MAX_TAPS_H × MAX_TAPS_W window, applies the separable weights as a 2D product, then folds
+rescale+normalize. taps×taps loads per output pixel.
+Resampling-weight formula (PyTorch aten UpSampleKernel):
+    scale   = in / out
+    eff     = scale if antialias and scale > 1 else 1
+    support = interp_half * eff                              # interp_half: 1 linear, 2 cubic
+    center  = scale * (i + 0.5)
+    weight  = filter((tap - center + 0.5) / eff), renormalized over the realized window
+"""
+import triton
+import triton.language as tl
+from ._pack import fold_mean_std, max_taps, pack_images
+@triton.jit
+def _resample_weight(arg, cubic_a, CUBIC: tl.constexpr):
+    """Interpolation filter at `arg` (tap distance from the center, already divided by the antialias scale)."""
+    ax = tl.abs(arg)
+    if CUBIC:  # Keys cubic convolution kernel, support 2
+        ax2 = ax * ax
+        ax3 = ax2 * ax
+        inner = (cubic_a + 2.0) * ax3 - (cubic_a + 3.0) * ax2 + 1.0  # |x| <= 1
+        outer = cubic_a * ax3 - 5.0 * cubic_a * ax2 + 8.0 * cubic_a * ax - 4.0 * cubic_a  # 1 < |x| < 2
+        return tl.where(ax <= 1.0, inner, tl.where(ax < 2.0, outer, 0.0))
+    return tl.maximum(1.0 - ax, 0.0)  # triangle (bilinear), support 1
+@triton.jit
+def _resize_normalize_kernel(
+    in_ptr, out_ptr, offsets_ptr, heights_ptr, widths_ptr, mean_ptr, std_ptr,
+    out_h, out_w, cubic_a,
+    C: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_H: tl.constexpr, MAX_TAPS_W: tl.constexpr,
+):
+    n = tl.program_id(0)
+    blk = tl.program_id(1)
+    H = tl.load(heights_ptr + n)
+    W = tl.load(widths_ptr + n)
+    off = tl.load(offsets_ptr + n)
+    Hf = H.to(tl.float32)
+    Wf = W.to(tl.float32)
+    npix = out_h * out_w
+    pos = blk * BLOCK + tl.arange(0, BLOCK)
+    mask = pos < npix
+    oy = pos // out_w
+    ox = pos % out_w
+    interp_half = 2.0 if CUBIC else 1.0
+    scale_h = Hf / out_h
+    scale_w = Wf / out_w
+    eff_h = tl.maximum(scale_h, 1.0) if ANTIALIAS else 1.0
+    eff_w = tl.maximum(scale_w, 1.0) if ANTIALIAS else 1.0
+    support_h = interp_half * eff_h
+    support_w = interp_half * eff_w
+    inv_h = 1.0 / eff_h
+    inv_w = 1.0 / eff_w
+    center_y = scale_h * (oy.to(tl.float32) + 0.5)
+    center_x = scale_w * (ox.to(tl.float32) + 0.5)
+    ystart = tl.floor(center_y - support_h + 0.5)
+    xstart = tl.floor(center_x - support_w + 0.5)
+    sum_wy = tl.zeros([BLOCK], dtype=tl.float32)
+    for ty in tl.static_range(MAX_TAPS_H):
+        yy = ystart + ty
+        wy = _resample_weight((yy - center_y + 0.5) * inv_h, cubic_a, CUBIC)
+        if ANTIALIAS:
+            wy = tl.where((yy >= 0.0) & (yy < Hf), wy, 0.0)
+        sum_wy += wy
+    sum_wx = tl.zeros([BLOCK], dtype=tl.float32)
+    for tx in tl.static_range(MAX_TAPS_W):
+        xx = xstart + tx
+        wx = _resample_weight((xx - center_x + 0.5) * inv_w, cubic_a, CUBIC)
+        if ANTIALIAS:
+            wx = tl.where((xx >= 0.0) & (xx < Wf), wx, 0.0)
+        sum_wx += wx
+    denom = sum_wy * sum_wx
+    plane = (H * W).to(tl.int64)
+    Wl = W.to(tl.int64)
+    for c in tl.static_range(C):
+        base = off + c * plane
+        acc = tl.zeros([BLOCK], dtype=tl.float32)
+        for ty in tl.static_range(MAX_TAPS_H):
+            yy = ystart + ty
+            wy = _resample_weight((yy - center_y + 0.5) * inv_h, cubic_a, CUBIC)
+            if ANTIALIAS:
+                wy = tl.where((yy >= 0.0) & (yy < Hf), wy, 0.0)
+            yidx = tl.minimum(tl.maximum(yy.to(tl.int32), 0), H - 1).to(tl.int64)
+            row = base + yidx * Wl
+            for tx in tl.static_range(MAX_TAPS_W):
+                xx = xstart + tx
+                wx = _resample_weight((xx - center_x + 0.5) * inv_w, cubic_a, CUBIC)
+                if ANTIALIAS:
+                    wx = tl.where((xx >= 0.0) & (xx < Wf), wx, 0.0)
+                xidx = tl.minimum(tl.maximum(xx.to(tl.int32), 0), W - 1).to(tl.int64)
+                pix = tl.load(in_ptr + row + xidx, mask=mask, other=0.0)
+                acc += wy * wx * pix
+        acc = acc / denom
+        m = tl.load(mean_ptr + c)
+        s = tl.load(std_ptr + c)
+        acc = (acc - m) / s
+        oidx = ((n * C + c) * out_h + oy) * out_w + ox
+        tl.store(out_ptr + oidx, acc, mask=mask)
+def fused_resize_normalize(images, out_h, out_w, mean, std, rescale, interp, antialias, block: int = 256):
+    """Single fused launch over a ragged packed buffer -> (N, C, out_h, out_w) normalized float."""
+    import torch
+    images = list(images)
+    device = images[0].device
+    n = len(images)
+    cubic_a = -0.5 if antialias else -0.75  # -0.5 (PIL's coefficient) under antialias, -0.75 (PyTorch's bicubic default) otherwise
+    max_taps_h = max_taps(images, out_h, 1, interp, antialias)
+    max_taps_w = max_taps(images, out_w, 2, interp, antialias)
+    mean_t, std_t = fold_mean_std(mean, std, rescale, device)
+    in_buf, offsets_t, heights_t, widths_t, c = pack_images(images)
+    out = torch.empty((n, c, out_h, out_w), device=device, dtype=torch.float32)
+    grid = (n, triton.cdiv(out_h * out_w, block))
+    _resize_normalize_kernel[grid](
+        in_buf, out, offsets_t, heights_t, widths_t, mean_t, std_t,
+        out_h, out_w, cubic_a, C=c, BLOCK=block,
+        CUBIC=(interp == "bicubic"), ANTIALIAS=antialias, MAX_TAPS_H=max_taps_h, MAX_TAPS_W=max_taps_w,
+    )
+    return out
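
The resampling-weight formula in the module docstring can be followed by hand along one axis. Below is a minimal CPU sketch for the bilinear case only, written in plain Python. `resize_1d` and `triangle` are illustrative names, and the border handling (drop out-of-range taps under antialias, otherwise clamp the index) is an assumption read off the kernel code rather than a statement about torchvision.

```python
import math


def triangle(x):
    """Bilinear filter: a triangle with support 1."""
    return max(1.0 - abs(x), 0.0)


def resize_1d(signal, out_size, antialias):
    """Bilinear 1D resize following the docstring formula (illustrative, CPU only)."""
    in_size = len(signal)
    scale = in_size / out_size
    eff = max(scale, 1.0) if antialias else 1.0
    support = 1.0 * eff  # interp_half is 1 for bilinear
    taps = math.ceil(support) * 2 + 1
    out = []
    for i in range(out_size):
        center = scale * (i + 0.5)
        first_tap = math.floor(center - support + 0.5)
        total, weight_sum = 0.0, 0.0
        for t in range(taps):
            tap = first_tap + t
            weight = triangle((tap - center + 0.5) / eff)
            if antialias and not (0 <= tap < in_size):
                weight = 0.0  # out-of-range taps are dropped under antialias
            # Clamp the index so out-of-range taps read the edge pixel (only matters without antialias).
            clamped = min(max(tap, 0), in_size - 1)
            total += weight * signal[clamped]
            weight_sum += weight
        out.append(total / weight_sum)  # renormalize over the realized window
    return out


halved = resize_1d([0.0, 1.0, 2.0, 3.0], 2, antialias=True)
print(halved)
```

For the first output pixel the in-range weights are 0.75, 0.75 and 0.25 (sum 1.75), so the value is 1.25 / 1.75. The separable backend applies this kind of 1D step twice, once per axis.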

torch-ext/kernel_image_resize/_pack.py ADDED

	@@ -0,0 +1,62 @@

+"""Ragged packing + resampling helpers shared by the fused and separable backends."""
+import math
+import torch
+PIL_RESAMPLE_TO_INTERP = {0: "bilinear", 2: "bilinear", 3: "bicubic"}
+def pack_images(
+    images: list[torch.Tensor], dtype: torch.dtype = torch.float32
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, int]:
+    """Concatenate a ragged list of CHW images into one flat buffer of `dtype`.
+    Returns (in_buf, offsets, heights, widths, channels); offsets[n] is the element index
+    where image n starts.
+    """
+    device = images[0].device
+    channels = images[0].shape[0]
+    flats, offsets, heights, widths, cur = [], [], [], [], 0
+    for img in images:
+        ic, ih, iw = img.shape
+        if ic != channels:
+            raise ValueError(f"all images must share channel count {channels}, got {ic}")
+        flats.append(img.reshape(-1).to(dtype))
+        offsets.append(cur)
+        heights.append(ih)
+        widths.append(iw)
+        cur += ic * ih * iw
+    in_buf = torch.cat(flats)
+    offsets_t = torch.tensor(offsets, device=device, dtype=torch.int64)
+    heights_t = torch.tensor(heights, device=device, dtype=torch.int32)
+    widths_t = torch.tensor(widths, device=device, dtype=torch.int32)
+    return in_buf, offsets_t, heights_t, widths_t, channels
+def fold_mean_std(mean, std, rescale: float, device) -> tuple[torch.Tensor, torch.Tensor]:
+    """Fold rescale into mean/std so the kernel does (x - m)/s == (x*rescale - mean)/std."""
+    mean_t = (torch.tensor(mean, device=device, dtype=torch.float32) / rescale).contiguous()
+    std_t = (torch.tensor(std, device=device, dtype=torch.float32) / rescale).contiguous()
+    return mean_t, std_t
+def max_taps(images: list[torch.Tensor], out_size: int, axis_dim: int, interp: str, antialias: bool) -> int:
+    """Batch-wide worst-case tap count for one axis = ceil(support) * 2 + 1."""
+    interp_half = 2.0 if interp == "bicubic" else 1.0
+    worst = 0
+    for img in images:
+        scale = img.shape[axis_dim] / out_size
+        eff = max(scale, 1.0) if antialias else 1.0
+        worst = max(worst, math.ceil(interp_half * eff) * 2 + 1)
+    return worst
+def as_image_list(images) -> list[torch.Tensor]:
+    """Accept a stacked (N, C, H, W) tensor or a list of CHW tensors; always return a list."""
+    if isinstance(images, torch.Tensor):
+        if images.dim() != 4:
+            raise ValueError(f"stacked input must be (N, C, H, W), got shape {tuple(images.shape)}")
+        return list(images)
+    return list(images)
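
The tap-count rule in `max_taps` decides how many loads each output pixel costs. The sketch below re-implements the per-image part of that rule for a single axis in plain Python (the name `taps_for_axis` is hypothetical) and compares the read counts of the two backends.

```python
import math


def taps_for_axis(in_size, out_size, interp, antialias):
    """Per-image tap count for one axis: ceil(support) * 2 + 1 (illustrative)."""
    interp_half = 2.0 if interp == "bicubic" else 1.0
    scale = in_size / out_size
    eff = max(scale, 1.0) if antialias else 1.0
    return math.ceil(interp_half * eff) * 2 + 1


taps = taps_for_axis(1024, 384, "bicubic", antialias=True)
fused_reads = taps * taps       # one 2D window per output pixel
separable_reads = taps + taps   # two 1D windows per output pixel
print(taps, fused_reads, separable_reads)
```

Shrinking 1024 to 384 with antialiased bicubic gives 13 taps per axis, that is 169 reads for the fused backend against 26 for the separable one. Because the batch-wide maximum is passed to the kernels as a `tl.constexpr`, one very large image in a batch widens the unrolled loop for every image.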

torch-ext/kernel_image_resize/_separable.py ADDED

	@@ -0,0 +1,280 @@

+"""Separable resize + center-crop + normalize over a ragged uint8 batch.
+WHAT "RESIZE" DOES, CONCRETELY
+Every output pixel is a weighted average of a small window of input pixels. When you shrink
+an image a lot (with antialiasing) that window gets wide — e.g. 13 input pixels across and
+13 down, so 13x13 = 169 input pixels feed one output pixel.
+FUSED vs SEPARABLE (the two backends in this package)
+- FUSED (see _fused.py): for each output pixel, read the whole 2D window directly -> 169 reads.
+- SEPARABLE (this file): do the resize as two 1D steps instead of one 2D step:
+    step 1 (horizontal): resize only the WIDTH  -> an intermediate image
+    step 2 (vertical):   resize only the HEIGHT -> the final image
+  Each step's window is 1D, so 13 + 13 = 26 reads per output pixel instead of 169. Same math,
+  far fewer reads. This is what PIL and torchvision do.
+CENTER CROP (folded in, no extra pass)
+Processors like CLIP / DINOv2 resize to a "resize size" and then keep only the centered
+crop. We do not materialize the full resized image and slice it; instead each output pixel
+of the CROP maps to a resize-image coordinate by adding the crop offset, and that maps back
+to the input. So:
+    resize is described by  (resize_height, resize_width)   -- per image
+    crop   is described by  (crop_top, crop_left)           -- per image, the centered offset
+    output size is          (crop_height, crop_width)        -- the same for every image
+When there is no crop, resize size == crop size and the offsets are 0 (the plain resize).
+The resize SCALE uses the resize size; only the output coordinate is shifted by the crop.
+Input is uint8 and the intermediate is float32; each 1D step renormalizes its own weights
+(matches torchvision). Output is parity-close to torchvision, not bit-identical: torchvision
+keeps a fixed-point uint8 intermediate, while ours is float32 and therefore more accurate.
+"""
+import triton
+import triton.language as tl
+from ._fused import _resample_weight
+from ._pack import fold_mean_std, pack_images
+@triton.jit
+def _horizontal_resize_kernel(
+    input_pixels,          # flat uint8 buffer, all images packed back to back
+    intermediate,          # flat float32 output: width resized + col-cropped, height untouched
+    input_offsets,         # input_offsets[image] = where that image starts in input_pixels
+    intermediate_offsets,  # same idea for the intermediate buffer
+    heights, widths,       # per-image input height / width
+    resize_widths,         # per-image width to resize to (before cropping)
+    crop_lefts,            # per-image left offset of the centered crop
+    crop_w,                # output (crop) width, same for every image
+    cubic_coeff,
+    CHANNELS: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_COL: tl.constexpr,
+):
+    """Resize width to resize_width, keep only the cropped columns: uint8 (C,H,W) -> float (C,H,crop_w)."""
+    image_index = tl.program_id(0)
+    block_index = tl.program_id(1)
+    in_height = tl.load(heights + image_index)
+    in_width = tl.load(widths + image_index)
+    resize_width = tl.load(resize_widths + image_index)
+    crop_left = tl.load(crop_lefts + image_index)
+    input_start = tl.load(input_offsets + image_index)
+    intermediate_start = tl.load(intermediate_offsets + image_index)
+    in_width_f = in_width.to(tl.float32)
+    num_pixels = in_height * crop_w  # every input row x every cropped output column
+    flat_index = block_index * BLOCK + tl.arange(0, BLOCK)
+    active = flat_index < num_pixels
+    input_row = flat_index // crop_w
+    out_col = flat_index % crop_w
+    resize_col = out_col + crop_left  # column in the (uncropped) resized image
+    filter_half = 2.0 if CUBIC else 1.0
+    col_scale = in_width_f / resize_width.to(tl.float32)
+    col_filter_scale = tl.maximum(col_scale, 1.0) if ANTIALIAS else 1.0
+    col_support = filter_half * col_filter_scale
+    col_inv_scale = 1.0 / col_filter_scale
+    src_center_col = col_scale * (resize_col.to(tl.float32) + 0.5)
+    first_tap_col = tl.floor(src_center_col - col_support + 0.5)
+    # First pass over the taps: sum the weights so the accumulator can be renormalized below.
+    col_weight_sum = tl.zeros([BLOCK], dtype=tl.float32)
+    for tap in tl.static_range(MAX_TAPS_COL):
+        tap_col = first_tap_col + tap
+        weight = _resample_weight((tap_col - src_center_col + 0.5) * col_inv_scale, cubic_coeff, CUBIC)
+        if ANTIALIAS:
+            weight = tl.where((tap_col >= 0.0) & (tap_col < in_width_f), weight, 0.0)
+        col_weight_sum += weight
+    input_plane = (in_height * in_width).to(tl.int64)
+    intermediate_plane = (in_height * crop_w).to(tl.int64)
+    in_width_i64 = in_width.to(tl.int64)
+    crop_w_i64 = crop_w.to(tl.int64)
+    input_row_i64 = input_row.to(tl.int64)
+    for channel in tl.static_range(CHANNELS):
+        input_row_base = input_start + channel * input_plane + input_row_i64 * in_width_i64
+        accumulator = tl.zeros([BLOCK], dtype=tl.float32)
+        for tap in tl.static_range(MAX_TAPS_COL):
+            tap_col = first_tap_col + tap
+            weight = _resample_weight((tap_col - src_center_col + 0.5) * col_inv_scale, cubic_coeff, CUBIC)
+            if ANTIALIAS:
+                weight = tl.where((tap_col >= 0.0) & (tap_col < in_width_f), weight, 0.0)
+            # Clamp so every tap reads a valid address: zero weight when antialiased, edge pixel otherwise.
+            clamped_tap_col = tl.minimum(tl.maximum(tap_col.to(tl.int32), 0), in_width - 1).to(tl.int64)
+            pixel = tl.load(input_pixels + input_row_base + clamped_tap_col, mask=active, other=0).to(tl.float32)
+            accumulator += weight * pixel
+        accumulator = accumulator / col_weight_sum
+        write_index = intermediate_start + channel * intermediate_plane + input_row_i64 * crop_w_i64 + out_col
+        tl.store(intermediate + write_index, accumulator, mask=active)
+@triton.jit
+def _vertical_resize_normalize_kernel(
+    intermediate,          # float32 from the horizontal step: (C, H, crop_w) per image
+    output,                # final (N, C, crop_h, crop_w) float32
+    intermediate_offsets,
+    heights,               # per-image input height (the intermediate still has H rows)
+    resize_heights,        # per-image height to resize to (before cropping)
+    crop_tops,             # per-image top offset of the centered crop
+    means, stds,           # per-channel normalization, rescale already folded in
+    crop_h, crop_w,
+    cubic_coeff,
+    CHANNELS: tl.constexpr, BLOCK: tl.constexpr,
+    CUBIC: tl.constexpr, ANTIALIAS: tl.constexpr,
+    MAX_TAPS_ROW: tl.constexpr,
+):
+    """Resize height to resize_height, keep cropped rows, normalize: float (C,H,crop_w) -> (C,crop_h,crop_w)."""
+    image_index = tl.program_id(0)
+    block_index = tl.program_id(1)
+    in_height = tl.load(heights + image_index)
+    resize_height = tl.load(resize_heights + image_index)
+    crop_top = tl.load(crop_tops + image_index)
+    intermediate_start = tl.load(intermediate_offsets + image_index)
+    in_height_f = in_height.to(tl.float32)
+    num_pixels = crop_h * crop_w
+    flat_index = block_index * BLOCK + tl.arange(0, BLOCK)
+    active = flat_index < num_pixels
+    out_row = flat_index // crop_w
+    out_col = flat_index % crop_w
+    resize_row = out_row + crop_top  # row in the (uncropped) resized image
+    filter_half = 2.0 if CUBIC else 1.0
+    row_scale = in_height_f / resize_height.to(tl.float32)
+    row_filter_scale = tl.maximum(row_scale, 1.0) if ANTIALIAS else 1.0
+    row_support = filter_half * row_filter_scale
+    row_inv_scale = 1.0 / row_filter_scale
+    src_center_row = row_scale * (resize_row.to(tl.float32) + 0.5)
+    first_tap_row = tl.floor(src_center_row - row_support + 0.5)
+    # First pass over the taps: sum the weights so the accumulator can be renormalized below.
+    row_weight_sum = tl.zeros([BLOCK], dtype=tl.float32)
+    for tap in tl.static_range(MAX_TAPS_ROW):
+        tap_row = first_tap_row + tap
+        weight = _resample_weight((tap_row - src_center_row + 0.5) * row_inv_scale, cubic_coeff, CUBIC)
+        if ANTIALIAS:
+            weight = tl.where((tap_row >= 0.0) & (tap_row < in_height_f), weight, 0.0)
+        row_weight_sum += weight
+    intermediate_plane = (in_height * crop_w).to(tl.int64)
+    crop_w_i64 = crop_w.to(tl.int64)
+    out_col_i64 = out_col.to(tl.int64)
+    for channel in tl.static_range(CHANNELS):
+        channel_base = intermediate_start + channel * intermediate_plane
+        accumulator = tl.zeros([BLOCK], dtype=tl.float32)
+        for tap in tl.static_range(MAX_TAPS_ROW):
+            tap_row = first_tap_row + tap
+            weight = _resample_weight((tap_row - src_center_row + 0.5) * row_inv_scale, cubic_coeff, CUBIC)
+            if ANTIALIAS:
+                weight = tl.where((tap_row >= 0.0) & (tap_row < in_height_f), weight, 0.0)
+            # Clamp so every tap reads a valid address: zero weight when antialiased, edge pixel otherwise.
+            clamped_tap_row = tl.minimum(tl.maximum(tap_row.to(tl.int32), 0), in_height - 1).to(tl.int64)
+            pixel = tl.load(intermediate + channel_base + clamped_tap_row * crop_w_i64 + out_col_i64, mask=active, other=0.0)
+            accumulator += weight * pixel
+        accumulator = accumulator / row_weight_sum
+        mean = tl.load(means + channel)
+        std = tl.load(stds + channel)
+        accumulator = (accumulator - mean) / std
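+        # Promote to int64 so the flat output index cannot overflow int32 on very large batches.
+        write_index = ((image_index.to(tl.int64) * CHANNELS + channel) * crop_h + out_row) * crop_w + out_col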
+        tl.store(output + write_index, accumulator, mask=active)
+def _axis_max_taps(in_sizes, resize_sizes, interp, antialias):
+    """Widest 1D window over the batch for one axis: ceil(support) * 2 + 1, with support scaled by in_size / resize_size."""
+    import math
+    interp_half = 2.0 if interp == "bicubic" else 1.0
+    worst = 0
+    for in_size, resize_size in zip(in_sizes, resize_sizes):
+        scale = in_size / resize_size
+        eff = max(scale, 1.0) if antialias else 1.0
+        worst = max(worst, math.ceil(interp_half * eff) * 2 + 1)
+    return worst
+def _run_separable(images, resize_heights, resize_widths, crop_tops, crop_lefts, crop_h, crop_w,
+                   mean, std, rescale, interp, antialias, block):
+    """Core driver: resize each image to its (resize_h, resize_w), keep the centered crop, normalize."""
+    import torch
+    if not images:
+        raise ValueError("images must contain at least one image")
+    device = images[0].device
+    num_images = len(images)
+    cubic_coeff = -0.5 if antialias else -0.75
+    in_heights = [int(img.shape[1]) for img in images]
+    in_widths = [int(img.shape[2]) for img in images]
+    max_taps_row = _axis_max_taps(in_heights, resize_heights, interp, antialias)
+    max_taps_col = _axis_max_taps(in_widths, resize_widths, interp, antialias)
+    means, stds = fold_mean_std(mean, std, rescale, device)
+    input_pixels, input_offsets, heights, widths, channels = pack_images(images, dtype=torch.uint8)
+    intermediate_offsets_list, cursor, tallest = [], 0, 0
+    for height in in_heights:
+        intermediate_offsets_list.append(cursor)
+        cursor += channels * height * crop_w
+        tallest = max(tallest, height)
+    intermediate = torch.empty(cursor, device=device, dtype=torch.float32)
+    intermediate_offsets = torch.tensor(intermediate_offsets_list, device=device, dtype=torch.int64)
+    resize_heights_t = torch.tensor(resize_heights, device=device, dtype=torch.int32)
+    resize_widths_t = torch.tensor(resize_widths, device=device, dtype=torch.int32)
+    crop_tops_t = torch.tensor(crop_tops, device=device, dtype=torch.int32)
+    crop_lefts_t = torch.tensor(crop_lefts, device=device, dtype=torch.int32)
+    horizontal_grid = (num_images, triton.cdiv(tallest * crop_w, block))
+    _horizontal_resize_kernel[horizontal_grid](
+        input_pixels, intermediate, input_offsets, intermediate_offsets, heights, widths,
+        resize_widths_t, crop_lefts_t, crop_w, cubic_coeff,
+        CHANNELS=channels, BLOCK=block, CUBIC=(interp == "bicubic"), ANTIALIAS=antialias,
+        MAX_TAPS_COL=max_taps_col,
+    )
+    output = torch.empty((num_images, channels, crop_h, crop_w), device=device, dtype=torch.float32)
+    vertical_grid = (num_images, triton.cdiv(crop_h * crop_w, block))
+    _vertical_resize_normalize_kernel[vertical_grid](
+        intermediate, output, intermediate_offsets, heights, resize_heights_t, crop_tops_t, means, stds,
+        crop_h, crop_w, cubic_coeff,
+        CHANNELS=channels, BLOCK=block, CUBIC=(interp == "bicubic"), ANTIALIAS=antialias,
+        MAX_TAPS_ROW=max_taps_row,
+    )
+    return output
+def _aspect_preserving_size(in_h, in_w, shortest_edge):
+    """transformers shortest-edge rule: short side -> shortest_edge, long side truncated (int(), not round)."""
+    if in_h <= in_w:
+        return shortest_edge, int(in_w * shortest_edge / in_h)
+    return int(in_h * shortest_edge / in_w), shortest_edge
+def separable_resize_normalize(images, out_h, out_w, mean, std, rescale, interp, antialias, block: int = 256):
+    """Resize to (out_h, out_w) and normalize (no crop)."""
+    images = list(images)
+    n = len(images)
+    return _run_separable(images, [out_h] * n, [out_w] * n, [0] * n, [0] * n, out_h, out_w,
+                          mean, std, rescale, interp, antialias, block)
+def separable_resize_crop_normalize(images, resize_size, crop_size, mean, std, rescale, interp, antialias,
+                                    resize_mode="square", block: int = 256):
+    """Resize then center-crop then normalize.
+    resize_mode="square": resize_size is (resize_h, resize_w) applied to every image.
+    resize_mode="shortest_edge": resize_size is an int; each image is resized aspect-preserving
+    so its short side equals it, then center-cropped to crop_size.
+    """
+    images = list(images)
+    crop_h, crop_w = crop_size
+    resize_heights, resize_widths = [], []
+    for img in images:
+        in_h, in_w = int(img.shape[1]), int(img.shape[2])
+        if resize_mode == "shortest_edge":
+            rh, rw = _aspect_preserving_size(in_h, in_w, int(resize_size))
+        elif resize_mode == "square":
+            rh, rw = int(resize_size[0]), int(resize_size[1])
+        else:
+            raise ValueError(f"resize_mode must be 'square' or 'shortest_edge', got {resize_mode!r}")
+        if rh < crop_h or rw < crop_w:
+            raise ValueError(f"resize size ({rh},{rw}) smaller than crop ({crop_h},{crop_w})")
+        resize_heights.append(rh)
+        resize_widths.append(rw)
+    crop_tops = [(rh - crop_h) // 2 for rh in resize_heights]
+    crop_lefts = [(rw - crop_w) // 2 for rw in resize_widths]
+    return _run_separable(images, resize_heights, resize_widths, crop_tops, crop_lefts, crop_h, crop_w,
+                          mean, std, rescale, interp, antialias, block)
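
To make the index arithmetic above concrete, here is a minimal CPU sketch of one horizontal pass over a single row, together with the shortest-edge sizing and centered crop offsets. It is an illustration under stated assumptions, not the package's implementation: `triangle`, `shortest_edge_size`, and `resize_row` are hypothetical names, and the bilinear filter is assumed to be a plain triangle (the real weights come from `_resample_weight` in `_fused.py`, which is not reproduced here).

```python
import math

def triangle(x):
    # Assumed bilinear filter: weight drops linearly to 0 at distance 1.
    return max(0.0, 1.0 - abs(x))

def shortest_edge_size(in_h, in_w, shortest_edge):
    # Short side becomes shortest_edge; long side is truncated with int().
    if in_h <= in_w:
        return shortest_edge, int(in_w * shortest_edge / in_h)
    return int(in_h * shortest_edge / in_w), shortest_edge

def resize_row(row, resize_w, crop_left, crop_w, antialias=True):
    """Bilinear resize of one row to resize_w, keeping crop_w columns from crop_left."""
    in_w = len(row)
    scale = in_w / resize_w
    filter_scale = max(scale, 1.0) if antialias else 1.0
    support = 1.0 * filter_scale            # bilinear half-width is 1
    max_taps = math.ceil(support) * 2 + 1
    out = []
    for out_col in range(crop_w):
        resize_col = out_col + crop_left    # column in the uncropped resized row
        center = scale * (resize_col + 0.5)
        first_tap = math.floor(center - support + 0.5)
        acc, total = 0.0, 0.0
        for tap in range(max_taps):
            col = first_tap + tap
            weight = triangle((col - center + 0.5) / filter_scale)
            if antialias and not (0 <= col < in_w):
                weight = 0.0                # out-of-range taps contribute nothing
            clamped = min(max(col, 0), in_w - 1)
            acc += weight * row[clamped]
            total += weight
        out.append(acc / total)             # renormalize by the weight sum
    return out

# Shortest-edge sizing plus centered crop offsets for a 480x640 image, crop 224.
resize_h, resize_w = shortest_edge_size(480, 640, 256)
crop_top, crop_left = (resize_h - 224) // 2, (resize_w - 224) // 2

# Folding the crop into the pass gives the same values as resizing, then slicing.
ramp = [float(v) for v in range(8)]
full = resize_row(ramp, resize_w=4, crop_left=0, crop_w=4)
cropped = resize_row(ramp, resize_w=4, crop_left=1, crop_w=2)
```

In this sketch `cropped` equals `full[1:3]`, which is the property the kernels rely on when they shift the output coordinate by the crop offset instead of materializing the full resized image.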