# kernel_image_resize: how every op works (study notes)

This explains the whole package end to end: the resampling math, the data layout, and every kernel op, plus the benchmark findings. It is meant for pen & paper: there is a fully worked numeric example you can reproduce by hand in the "Worked example" section.

The package does one thing: **resize + rescale + normalize**, the op sequence a `transformers` fast image processor runs (`TorchvisionBackend`: resize, then `(x*rescale - mean)/std`), as a single GPU pipeline.

Input: raw CHW `uint8` images (any size, ragged). Output: `(N, C, out_h, out_w)` normalized `float32`.

---

## 1. The one idea behind resizing

Resizing does not "pick" pixels; each **output** pixel is a **weighted average of a small window of input pixels**. Two things define that average:

1. **Where** in the input an output pixel lands (its center).
2. **Which** input pixels are in its window, and with **what weights**.

The weight of an input pixel falls off with distance from the center, following a filter curve:

- **bilinear** → a triangle, width 1 on each side (so 2 input pixels per axis normally).
- **bicubic** → a cubic bump, width 2 on each side (so 4 input pixels per axis normally).

When you **shrink** an image, you must also **blur first** or you get aliasing. That is what `antialias=True` does: it widens the window so each output pixel averages more input pixels (a low-pass filter before throwing pixels away). Widening is proportional to the shrink factor, so shrinking 3× turns a 4-tap bicubic into ~13 taps.

---

## 2. The resampling-weight formula (the heart of everything)

All kernels use the same formula, which matches PyTorch's aten `UpSampleKernel` (`align_corners=False`, "half-pixel" convention).
For one axis:

```
scale       = in_size / out_size                        # > 1 means shrinking
interp_half = 1 (bilinear) or 2 (bicubic)               # half-width of the filter
cubic_a     = -0.75 (no antialias) or -0.5 (antialias)  # the cubic curve's shape constant

# antialias only widens the window, and only when shrinking:
if antialias and scale > 1:
    eff = scale            # window widens by the shrink factor
else:
    eff = 1                # plain 2-tap / 4-tap interpolation
support = interp_half * eff    # half-window width, in INPUT pixels
inv     = 1 / eff              # squashes the filter curve to match the widened window

# for output index i:
center    = scale * (i + 0.5)               # input coordinate this output maps to
first_tap = floor(center - support + 0.5)   # leftmost input pixel in the window

# for each tap t = 0, 1, 2, ... (up to MAX_TAPS):
tap_pos = first_tap + t                     # an input pixel index
arg     = (tap_pos - center + 0.5) * inv    # distance from center, squashed
weight  = filter(arg)                       # triangle or cubic, see below
```

The filter (`_resample_weight` in `_fused.py`), with `x = |arg|` and `a = cubic_a`:

```
bilinear:  max(1 - x, 0)                         # triangle, zero past 1
bicubic:   x <= 1    : (a+2)x^3 - (a+3)x^2 + 1
           1 < x < 2 : a*x^3 - 5a*x^2 + 8a*x - 4a
           x >= 2    : 0
```

Near the image borders a tap can fall outside the image (`tap_pos < 0` or `tap_pos >= in_size`). The kernels handle those taps at load time (the pseudocode in sections 5 and 6 clamps the index), then **renormalize** by dividing by the sum of the realized weights. This keeps the average correct at the edges. That renormalization is why every kernel computes a `weight_sum` and divides by it. For the non-antialias case `weight_sum == 1`, so the division is a harmless no-op.

---

## 3. Worked example (do this by hand)

**bilinear, no antialias, one axis, in_size=4, out_size=2.**

```
scale = 4/2 = 2,  interp_half = 1,  eff = 1,  support = 1,  inv = 1
```

Output pixel `i = 0`:

```
center    = 2 * (0 + 0.5) = 1.0
first_tap = floor(1.0 - 1 + 0.5) = floor(0.5) = 0
t=0: tap_pos=0, arg=(0-1.0+0.5)= -0.5  -> weight = 1-0.5 = 0.5
t=1: tap_pos=1, arg=(1-1.0+0.5)=  0.5  -> weight = 1-0.5 = 0.5
t=2: tap_pos=2, arg=(2-1.0+0.5)=  1.5  -> weight = max(1-1.5,0) = 0
weight_sum = 1.0
output[0] = (0.5*in[0] + 0.5*in[1]) / 1.0     # halfway between in[0] and in[1]
```

Output pixel `i = 1`:

```
center    = 2 * 1.5 = 3.0
first_tap = floor(3.0 - 1 + 0.5) = floor(2.5) = 2
t=0: tap_pos=2, arg=-0.5 -> 0.5
t=1: tap_pos=3, arg= 0.5 -> 0.5
t=2: tap_pos=4, arg= 1.5 -> 0     (index 4 would clamp to 3, but weight is 0 anyway)
output[1] = 0.5*in[2] + 0.5*in[3]
```

This 1-D operation is exactly one pass of the separable kernel. The 2-D result is the same formula applied on both axes (rows and columns).

---

## 4. Data layout (host side, `_pack.py`)

Ragged images (different H×W) cannot be stacked into one tensor, so they are flattened and concatenated into one buffer, with side tables describing each image.

`pack_images(images, dtype)` →

```
input_pixels          : 1-D buffer, all images flattened (C,H,W row-major) and concatenated
offsets[n]            : element index where image n starts
heights[n], widths[n] : that image's H and W
channels              : C (shared by all images)
```

Address of input pixel `(channel, row, col)` of image `n`, where `H = heights[n]` and `W = widths[n]`:

```
input_pixels[ offsets[n] + channel*(H*W) + row*W + col ]
```

The separable path packs as `uint8` (1 byte/pixel, a quarter of the memory traffic of `float32`).

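
The packing scheme and the address formula above can be tried out without a GPU. The following is a minimal pure-Python sketch, not the real `_pack.py`: `pack_images_sketch`, `pixel_at`, and `make_image` are hypothetical names, and images are plain nested lists `[C][H][W]` of ints in 0..255 instead of tensors.

```python
def pack_images_sketch(images):
    """Flatten ragged CHW images into one uint8 buffer plus side tables.

    Hypothetical stand-in for pack_images; returns
    (input_pixels, offsets, heights, widths, channels).
    """
    channels = len(images[0])
    input_pixels = bytearray()
    offsets, heights, widths = [], [], []
    for image in images:
        if len(image) != channels:
            raise ValueError("all images must share the same channel count")
        offsets.append(len(input_pixels))   # element index where this image starts
        heights.append(len(image[0]))
        widths.append(len(image[0][0]))
        for plane in image:                 # C, then H, then W (row-major)
            for row in plane:
                input_pixels.extend(row)
    return input_pixels, offsets, heights, widths, channels


def pixel_at(input_pixels, offsets, heights, widths, n, channel, row, col):
    """Address formula from the text, applied to image n."""
    h, w = heights[n], widths[n]
    return input_pixels[offsets[n] + channel * (h * w) + row * w + col]


def make_image(channels, height, width, seed):
    """Deterministic toy image with values in 0..255."""
    return [[[(seed + 50 * c + 10 * r + x) % 256 for x in range(width)]
             for r in range(height)]
            for c in range(channels)]


# Two images with different H x W, same C.
images = [make_image(3, 4, 5, seed=0), make_image(3, 2, 7, seed=100)]
buf, offsets, heights, widths, channels = pack_images_sketch(images)
print(len(buf), offsets, heights, widths, channels)
```

Because each image keeps its own `H` and `W` in the side tables, the same flat buffer can hold any mix of sizes; a kernel only needs the image index to find its pixels.
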
`fold_mean_std(mean, std, rescale)` → folds the rescale factor into the normalization constants so the kernel does a single `(x - m)/s`:

```
m = mean / rescale
s = std  / rescale
(x - m)/s  ==  (x*rescale - mean)/std      # identical to the processor's fused normalize
```

`max_taps(images, out_size, axis, interp, antialias)` → the **widest** window in the batch = `ceil(support) * 2 + 1`. A Triton loop bound must be a compile-time constant, so every program loops this fixed count; taps beyond a given pixel's real window get ~0 weight.

`as_image_list` → accepts a stacked `(N,C,H,W)` tensor or a list, always returns a list.

---

## 5. Fused kernel (`_fused.py`, `backend="fused"`)

One launch. **One program = one image + a BLOCK of its output pixels.** Each output pixel reads the **full 2-D window** directly: `MAX_TAPS_H × MAX_TAPS_W` input pixels.

```
grid = (num_images, ceil(out_h*out_w / BLOCK))

per lane (one output pixel):
    oy, ox = (flat_index // out_w, flat_index % out_w)
    center_y, center_x, first_tap_y, first_tap_x       # section 2, both axes

    # weight_sum factorizes across axes (separable math, even though the LOADS are 2-D):
    sum_wy = Σ_ty filter_y ;  sum_wx = Σ_tx filter_x ;  denom = sum_wy * sum_wx

    for channel:
        acc = 0
        for ty in 0..MAX_TAPS_H:           # <-- the 2-D window: TAPS_H * TAPS_W loads
            for tx in 0..MAX_TAPS_W:
                weight = filter_y(ty) * filter_x(tx)
                pixel  = input_pixels[channel, clamp(tap_y), clamp(tap_x)]
                acc   += weight * pixel
        acc = acc / denom
        out[image, channel, oy, ox] = (acc - mean[channel]) / std[channel]
```

Cost per output pixel: `TAPS_H * TAPS_W` loads (e.g. 13×13 = **169**). Correct and simple, but the 2-D load count is what makes it slow, hence the separable version.

---

## 6. Separable kernel (`_separable.py`, `backend="separable"`, the default)

Same math, but the 2-D window is done as **two 1-D passes**, with a float intermediate buffer in between. Loads per output pixel: `TAPS_W + TAPS_H` (e.g. 13+13 = **26**).

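
The claim that two 1-D passes give the same result as the full 2-D window can be checked in a few lines of plain Python. This is a minimal CPU sketch for bilinear only, not the Triton kernels: `tap_weights`, `resize_direct`, and `resize_separable` are hypothetical names, and clamping out-of-range tap indices before dividing by the weight sum is an assumption about the edge handling.

```python
import math


def tap_weights(in_size, out_size, i, antialias=False):
    """Bilinear taps for output index i on one axis (formula from section 2).

    Returns (input indices, weights); weights are divided by their sum.
    Assumption: out-of-range tap indices are clamped into the image.
    """
    scale = in_size / out_size
    eff = scale if (antialias and scale > 1) else 1.0
    support = 1.0 * eff                           # interp_half = 1 for bilinear
    center = scale * (i + 0.5)
    first_tap = math.floor(center - support + 0.5)
    indices, weights = [], []
    for t in range(math.ceil(support) * 2 + 1):   # same bound as max_taps
        tap_pos = first_tap + t
        arg = (tap_pos - center + 0.5) / eff
        weights.append(max(1.0 - abs(arg), 0.0))  # triangle filter
        indices.append(min(max(tap_pos, 0), in_size - 1))
    total = sum(weights)
    return indices, [w / total for w in weights]


def resize_direct(img, out_h, out_w):
    """Every output pixel reads its full 2-D window (fused access pattern)."""
    rows = [tap_weights(len(img), out_h, oy) for oy in range(out_h)]
    cols = [tap_weights(len(img[0]), out_w, ox) for ox in range(out_w)]
    return [[sum(wy * wx * img[y][x]
                 for y, wy in zip(*rows[oy])
                 for x, wx in zip(*cols[ox]))
             for ox in range(out_w)]
            for oy in range(out_h)]


def resize_separable(img, out_h, out_w):
    """Pass 1 resizes width into an intermediate, pass 2 resizes height."""
    rows = [tap_weights(len(img), out_h, oy) for oy in range(out_h)]
    cols = [tap_weights(len(img[0]), out_w, ox) for ox in range(out_w)]
    inter = [[sum(wx * row[x] for x, wx in zip(*cols[ox]))
              for ox in range(out_w)]
             for row in img]                      # shape: in_h x out_w
    return [[sum(wy * inter[y][ox] for y, wy in zip(*rows[oy]))
             for ox in range(out_w)]
            for oy in range(out_h)]


# Toy single-channel 4x6 image, resized to 2x3.
img = [[(7 * r + 3 * c) % 11 for c in range(6)] for r in range(4)]
direct = resize_direct(img, 2, 3)
separable = resize_separable(img, 2, 3)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(direct, separable)
               for a, b in zip(row_a, row_b))
print(tap_weights(4, 2, 0), max_diff)
```

`tap_weights(4, 2, 0)` reproduces the worked example from section 3 (taps 0, 1, 2 with weights 0.5, 0.5, 0), and `max_diff` stays at floating-point noise because the 2-D weight is just the product of the two 1-D weights.
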
```
input_pixels (uint8, C×H×W) --pass1--> intermediate (float, C×H×out_w) --pass2--> output (float, C×out_h×out_w)
                            resize W                                   resize H + normalize
```

The **intermediate** is the key object: same **height** as the input, but already the **final width**. ("Tall and narrow.") It is also ragged in height, so it gets its own offset table (built in `separable_resize_normalize`, same scheme as `pack_images`).

### Pass 1: `_horizontal_resize_kernel` (resize width only)

```
grid = (num_images, ceil(H*out_w / BLOCK))      # work = every input row × every output col

per lane:
    input_row = flat_index // out_w             # row index, UNCHANGED by this pass
    out_col   = flat_index %  out_w             # output column being computed
    center_x, first_tap_x, col_weight_sum       # section 2, COLUMN axis only

    for channel:
        acc = 0
        for tap in 0..MAX_TAPS_COL:             # <-- 1-D: only TAPS_W loads
            weight = filter_x(tap)
            pixel  = input_pixels[channel, input_row, clamp(tap_col)]   # uint8 -> float
            acc   += weight * pixel
        acc = acc / col_weight_sum
        intermediate[channel, input_row, out_col] = acc                 # NO normalize yet
```

Reads original `uint8` bytes; writes `float32`. No normalization here.

### Pass 2: `_vertical_resize_normalize_kernel` (resize height, then normalize)

```
grid = (num_images, ceil(out_h*out_w / BLOCK))  # work = every output pixel

per lane:
    out_row = flat_index // out_w
    out_col = flat_index %  out_w
    center_y, first_tap_y, row_weight_sum       # section 2, ROW axis only

    for channel:
        acc = 0
        for tap in 0..MAX_TAPS_ROW:             # <-- 1-D: only TAPS_H loads
            weight = filter_y(tap)
            pixel  = intermediate[channel, clamp(tap_row), out_col]     # float
            acc   += weight * pixel
        acc = acc / row_weight_sum
        out[image, channel, out_row, out_col] = (acc - mean[channel]) / std[channel]   # normalize here
```

Two launches (an implicit sync between them), so pass 2 always sees pass 1's finished output.

### Why separable wins

`TAPS_W + TAPS_H` loads instead of `TAPS_W * TAPS_H`. For a 13×13 window that is 26 vs 169. This is exactly the algorithm PIL and torchvision use.

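
The 13-tap figure and the 26 vs 169 load counts follow directly from the `ceil(support) * 2 + 1` bound in section 4. A small arithmetic sketch, assuming a 3× shrink on both axes with bicubic + antialias (`taps_per_axis` is a hypothetical helper, and 1152 → 384 is just an illustrative size pair):

```python
import math


def taps_per_axis(in_size, out_size, interp="bicubic", antialias=True):
    """Loop bound for one axis: ceil(support) * 2 + 1, as in section 4."""
    scale = in_size / out_size
    interp_half = 1 if interp == "bilinear" else 2
    eff = scale if (antialias and scale > 1) else 1.0
    support = interp_half * eff
    return math.ceil(support) * 2 + 1


taps = taps_per_axis(1152, 384)     # 3x shrink: support = 2 * 3 = 6
separable_loads = taps + taps       # one 1-D pass per axis
fused_loads = taps * taps           # full 2-D window
print(taps, separable_loads, fused_loads)
```

The gap grows with the shrink factor: the separable count is linear in the tap count, the fused count is quadratic.
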
The catch: an extra full-size float intermediate buffer (more memory traffic), but the read-count reduction dominates.

Parity note: the intermediate here is **float32**; torchvision keeps a **fixed-point uint8** intermediate. So the separable output is parity-*close* to torchvision, not bit-identical, and the float version is actually the more accurate one.

---

## 7. Public API (`__init__.py`)

```
resize_normalize(images, size, image_mean, image_std,
                 rescale_factor=1/255, resample="bilinear", antialias=False,
                 backend="separable", block=256)
```

- `images`: stacked `(N,C,H,W)` tensor or a list of CHW tensors.
- `size`: int (square), `(H,W)`, or `{"height","width"}`.
- `resample`: `"bilinear"`/`"bicubic"`, or a PIL resample int (0/2→bilinear, 3→bicubic).
- `backend`: `"separable"` (default, fastest) or `"fused"` (2-D reference).
- `resize_normalize_ragged`: same kernels, list-only.

---

## 8. Benchmark findings (A100, CUDA_VISIBLE_DEVICES=1)

### Standalone resize+normalize: SigLIP-so400m config, N=32 ragged 384–1024², out 384², bicubic+AA

```
torchvision eager loop  :  2.91 ms   (per-image float loop)
torchvision compiled    :  5.70 ms   (torch.compile dynamic, per-image; slower than eager)
torchvision compiled pkt:  2.55 ms   (one graph over a padded stack; timing only)
fused triton (2D)       : 11.49 ms   (taps*taps; the slow reference)
separable triton (uint8):  1.29 ms   (taps+taps)   <-- fastest
real processor          :  3.92 ms
```

**Separable is ~3× faster than the real processor**, parity ≤1e-4 vs torchvision-float. The fused 2-D kernel loses for the algorithmic reason above (169 vs 26 loads). `torch.compile` does not help: per-image it is *slower* (dispatch overhead over 32 ragged shapes); even as one packed graph it only matches the eager loop, because inductor's interpolate is no faster than aten resize.

### End-to-end inference: Siglip2-base-patch16-224, **bf16** forward

```
# times in ms
             preprocess   forward(fixed input)   preprocess+forward
processor       3.99            12.86                 14.44
separable       0.93            13.02                 13.76     <-- ~5% faster e2e
fused           2.00            13.01                 14.79
compiled        6.14            12.89                 14.00

feature parity (separable/fused/compiled vs processor): 9.38e-2 = 1.2% of feature max
```

- `forward(fixed input)` is identical (~12.9 ms) for all → **no inference regression**; the model does not care which preprocessor made the tensor.
- The 1.2% feature drift is the float-vs-uint8 resize difference, identical across all float backends → not a bug. The float path is the more accurate one.
- End-to-end win is ~5% with a bf16 forward (was ~0.5% with fp32, where the forward was ~80 ms). **The win scales with how preprocessing-bound you are.**

### Data path from JPEG bytes: 552 KB/img

```
CPU decode + torchvision resize : 177.5 ms   (status quo)
CPU decode + separable kernel   : 176.4 ms   (kernel saves ~1 ms; decode dominates)
GPU decode (nvJPEG) + kernel    :  14.8 ms   (fully on-GPU)
```

- ~175 ms of the 177 ms is **CPU JPEG decode + host→device copy**. Resize/normalize is ~1%.
- The 12× win (177→15) is **GPU decode (nvJPEG)**, i.e. `torchvision.io.decode_jpeg(device="cuda")`, *not* the kernel. The kernel is the resize/normalize component of that GPU pipeline.

---

## 9. What is true / what to claim

- The kernel is **correct** (≤1e-4 vs torchvision-float, more accurate than the processor's uint8 path) and feeds the model with **no inference regression**.
- It is **~3× faster than the real processor at the resize/normalize stage**, a real, parity-clean win.
- It does **not** speed up preprocessing 12×. Decode dominates the data path; the GPU-decode lever is nvJPEG, a torchvision feature, not this kernel.
- The kernel matters end-to-end only once you are **not decode-bound**: in a GPU-decode pipeline it keeps resize/normalize minimal (~10% of that pipeline), and its standalone preprocess win shows up when the forward is small (bf16, small model, large batch).
- Honest one-liner: *"GPU-native resize+normalize, 3× the fast processor at that stage, drop-in for a GPU-decode pipeline."*