| # kernel_image_resize — how every op works (study notes) |
|
|
| This explains the whole package end to end: the resampling math, the data layout, and |
| every kernel op, plus the benchmark findings. It is meant for pen & paper — there is a |
| fully worked numeric example you can reproduce by hand in the "Worked example" section. |
|
|
| The package does one thing: **resize + rescale + normalize**, the op sequence a |
| `transformers` fast image processor runs (`TorchvisionBackend`: resize, then |
| `(x*rescale - mean)/std`), as a single GPU pipeline. Input: raw CHW `uint8` images |
| (any size, ragged). Output: `(N, C, out_h, out_w)` normalized `float32`. |
|
|
| --- |
|
|
| ## 1. The one idea behind resizing |
|
|
| Resizing does not "pick" pixels; each **output** pixel is a **weighted average of a small |
| window of input pixels**. Two things define that average: |
|
|
| 1. **Where** in the input an output pixel lands (its center). |
| 2. **Which** input pixels are in its window, and with **what weights**. |
|
|
| The weight of an input pixel falls off with distance from the center, following a filter |
| curve: |
|
|
| - **bilinear** → a triangle, width 1 on each side (so 2 input pixels per axis normally). |
| - **bicubic** → a cubic bump, width 2 on each side (so 4 input pixels per axis normally). |
|
|
| When you **shrink** an image, you must also **blur first** or you get aliasing. That is |
| what `antialias=True` does: it widens the window so each output pixel averages more input |
| pixels (a low-pass filter before throwing pixels away). Widening is proportional to the |
| shrink factor, so shrinking 3× turns a 4-tap bicubic into ~13 taps. |
|
|
| --- |
|
|
| ## 2. The resampling-weight formula (the heart of everything) |
|
|
| All kernels use the same formula, which matches PyTorch's aten `UpSampleKernel` |
| (`align_corners=False`, "half-pixel" convention). For one axis: |
|
|
| ``` |
| scale = in_size / out_size # > 1 means shrinking |
| interp_half = 1 (bilinear) or 2 (bicubic) # half-width of the filter |
| cubic_a = -0.75 (no antialias) or -0.5 (antialias) # the cubic curve's shape constant |
| |
| # antialias only widens the window, and only when shrinking: |
| if antialias and scale > 1: |
| eff = scale # window widens by the shrink factor |
| else: |
| eff = 1 # plain 2-tap / 4-tap interpolation |
| support = interp_half * eff # half-window width, in INPUT pixels |
| inv = 1 / eff # squashes the filter curve to match the widened window |
| |
| # for output index i: |
| center = scale * (i + 0.5) # input coordinate this output maps to |
| first_tap = floor(center - support + 0.5) # leftmost input pixel in the window |
| |
| # for each tap t = 0, 1, 2, ... (up to MAX_TAPS): |
| tap_pos = first_tap + t # an input pixel index |
| arg = (tap_pos - center + 0.5) * inv # distance from center, squashed |
| weight = filter(arg) # triangle or cubic, see below |
| ``` |
|
|
| The filter (`_resample_weight` in `_fused.py`), with `x = |arg|`: |
|
|
| ``` |
| bilinear: max(1 - x, 0) # triangle, zero past 1 |
| |
| bicubic: x <= 1 : (a+2)x^3 - (a+3)x^2 + 1 |
| 1<x<2 : a x^3 - 5a x^2 + 8a x - 4a |
| else : 0 |
| ``` |
|
|
| Two edge rules (both kernels do this identically): |
|
|
| - **non-antialias**: clamp the tap index into `[0, in_size-1]` → replicates the border |
| pixel. The filter weights of a standard 2/4-tap interpolation already sum to 1. |
| - **antialias**: instead set the weight to **0** for taps that fall off the image |
| (`tap_pos < 0` or `>= in_size`), then **renormalize** by dividing by the sum of the |
| realized weights. This keeps the average correct at the edges. |
|
|
| That renormalization is why every kernel computes a `weight_sum` and divides by it. For |
| the non-antialias case `weight_sum == 1`, so the division is a harmless no-op. |
|
|
| --- |
|
|
| ## 3. Worked example (do this by hand) |
|
|
| **bilinear, no antialias, one axis, in_size=4, out_size=2.** |
|
|
| ``` |
| scale = 4/2 = 2, interp_half = 1, eff = 1, support = 1, inv = 1 |
| ``` |
|
|
| Output pixel `i = 0`: |
| ``` |
| center = 2 * (0 + 0.5) = 1.0 |
| first_tap = floor(1.0 - 1 + 0.5) = floor(0.5) = 0 |
| t=0: tap_pos=0, arg=(0-1.0+0.5)= -0.5 -> weight = 1-0.5 = 0.5 |
| t=1: tap_pos=1, arg=(1-1.0+0.5)= 0.5 -> weight = 1-0.5 = 0.5 |
| t=2: tap_pos=2, arg=(2-1.0+0.5)= 1.5 -> weight = max(1-1.5,0) = 0 |
| weight_sum = 1.0 |
| output[0] = (0.5*in[0] + 0.5*in[1]) / 1.0 # halfway between in[0] and in[1] |
| ``` |
|
|
| Output pixel `i = 1`: |
| ``` |
| center = 2 * 1.5 = 3.0 |
| first_tap = floor(3.0 - 0.5) = 2 |
| t=0: tap_pos=2, arg=-0.5 -> 0.5 |
| t=1: tap_pos=3, arg= 0.5 -> 0.5 |
| t=2: tap_pos=4, arg= 1.5 -> 0 (index 4 would clamp to 3, but weight is 0 anyway) |
| output[1] = 0.5*in[2] + 0.5*in[3] |
| ``` |
|
|
| This 1-D operation is exactly one pass of the separable kernel. The 2-D result is the same |
| formula applied on both axes (rows and columns). |
|
|
| --- |
|
|
| ## 4. Data layout (host side, `_pack.py`) |
| |
| Ragged images (different H×W) cannot be stacked into one tensor, so they are flattened and |
| concatenated into one buffer, with side tables describing each image. |
| |
| `pack_images(images, dtype)` → |
| ``` |
| input_pixels : 1-D buffer, all images flattened (C,H,W row-major) and concatenated |
| offsets[n] : element index where image n starts |
| heights[n], widths[n] : that image's H and W |
| channels : C (shared by all images) |
| ``` |
| Address of input pixel `(channel, row, col)` of image `n`: |
| ``` |
| input_pixels[ offsets[n] + channel*(H*W) + row*W + col ] |
| ``` |
| The separable path packs as `uint8` (1 byte/pixel, half the memory traffic of float). |
|
|
| `fold_mean_std(mean, std, rescale)` → folds the rescale factor into the normalization |
| constants so the kernel does a single `(x - m)/s`: |
| ``` |
| m = mean / rescale s = std / rescale |
| (x - m)/s == (x*rescale - mean)/std # identical to the processor's fused normalize |
| ``` |
|
|
| `max_taps(images, out_size, axis, interp, antialias)` → the **widest** window in the batch |
| = `ceil(support) * 2 + 1`. A Triton loop bound must be a compile-time constant, so every |
| program loops this fixed count; taps beyond a given pixel's real window get ~0 weight. |
|
|
| `as_image_list` → accepts a stacked `(N,C,H,W)` tensor or a list, always returns a list. |
|
|
| --- |
|
|
| ## 5. Fused kernel (`_fused.py`, `backend="fused"`) |
| |
| One launch. **One program = one image + a BLOCK of its output pixels.** Each output pixel |
| reads the **full 2-D window** directly: `MAX_TAPS_H × MAX_TAPS_W` input pixels. |
| |
| ``` |
| grid = (num_images, ceil(out_h*out_w / BLOCK)) |
|
|
| per lane (one output pixel): |
| oy, ox = (flat_index // out_w, flat_index % out_w) |
| center_y, center_x, first_tap_y, first_tap_x # section 2, both axes |
|
|
| # weight_sum factorizes across axes (separable math, even though the LOADS are 2-D): |
| sum_wy = Σ_ty filter_y ; sum_wx = Σ_tx filter_x ; denom = sum_wy * sum_wx |
| |
| for channel: |
| acc = 0 |
| for ty in 0..MAX_TAPS_H: # <-- the 2-D window: TAPS_H * TAPS_W loads |
| for tx in 0..MAX_TAPS_W: |
| weight = filter_y(ty) * filter_x(tx) |
| pixel = input_pixels[channel, clamp(tap_y), clamp(tap_x)] |
| acc += weight * pixel |
| acc = acc / denom |
| out[image, channel, oy, ox] = (acc - mean[channel]) / std[channel] |
| ``` |
| |
| Cost per output pixel: `TAPS_H * TAPS_W` loads (e.g. 13×13 = **169**). Correct and simple, |
| but the 2-D load count is what makes it slow — hence the separable version. |
|
|
| --- |
|
|
| ## 6. Separable kernel (`_separable.py`, `backend="separable"`, the default) |
| |
| Same math, but the 2-D window is done as **two 1-D passes**, with a float intermediate |
| buffer in between. Loads per output pixel: `TAPS_W + TAPS_H` (e.g. 13+13 = **26**). |
| |
| ``` |
| input_pixels (uint8, C×H×W) --pass1--> intermediate (float, C×H×out_w) --pass2--> output (float, C×out_h×out_w) |
| resize W resize H + normalize |
| ``` |
| |
| The **intermediate** is the key object: same **height** as the input, but already the |
| **final width**. ("Tall and narrow.") It is also ragged in height, so it gets its own |
| offset table (built in `separable_resize_normalize`, same scheme as `pack_images`). |
|
|
| ### Pass 1 — `_horizontal_resize_kernel` (resize width only) |
| |
| ``` |
| grid = (num_images, ceil(H*out_w / BLOCK)) # work = every input row × every output col |
| per lane: |
| input_row = flat_index // out_w # row index, UNCHANGED by this pass |
| out_col = flat_index % out_w # output column being computed |
| |
| center_x, first_tap_x, col_weight_sum # section 2, COLUMN axis only |
| |
| for channel: |
| acc = 0 |
| for tap in 0..MAX_TAPS_COL: # <-- 1-D: only TAPS_W loads |
| weight = filter_x(tap) |
| pixel = input_pixels[channel, input_row, clamp(tap_col)] # uint8 -> float |
| acc += weight * pixel |
| acc = acc / col_weight_sum |
| intermediate[channel, input_row, out_col] = acc # NO normalize yet |
| ``` |
| |
| Reads original `uint8` bytes; writes `float32`. No normalization here. |
|
|
| ### Pass 2 — `_vertical_resize_normalize_kernel` (resize height, then normalize) |
|
|
| ``` |
| grid = (num_images, ceil(out_h*out_w / BLOCK)) # work = every output pixel |
| per lane: |
| out_row = flat_index // out_w |
| out_col = flat_index % out_w |
| |
| center_y, first_tap_y, row_weight_sum # section 2, ROW axis only |
| |
| for channel: |
| acc = 0 |
| for tap in 0..MAX_TAPS_ROW: # <-- 1-D: only TAPS_H loads |
| weight = filter_y(tap) |
| pixel = intermediate[channel, clamp(tap_row), out_col] # float |
| acc += weight * pixel |
| acc = acc / row_weight_sum |
| out[image, channel, out_row, out_col] = (acc - mean[channel]) / std[channel] # normalize here |
| ``` |
|
|
| Two launches (an implicit sync between them), so pass 2 always sees pass 1's finished |
| output. |
|
|
| ### Why separable wins |
|
|
| `TAPS_W + TAPS_H` loads instead of `TAPS_W * TAPS_H`. For a 13×13 window that is 26 vs 169. |
| This is exactly the algorithm PIL and torchvision use. The catch: an extra full-size float |
| intermediate buffer (more memory traffic), but the read-count reduction dominates. |
|
|
| Parity note: the intermediate here is **float32**; torchvision keeps a **fixed-point |
| uint8** intermediate. So the separable output is parity-*close* to torchvision, not |
| bit-identical — and the float version is actually the more accurate one. |
|
|
| --- |
|
|
| ## 7. Public API (`__init__.py`) |
|
|
| ``` |
| resize_normalize(images, size, image_mean, image_std, |
| rescale_factor=1/255, resample="bilinear", antialias=False, |
| backend="separable", block=256) |
| ``` |
| - `images`: stacked `(N,C,H,W)` tensor or a list of CHW tensors. |
| - `size`: int (square), `(H,W)`, or `{"height","width"}`. |
| - `resample`: `"bilinear"`/`"bicubic"`, or a PIL resample int (0/2→bilinear, 3→bicubic). |
| - `backend`: `"separable"` (default, fastest) or `"fused"` (2-D reference). |
| - `resize_normalize_ragged`: same kernels, list-only. |
|
|
| --- |
|
|
| ## 8. Benchmark findings (A100, CUDA_VISIBLE_DEVICES=1) |
|
|
| ### Standalone resize+normalize — SigLIP-so400m config, N=32 ragged 384–1024², out 384², bicubic+AA |
| ``` |
| torchvision eager loop : 2.91 ms (per-image float loop) |
| torchvision compiled : 5.70 ms (torch.compile dynamic, per-image; slower than eager) |
| torchvision compiled pkt: 2.55 ms (one graph over a padded stack; timing only) |
| fused triton (2D) : 11.49 ms (taps*taps; the slow reference) |
| separable triton (uint8): 1.29 ms (taps+taps) <-- fastest |
| real processor : 3.92 ms |
| ``` |
| **Separable is ~3× the real processor**, parity ≤1e-4 vs torchvision-float. The fused 2-D |
| loses for the algorithmic reason above (169 vs 26 loads). `torch.compile` does not help: |
| per-image it is *slower* (dispatch overhead over 32 ragged shapes); even as one packed |
| graph it only matches the eager loop, because inductor's interpolate is no faster than aten |
| resize. |
|
|
| ### End-to-end inference — Siglip2-base-patch16-224, **bf16** forward |
| ``` |
| preprocess forward(fixed input) preprocess+forward |
| processor 3.99 12.86 14.44 |
| separable 0.93 13.02 13.76 <-- ~5% faster e2e |
| fused 2.00 13.01 14.79 |
| compiled 6.14 12.89 14.00 |
| feature parity (separable/fused/compiled vs processor): 9.38e-2 = 1.2% of feature max |
| ``` |
| - `forward(fixed input)` is identical (~12.9 ms) for all → **no inference regression**; the |
| model does not care which preprocessor made the tensor. |
| - The 1.2% feature drift is the float-vs-uint8 resize difference, identical across all |
| float backends → not a bug. The float path is the more accurate one. |
| - End-to-end win is ~5% with a bf16 forward (was ~0.5% with fp32, where the forward was |
| ~80 ms). **The win scales with how preprocessing-bound you are.** |
|
|
| ### Data path from JPEG bytes — 552 KB/img |
| ``` |
| CPU decode + torchvision resize : 177.5 ms (status quo) |
| CPU decode + separable kernel : 176.4 ms (kernel saves ~1 ms; decode dominates) |
| GPU decode (nvJPEG) + kernel : 14.8 ms (fully on-GPU) |
| ``` |
| - ~175 ms of the 177 ms is **CPU JPEG decode + host→device copy**. Resize/normalize is ~1%. |
| - The 12× win (177→15) is **GPU decode (nvJPEG)**, i.e. `torchvision.io.decode_jpeg(device="cuda")` |
| — *not* the kernel. The kernel is the resize/normalize component of that GPU pipeline. |
|
|
| --- |
|
|
| ## 9. What is true / what to claim |
|
|
| - The kernel is **correct** (≤1e-4 vs torchvision-float, more accurate than the processor's |
| uint8 path) and feeds the model with **no inference regression**. |
| - It is **~3× the real processor at the resize/normalize stage** — a real, parity-clean win. |
| - It does **not** speed up preprocessing 12×. Decode dominates the data path; the GPU-decode |
| lever is nvJPEG, a torchvision feature, not this kernel. |
| - The kernel matters end-to-end only once you are **not decode-bound**: in a GPU-decode |
| pipeline it keeps resize/normalize minimal (~10% of that pipeline), and its standalone |
| preprocess win shows up when the forward is small (bf16, small model, large batch). |
| - Honest one-liner: *"GPU-native resize+normalize, 3× the fast processor at that stage, |
| drop-in for a GPU-decode pipeline."* |
|
|