	# kernel_image_resize — how every op works (study notes)

	This explains the whole package end to end: the resampling math, the data layout, and
	every kernel op, plus the benchmark findings. It is meant for pen & paper — there is a
	fully worked numeric example you can reproduce by hand in the "Worked example" section.

	The package does one thing: resize + rescale + normalize, the op sequence a
	`transformers` fast image processor runs (`TorchvisionBackend`: resize, then
	`(x*rescale - mean)/std`), as a single GPU pipeline. Input: raw CHW `uint8` images
	(any size, ragged). Output: `(N, C, out_h, out_w)` normalized `float32`.

	---

	## 1. The one idea behind resizing

	Resizing does not "pick" pixels; each output pixel is a **weighted average of a small
	window of input pixels**. Two things define that average:

	1. Where in the input an output pixel lands (its center).
	2. Which input pixels are in its window, and with what weights.

	The weight of an input pixel falls off with distance from the center, following a filter
	curve:

	- bilinear → a triangle, width 1 on each side (so 2 input pixels per axis normally).
	- bicubic → a cubic bump, width 2 on each side (so 4 input pixels per axis normally).

	When you shrink an image, you must also blur first or you get aliasing. That is
	what `antialias=True` does: it widens the window so each output pixel averages more input
	pixels (a low-pass filter before throwing pixels away). Widening is proportional to the
	shrink factor, so shrinking 3× turns a 4-tap bicubic into ~13 taps.

	---

	## 2. The resampling-weight formula (the heart of everything)

	All kernels use the same formula, which matches PyTorch's aten `UpSampleKernel`
	(`align_corners=False`, "half-pixel" convention). For one axis:

	```
	scale = in_size / out_size # > 1 means shrinking
	interp_half = 1 (bilinear) or 2 (bicubic) # half-width of the filter
	cubic_a = -0.75 (no antialias) or -0.5 (antialias) # the cubic curve's shape constant

	# antialias only widens the window, and only when shrinking:
	if antialias and scale > 1:
	    eff = scale # window widens by the shrink factor
	else:
	    eff = 1 # plain 2-tap / 4-tap interpolation
	support = interp_half * eff # half-window width, in INPUT pixels
	inv = 1 / eff # squashes the filter curve to match the widened window

	# for output index i:
	center = scale * (i + 0.5) # input coordinate this output maps to
	first_tap = floor(center - support + 0.5) # leftmost input pixel in the window

	# for each tap t = 0, 1, 2, ... (up to MAX_TAPS):
	tap_pos = first_tap + t # an input pixel index
	arg = (tap_pos - center + 0.5) * inv # distance from center, squashed
	weight = filter(arg) # triangle or cubic, see below
	```

	The filter (`_resample_weight` in `_fused.py`), with `x = |arg|`:

	```
	bilinear: max(1 - x, 0)                            # triangle, zero past 1

	bicubic:  x <= 1    : (a+2)x^3 - (a+3)x^2 + 1
	          1 < x < 2 : a x^3 - 5a x^2 + 8a x - 4a
	          else      : 0
	```
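
The two filter curves can be checked numerically with a few lines of plain Python. This is a minimal sketch, not the Triton `_resample_weight` itself: the name `resample_weight` and its argument order are assumptions made for illustration. It also shows that four bicubic taps at a fractional offset sum to 1.

```python
def resample_weight(arg, interp="bilinear", a=-0.75):
    """Filter value at the (already squashed) distance `arg` from the center."""
    x = abs(arg)
    if interp == "bilinear":
        return max(1.0 - x, 0.0)                      # triangle, zero past 1
    if interp == "bicubic":
        if x <= 1.0:
            return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
        if x < 2.0:
            return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
        return 0.0
    raise ValueError(f"unknown interp: {interp!r}")

# Four bicubic taps around a center that sits 0.25 past the second tap.
bicubic_taps = [resample_weight(d, "bicubic") for d in (-1.25, -0.25, 0.75, 1.75)]
bicubic_sum = sum(bicubic_taps)                       # 1 up to float rounding
```

The outer two taps come out negative (that is the cubic's overshoot), which is why the sum still lands on 1.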

	Two edge rules (both kernels do this identically):

	- non-antialias: clamp the tap index into `[0, in_size-1]` → replicates the border
	pixel. The filter weights of a standard 2/4-tap interpolation already sum to 1.
	- antialias: instead set the weight to 0 for taps that fall off the image
	(`tap_pos < 0` or `>= in_size`), then renormalize by dividing by the sum of the
	realized weights. This keeps the average correct at the edges.

	That renormalization is why every kernel computes a `weight_sum` and divides by it. For
	the non-antialias case `weight_sum == 1`, so the division is a harmless no-op.
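
Putting the formula and both edge rules together for one axis could look like the sketch below. It is a CPU stand-in for study purposes only: `filter_weight` and `axis_taps` are hypothetical names, and the loop bound `ceil(support) * 2 + 1` is borrowed from the `max_taps` description in section 4.

```python
import math

def filter_weight(arg, interp, a):
    """Triangle or cubic filter from the formula above, with x = |arg|."""
    x = abs(arg)
    if interp == "bilinear":
        return max(1.0 - x, 0.0)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

def axis_taps(i, in_size, out_size, interp="bilinear", antialias=False):
    """Return [(input_index, normalized_weight), ...] for output index i."""
    scale = in_size / out_size
    interp_half = 1 if interp == "bilinear" else 2
    cubic_a = -0.5 if antialias else -0.75
    eff = scale if (antialias and scale > 1) else 1.0
    support = interp_half * eff
    inv = 1.0 / eff
    center = scale * (i + 0.5)
    first_tap = math.floor(center - support + 0.5)
    n_taps = math.ceil(support) * 2 + 1               # fixed loop bound

    taps = []
    for t in range(n_taps):
        tap_pos = first_tap + t
        weight = filter_weight((tap_pos - center + 0.5) * inv, interp, cubic_a)
        if antialias and not (0 <= tap_pos < in_size):
            weight = 0.0                              # antialias: drop off-image taps
        index = min(max(tap_pos, 0), in_size - 1)     # clamp so the index stays valid
        taps.append((index, weight))
    weight_sum = sum(w for _, w in taps)
    return [(index, w / weight_sum) for index, w in taps]

plain = axis_taps(0, 4, 2)                  # 2-tap bilinear, weight_sum is already 1
aa = axis_taps(0, 4, 2, antialias=True)     # widened window, renormalized at the edge
```

With `antialias=True` the leftmost tap falls at index -1, gets weight 0, and the remaining weights are rescaled from a sum of 1.75 back to 1.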

	---

	## 3. Worked example (do this by hand)

	Setup: bilinear, no antialias, one axis, `in_size=4`, `out_size=2`.

	```
	scale = 4/2 = 2, interp_half = 1, eff = 1, support = 1, inv = 1
	```

	Output pixel `i = 0`:
	```
	center = 2 * (0 + 0.5) = 1.0
	first_tap = floor(1.0 - 1 + 0.5) = floor(0.5) = 0
	t=0: tap_pos=0, arg=(0-1.0+0.5)= -0.5 -> weight = 1-0.5 = 0.5
	t=1: tap_pos=1, arg=(1-1.0+0.5)= 0.5 -> weight = 1-0.5 = 0.5
	t=2: tap_pos=2, arg=(2-1.0+0.5)= 1.5 -> weight = max(1-1.5,0) = 0
	weight_sum = 1.0
	output[0] = (0.5*in[0] + 0.5*in[1]) / 1.0 # halfway between in[0] and in[1]
	```

	Output pixel `i = 1`:
	```
	center = 2 * 1.5 = 3.0
	first_tap = floor(3.0 - 1 + 0.5) = floor(2.5) = 2
	t=0: tap_pos=2, arg=-0.5 -> 0.5
	t=1: tap_pos=3, arg= 0.5 -> 0.5
	t=2: tap_pos=4, arg= 1.5 -> 0 (index 4 would clamp to 3, but weight is 0 anyway)
	output[1] = 0.5*in[2] + 0.5*in[3]
	```

	This 1-D operation is exactly one pass of the separable kernel. The 2-D result is the same
	formula applied on both axes (rows and columns).
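
To check the hand computation against concrete numbers, the same steps can be run on a 4-pixel row. This is a minimal sketch restricted to the bilinear, no-antialias case; `resize_1d_bilinear` is a hypothetical helper, not a function from the package.

```python
import math

def resize_1d_bilinear(row, out_size):
    """Bilinear, no antialias, one axis: the worked example as a loop."""
    in_size = len(row)
    scale = in_size / out_size
    out = []
    for i in range(out_size):
        center = scale * (i + 0.5)
        first_tap = math.floor(center - 1 + 0.5)       # support = 1
        acc = 0.0
        weight_sum = 0.0
        for t in range(3):                             # t = 0, 1, 2 as above
            tap_pos = first_tap + t
            weight = max(1.0 - abs(tap_pos - center + 0.5), 0.0)
            clamped = min(max(tap_pos, 0), in_size - 1)  # replicate the border
            acc += weight * row[clamped]
            weight_sum += weight
        out.append(acc / weight_sum)
    return out

resized = resize_1d_bilinear([10.0, 20.0, 30.0, 40.0], 2)
print(resized)  # [15.0, 35.0]
```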

	---

	## 4. Data layout (host side, `_pack.py`)

	Ragged images (different H×W) cannot be stacked into one tensor, so they are flattened and
	concatenated into one buffer, with side tables describing each image.

	`pack_images(images, dtype)` →
	```
	input_pixels : 1-D buffer, all images flattened (C,H,W row-major) and concatenated
	offsets[n] : element index where image n starts
	heights[n], widths[n] : that image's H and W
	channels : C (shared by all images)
	```
	Address of input pixel `(channel, row, col)` of image `n`:
	```
	input_pixels[ offsets[n] + channel*(H*W) + row*W + col ]
	```
	The separable path packs as `uint8` (1 byte/pixel, a quarter of the memory traffic of `float32`).
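
The packing scheme and the address formula can be tried on nested Python lists. This is only a sketch of the layout: the real `pack_images` works on tensors and takes a `dtype`, and `pack_images_sketch` and `pixel_at` are hypothetical names.

```python
def pack_images_sketch(images):
    """images: list of nested lists shaped [C][H][W] (ragged H and W allowed)."""
    buffer, offsets, heights, widths = [], [], [], []
    channels = len(images[0])
    for img in images:
        assert len(img) == channels, "all images must share C"
        offsets.append(len(buffer))          # element index where this image starts
        heights.append(len(img[0]))
        widths.append(len(img[0][0]))
        for plane in img:                    # C, then H, then W: row-major
            for row in plane:
                buffer.extend(row)
    return buffer, offsets, heights, widths, channels

def pixel_at(buffer, offsets, heights, widths, n, channel, row, col):
    """Address formula from above."""
    H, W = heights[n], widths[n]
    return buffer[offsets[n] + channel * (H * W) + row * W + col]

img0 = [[[1, 2]], [[3, 4]]]          # C=2, H=1, W=2
img1 = [[[5], [6]], [[7], [8]]]      # C=2, H=2, W=1
buffer, offsets, heights, widths, channels = pack_images_sketch([img0, img1])
```

Here `offsets` is `[0, 4]`, so image 1's channel 1, row 0, column 0 lives at `4 + 1*(2*1) + 0*1 + 0 = 6`.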

	`fold_mean_std(mean, std, rescale)` → folds the rescale factor into the normalization
	constants so the kernel does a single `(x - m)/s`:
	```
	m = mean / rescale s = std / rescale
	(x - m)/s == (x*rescale - mean)/std # identical to the processor's fused normalize
	```
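
A quick numeric check of the folding identity, as a sketch: `fold_mean_std_sketch` is a hypothetical list-based stand-in, and the mean/std values of 0.5 are illustrative, not taken from the package.

```python
def fold_mean_std_sketch(mean, std, rescale):
    """Fold the rescale factor into per-channel mean and std."""
    m = [mu / rescale for mu in mean]
    s = [sd / rescale for sd in std]
    return m, s

mean, std, rescale = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], 1 / 255
m, s = fold_mean_std_sketch(mean, std, rescale)

x = 200                                              # a raw uint8 pixel value
folded = (x - m[0]) / s[0]                           # what the kernel computes
reference = (x * rescale - mean[0]) / std[0]         # rescale, then normalize
```

Both expressions agree up to float rounding, so the kernel can skip the separate rescale multiply.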

	`max_taps(images, out_size, axis, interp, antialias)` → the widest window in the batch
	= `ceil(support) * 2 + 1`. A Triton loop bound must be a compile-time constant, so every
	program loops this fixed count; taps beyond a given pixel's real window get ~0 weight.
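
The loop bound follows directly from the support formula in section 2. A sketch, assuming one list of input sizes along a single axis (the real `max_taps` takes the images and an `axis`; `max_taps_sketch` is a hypothetical name):

```python
import math

def max_taps_sketch(in_sizes, out_size, interp="bicubic", antialias=True):
    """Widest window over a batch, for one axis: ceil(support) * 2 + 1."""
    interp_half = 1 if interp == "bilinear" else 2
    widest = 0
    for in_size in in_sizes:
        scale = in_size / out_size
        eff = scale if (antialias and scale > 1) else 1.0
        support = interp_half * eff
        widest = max(widest, math.ceil(support) * 2 + 1)
    return widest

aa_bicubic = max_taps_sketch([384, 1024], 384)                    # 13
plain_bicubic = max_taps_sketch([1152], 384, antialias=False)     # 5
plain_bilinear = max_taps_sketch([1152], 384, "bilinear", False)  # 3
```

The 1024 → 384 case has `support = 2 * (1024/384) ≈ 5.33`, which rounds up to 6 and gives 13 taps. Without antialias the bound is 5 for bicubic and 3 for bilinear, one more than the 4 or 2 taps that actually carry weight.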

	`as_image_list` → accepts a stacked `(N,C,H,W)` tensor or a list, always returns a list.

	---

	## 5. Fused kernel (`_fused.py`, `backend="fused"`)

	One launch. One program = one image + a BLOCK of its output pixels. Each output pixel
	reads the full 2-D window directly: `MAX_TAPS_H × MAX_TAPS_W` input pixels.

	```
	grid = (num_images, ceil(out_h*out_w / BLOCK))

	per lane (one output pixel):
	    oy, ox = (flat_index // out_w, flat_index % out_w)
	    center_y, center_x, first_tap_y, first_tap_x # section 2, both axes

	    # weight_sum factorizes across axes (separable math, even though the LOADS are 2-D):
	    sum_wy = Σ_ty filter_y ; sum_wx = Σ_tx filter_x ; denom = sum_wy * sum_wx

	    for channel:
	        acc = 0
	        for ty in 0..MAX_TAPS_H: # <-- the 2-D window: TAPS_H * TAPS_W loads
	            for tx in 0..MAX_TAPS_W:
	                weight = filter_y(ty) * filter_x(tx)
	                pixel = input_pixels[channel, clamp(tap_y), clamp(tap_x)]
	                acc += weight * pixel
	        acc = acc / denom
	        out[image, channel, oy, ox] = (acc - mean[channel]) / std[channel]
	```

	Cost per output pixel: `TAPS_H * TAPS_W` loads (e.g. 13×13 = 169). Correct and simple,
	but the 2-D load count is what makes it slow — hence the separable version.
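
The inner loop for a single output pixel and channel can be sketched on the CPU as follows. `fused_pixel` and the weight values are made up for illustration; the point is the factorized denominator and the `TAPS_H * TAPS_W` load count.

```python
def fused_pixel(window, wy, wx):
    """window[ty][tx]: input pixels of the 2-D window; wy, wx: raw per-axis weights."""
    acc = 0.0
    loads = 0
    for ty, row in enumerate(window):
        for tx, pixel in enumerate(row):
            acc += wy[ty] * wx[tx] * pixel   # 2-D weight is the product of two 1-D weights
            loads += 1
    denom = sum(wy) * sum(wx)                # weight_sum factorizes across axes
    return acc / denom, loads

window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 row taps x 2 column taps
wy = [1.0, 2.0, 1.0]                            # unnormalized row weights
wx = [3.0, 1.0]                                 # unnormalized column weights
value, loads = fused_pixel(window, wy, wx)      # (3.25, 6)

# The 2-D sum of weights equals the product of the two 1-D sums.
sum_2d = sum(a * b for a in wy for b in wx)
```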

	---

	## 6. Separable kernel (`_separable.py`, `backend="separable"`, the default)

	Same math, but the 2-D window is done as two 1-D passes, with a float intermediate
	buffer in between. Loads per output pixel: `TAPS_W + TAPS_H` (e.g. 13+13 = 26).

	```
	input_pixels (uint8, C×H×W) --pass1--> intermediate (float, C×H×out_w) --pass2--> output (float, C×out_h×out_w)
	                            resize W                                   resize H + normalize
	```

	The intermediate is the key object: same height as the input, but already the
	final width. ("Tall and narrow.") It is also ragged in height, so it gets its own
	offset table (built in `separable_resize_normalize`, same scheme as `pack_images`).

	### Pass 1 — `_horizontal_resize_kernel` (resize width only)

	```
	grid = (num_images, ceil(H*out_w / BLOCK)) # work = every input row × every output col
	per lane:
	    input_row = flat_index // out_w # row index, UNCHANGED by this pass
	    out_col = flat_index % out_w # output column being computed

	    center_x, first_tap_x, col_weight_sum # section 2, COLUMN axis only

	    for channel:
	        acc = 0
	        for tap in 0..MAX_TAPS_COL: # <-- 1-D: only TAPS_W loads
	            weight = filter_x(tap)
	            pixel = input_pixels[channel, input_row, clamp(tap_col)] # uint8 -> float
	            acc += weight * pixel
	        acc = acc / col_weight_sum
	        intermediate[channel, input_row, out_col] = acc # NO normalize yet
	```

	Reads original `uint8` bytes; writes `float32`. No normalization here.

	### Pass 2 — `_vertical_resize_normalize_kernel` (resize height, then normalize)

	```
	grid = (num_images, ceil(out_h*out_w / BLOCK)) # work = every output pixel
	per lane:
	    out_row = flat_index // out_w
	    out_col = flat_index % out_w

	    center_y, first_tap_y, row_weight_sum # section 2, ROW axis only

	    for channel:
	        acc = 0
	        for tap in 0..MAX_TAPS_ROW: # <-- 1-D: only TAPS_H loads
	            weight = filter_y(tap)
	            pixel = intermediate[channel, clamp(tap_row), out_col] # float
	            acc += weight * pixel
	        acc = acc / row_weight_sum
	        out[image, channel, out_row, out_col] = (acc - mean[channel]) / std[channel] # normalize here
	```

	Two launches (an implicit sync between them), so pass 2 always sees pass 1's finished
	output.
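
That the two 1-D passes reproduce the 2-D window can be checked on a tiny example. This is a sketch for one output pixel and one channel; the helper names, the weights and the folded constants `m`, `s` are all illustrative assumptions.

```python
def horizontal_pass(window, wx):
    """Pass 1: one intermediate value per input row, no normalization."""
    return [sum(w * p for w, p in zip(wx, row)) / sum(wx) for row in window]

def vertical_pass_normalize(column, wy, m, s):
    """Pass 2: reduce the intermediate column, then apply (x - m) / s."""
    acc = sum(w * v for w, v in zip(wy, column)) / sum(wy)
    return (acc - m) / s

def direct_2d_normalize(window, wy, wx, m, s):
    """Reference: the full 2-D window in one go."""
    acc = sum(wy[ty] * wx[tx] * window[ty][tx]
              for ty in range(len(wy)) for tx in range(len(wx)))
    return (acc / (sum(wy) * sum(wx)) - m) / s

window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
wy, wx = [1.0, 2.0, 1.0], [3.0, 1.0]
m, s = 1.25, 2.0                                      # folded mean and std

intermediate = horizontal_pass(window, wx)            # [1.25, 3.25, 5.25]
two_pass = vertical_pass_normalize(intermediate, wy, m, s)
direct = direct_2d_normalize(window, wy, wx, m, s)    # same value
```

Both routes give 1.0 here. In general they agree up to float rounding, because dividing by `sum(wx)` and then `sum(wy)` is the same as dividing once by their product.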

	### Why separable wins

	`TAPS_W + TAPS_H` loads instead of `TAPS_W * TAPS_H`. For a 13×13 window that is 26 vs 169.
	This is exactly the algorithm PIL and torchvision use. The catch: an extra full-size float
	intermediate buffer (more memory traffic), but the read-count reduction dominates.
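
A rough load count over a whole image, per channel, can be derived from the grids listed above. This sketch ignores caching and only counts loop iterations. It assumes a 1152² input shrunk to 384² with 13 taps per axis, which are illustrative sizes and not a benchmark configuration.

```python
def total_loads(h, w, out_h, out_w, taps_h, taps_w):
    """Loop-iteration counts per channel, derived from the launch grids."""
    fused = out_h * out_w * taps_h * taps_w   # every output pixel reads the 2-D window
    pass1 = h * out_w * taps_w                # every input row x every output column
    pass2 = out_h * out_w * taps_h            # every output pixel
    return fused, pass1 + pass2

fused_loads, separable_loads = total_loads(1152, 1152, 384, 384, 13, 13)
ratio = fused_loads / separable_loads         # 3.25
```

Counted per lane the comparison is 169 versus 13 + 13. Counted over the image, pass 1 covers all `H` input rows rather than `out_h`, so for a 3× shrink the total is about 3.25× fewer loads rather than 6.5×. The separable path still comes out well ahead.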

	Parity note: the intermediate here is float32; torchvision keeps a **fixed-point
	uint8** intermediate. So the separable output is parity-close to torchvision, not
	bit-identical — and the float version is actually the more accurate one.

	---

	## 7. Public API (`__init__.py`)

	```
	resize_normalize(images, size, image_mean, image_std,
	                 rescale_factor=1/255, resample="bilinear", antialias=False,
	                 backend="separable", block=256)
	```
	- `images`: stacked `(N,C,H,W)` tensor or a list of CHW tensors.
	- `size`: int (square), `(H,W)`, or `{"height","width"}`.
	- `resample`: `"bilinear"`/`"bicubic"`, or a PIL resample int (0/2→bilinear, 3→bicubic).
	- `backend`: `"separable"` (default, fastest) or `"fused"` (2-D reference).
	- `resize_normalize_ragged`: same kernels, list-only.
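
The accepted forms of `size` and `resample` listed above could be normalized along these lines. Both helpers are hypothetical and only mirror the rules in the list; how the package actually parses its arguments is not shown in these notes.

```python
def normalize_size(size):
    """int (square), (H, W), or {"height", "width"} -> (out_h, out_w)."""
    if isinstance(size, int):
        return size, size
    if isinstance(size, dict):
        return size["height"], size["width"]
    out_h, out_w = size
    return out_h, out_w

def normalize_resample(resample):
    """Name or PIL resample int -> "bilinear" or "bicubic"."""
    if resample in ("bilinear", "bicubic"):
        return resample
    pil_codes = {0: "bilinear", 2: "bilinear", 3: "bicubic"}   # mapping from the list above
    if resample in pil_codes:
        return pil_codes[resample]
    raise ValueError(f"unsupported resample: {resample!r}")

square = normalize_size(384)
from_dict = normalize_size({"height": 224, "width": 320})
mode = normalize_resample(3)
```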

	---

	## 8. Benchmark findings (A100, CUDA_VISIBLE_DEVICES=1)

	### Standalone resize+normalize — SigLIP-so400m config, N=32 ragged 384–1024², out 384², bicubic+AA
	```
	torchvision eager loop   :  2.91 ms  (per-image float loop)
	torchvision compiled     :  5.70 ms  (torch.compile dynamic, per-image; slower than eager)
	torchvision compiled pkt :  2.55 ms  (one graph over a padded stack; timing only)
	fused triton (2D)        : 11.49 ms  (taps*taps; the slow reference)
	separable triton (uint8) :  1.29 ms  (taps+taps)  <-- fastest
	real processor           :  3.92 ms
	```
	Separable is ~3× faster than the real processor (1.29 ms vs 3.92 ms), with parity ≤1e-4 vs torchvision-float. The fused 2-D
	loses for the algorithmic reason above (169 vs 26 loads). `torch.compile` does not help:
	per-image it is slower (dispatch overhead over 32 ragged shapes); even as one packed
	graph it only matches the eager loop, because inductor's interpolate is no faster than aten
	resize.

	### End-to-end inference — Siglip2-base-patch16-224, bf16 forward
	```
	(all times in ms)
	            preprocess   forward(fixed input)   preprocess+forward
	processor      3.99            12.86                 14.44
	separable      0.93            13.02                 13.76   <-- ~5% faster e2e
	fused          2.00            13.01                 14.79
	compiled       6.14            12.89                 14.00
	feature parity (separable/fused/compiled vs processor): 9.38e-2 = 1.2% of feature max
	```
	- `forward(fixed input)` is identical (~12.9 ms) for all → no inference regression; the
	model does not care which preprocessor made the tensor.
	- The 1.2% feature drift is the float-vs-uint8 resize difference, identical across all
	float backends → not a bug. The float path is the more accurate one.
	- End-to-end win is ~5% with a bf16 forward (was ~0.5% with fp32, where the forward was
	~80 ms). The win scales with how preprocessing-bound you are.

	### Data path from JPEG bytes — 552 KB/img
	```
	CPU decode + torchvision resize : 177.5 ms (status quo)
	CPU decode + separable kernel : 176.4 ms (kernel saves ~1 ms; decode dominates)
	GPU decode (nvJPEG) + kernel : 14.8 ms (fully on-GPU)
	```
	- ~175 ms of the 177 ms is CPU JPEG decode + host→device copy. Resize/normalize is ~1%.
	- The 12× win (177→15) is GPU decode (nvJPEG), i.e. `torchvision.io.decode_jpeg(device="cuda")`
	— not the kernel. The kernel is the resize/normalize component of that GPU pipeline.
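
The headline ratios quoted in this section can be rechecked from the reported timings. The numbers below are copied from the tables above and nothing is measured here.

```python
# Standalone resize+normalize: real processor vs separable kernel (ms).
standalone_speedup = 3.92 / 1.29                  # about 3.04x

# End-to-end preprocess+forward: processor vs separable (ms).
e2e_gain = (14.44 - 13.76) / 14.44                # about 4.7%

# JPEG data path: CPU decode + torchvision vs GPU decode + kernel (ms).
decode_speedup = 177.5 / 14.8                     # about 12x

# What the kernel alone saves when decode stays on the CPU (ms).
kernel_saving_ms = 177.5 - 176.4                  # about 1.1 ms
kernel_saving_share = kernel_saving_ms / 177.5    # under 1% of the data path
```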

	---

	## 9. What is true / what to claim

	- The kernel is correct (≤1e-4 vs torchvision-float, more accurate than the processor's
	uint8 path) and feeds the model with no inference regression.
	- It is ~3× faster than the real processor at the resize/normalize stage: a real, parity-clean win.
	- It does not speed up preprocessing 12×. Decode dominates the data path; the GPU-decode
	lever is nvJPEG, a torchvision feature, not this kernel.
	- The kernel matters end-to-end only once you are not decode-bound: in a GPU-decode
	pipeline it keeps resize/normalize minimal (~10% of that pipeline), and its standalone
	preprocess win shows up when the forward is small (bf16, small model, large batch).
	- Honest one-liner: *"GPU-native resize+normalize, 3× the fast processor at that stage,
	drop-in for a GPU-decode pipeline."*