Prism
A multi-task head that attaches to a frozen vision backbone and produces detection, segmentation, and depth predictions from a single shared feature decomposition. Developed on EUPE-ViT-B (86M parameters, Meta FAIR) but applicable to any backbone that outputs dense spatial features.
The spatial features from the backbone are separated into scale bands by iteratively subtracting a downsampled-then-upsampled copy of the features from the features themselves. Each subtraction isolates the information present at one spatial resolution but absent from the next coarser resolution. This decomposition requires no learned parameters β it is a fixed sequence of average pooling, bilinear upsampling, and subtraction. The resulting scale bands are the input to all task heads.
Each task is a linear projection on the shared scale bands. Detection, segmentation, and depth differ only in their output weight matrices. The decomposition and the per-scale normalization are shared across all tasks and computed once per image.
Image β EUPE-ViT-B (frozen, 86M params) β spatial features [768, H, W]
Scale decomposition (0 learned params):
band_0 = features - upsample(pool(features)) stride 16
band_1 = pool(features) - upsample(pool(pool(features))) stride 32
band_2 = pool(pool(features)) stride 64
Task outputs (one linear layer each):
βββ Detection (768 β 85 per location) ~65K params
βββ Segmentation (768 β 150 per location) ~115K params
βββ Depth (768 β 256 per location) ~197K params
βββ [any spatial task] (768 β N)
All three task heads combined: approximately 377K learned parameters. The decomposition and normalization add zero. For comparison, the standard Argus multi-task heads (FCOS detection + linear segmentation + DPT depth) total approximately 30M parameters on the same backbone.
The scale decomposition is derived from the adjoint structure of the upsample/downsample pair. A machine-checked proof of the decomposition's cross-scale orthogonality properties is available at phanerozoic/threshold-cofiber-detection.