---
license: apache-2.0
base_model: facebook/EUPE-ViT-B
tags:
- semantic-segmentation
- frozen-backbone
- vision-transformer
- architecture-search
library_name: pytorch
datasets:
- scene_parse_150
---

# Segmentation Heads

A systematic study of segmentation head architectures operating on frozen vision transformer features. Given a dense spatial feature grid from any frozen backbone, what is the most parameter-efficient architecture for per-pixel semantic classification?

Standard practice treats the backbone and segmentation decoder as a joint system. Recent universal encoders produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on segmentation data. Under this regime, the head is the only variable.

This repository contains an arena framework for rapid comparison of segmentation head candidates and a collection of architectures spanning conventional decoders through novel minimal-parameter designs. All heads consume the same spatial feature tensor and produce per-pixel class predictions.

The reference backbone is [EUPE-ViT-B](https://huggingface.co/facebook/EUPE-ViT-B) (86M parameters, frozen), but the framework is backbone-agnostic: the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.

## Heads

Twelve architectures, all consuming a `[B, 768, H, W]` spatial feature tensor and producing `[B, 150, H_out, W_out]` ADE20K class logits. Each head lives in its own folder under `heads/` with a single `head.py` implementation.

| Name | Architecture | Origin |
|------|-------------|--------|
| `linear_probe` | BatchNorm + 1×1 conv. The EUPE paper baseline. | Bolya et al., 2025 (PEspatial recipe) |
| `cofiber_linear` | Adjoint cofiber decomposition + shared 1×1 conv per scale | Original |
| `cofiber_threshold` | Cofiber decomposition + per-scale LayerNorm + prototype classification | Original |
| `prototype_bank` | Per-class learned prototypes, cosine similarity, no conv | Original |
| `wavelet` | Haar wavelet decomposition + per-subband classification | Original |
| `patch_attention` | Each patch attends to its k nearest neighbors before classifying | Original |
| `graph_crf` | k-NN graph in feature space, gated message passing | Original |
| `hypercolumn_linear` | Concatenate features from intermediate ViT blocks, single linear layer | Hariharan et al., 2015 |
| `info_bottleneck` | Project to d ≪ 768 dimensions, classify from the compressed representation | Original |
| `tropical` | Tropical inner product replaces standard dot product | Original |
| `compression` | Surprise-based feature modulation + linear classification | Original |
| `curvature` | Discrete Riemannian curvature modulation + linear classification | Original |

## Arena Framework

`arena.py` runs any head by name against cached ADE20K backbone features. The arena pre-extracts features once, then each candidate trains and evaluates without touching the backbone again. Training is cross-entropy at 512×512 resolution against the 150-class ADE20K label space; evaluation reports mean Intersection-over-Union (mIoU).

## Status

Heads are implemented and importable through the `heads/` registry. The arena screening sweep across all 12 heads has not yet been run on a fresh ADE20K cache; results will be published here when available.
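
As a concrete reference for the `linear_probe` row in the Heads table, the following is a minimal sketch of a BatchNorm + 1×1 conv head with the `[B, 768, H, W]` to `[B, 150, H_out, W_out]` contract described above. It is not the code in `heads/linear_probe/head.py`: the class name `LinearProbeHead`, the constructor arguments, and the bilinear upsampling to 512×512 are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProbeHead(nn.Module):
    """Hypothetical sketch: BatchNorm followed by a 1x1 conv over frozen features."""

    def __init__(self, in_channels: int = 768, num_classes: int = 150):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [B, in_channels, H, W] -> logits: [B, num_classes, H, W]
        return self.classifier(self.norm(features))


head = LinearProbeHead()

# Random stand-in for cached backbone features: a 512x512 image at stride 16
# gives a 32x32 grid.
features = torch.randn(2, 768, 32, 32)

with torch.no_grad():
    logits = head(features)
    # Assumption: logits are upsampled bilinearly to label resolution before the loss.
    upsampled = F.interpolate(logits, size=(512, 512), mode="bilinear", align_corners=False)

# Trainable parameter count of the head alone (the backbone stays frozen).
num_params = sum(p.numel() for p in head.parameters())
```

Under these assumptions the head has roughly 117K trainable parameters, which gives a sense of scale against the 86M-parameter frozen backbone.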
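
The mIoU metric named in the Arena Framework section can be illustrated with a small dependency-free sketch. It accumulates per-class intersection and union counts over flattened predictions and labels, then averages the IoU over classes that occur at least once. The function name `mean_iou`, the ignore label `255`, and the choice to skip absent classes are assumptions for illustration; the convention used by `arena.py` may differ.

```python
from typing import Sequence


def mean_iou(
    preds: Sequence[int],
    labels: Sequence[int],
    num_classes: int = 150,
    ignore_index: int = 255,  # assumed ignore label, not taken from the repository
) -> float:
    """Mean Intersection-over-Union over flattened per-pixel class ids."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(preds, labels):
        if t == ignore_index:
            continue  # unlabeled pixels do not contribute
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatched pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that appear in predictions or labels.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and one ignored pixel.
toy_preds = [0, 0, 1, 1, 2, 2]
toy_labels = [0, 1, 1, 1, 2, 255]
toy_miou = mean_iou(toy_preds, toy_labels, num_classes=3)
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so `toy_miou` is 13/18. A tensor-based version would typically build the same counts from a confusion matrix accumulated across the whole validation set.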