Detection Heads
A systematic study of detection head architectures operating on frozen vision transformer features. Given a dense spatial feature grid from any frozen backbone, what is the most parameter-efficient way to predict bounding boxes and class labels?
The standard approach to object detection treats the backbone and head as a joint system, trained or fine-tuned together. Recent universal encoders (EUPE, DINOv2, DINOv3, SigLIP2, RADIO) produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on detection data. Under this regime, the head is the only variable. Its architecture, parameter count, and inductive biases become the entire design problem.
This repository contains an arena framework for rapid comparison of detection head candidates and a collection of architectures ranging from conventional (FCOS with feature pyramid, CenterNet) to novel (prototype-contrastive classification, wavelet multi-scale decomposition, graph message passing, adjoint-scale decomposition from stable category theory). All heads consume the same [B, C, H, W] spatial feature tensor and produce per-image lists of bounding boxes with class labels and confidence scores. The backbone is interchangeable.
Motivation
The detection head for Argus (a multi-task perception system on a frozen EUPE-ViT-B backbone) uses an FCOS detector with a ViTDet-style simple feature pyramid. That head has 16.14 million parameters and reaches 41.0 mAP on COCO val2017 with the backbone frozen throughout. Parameter analysis reveals that 69.5% of those parameters (11.2M) are spent on the feature pyramid, synthesizing spatial scales that the backbone did not produce, while only 1.2% (0.2M) perform the actual classification and box regression. The question is whether architectures that avoid spatial synthesis entirely can match or exceed that performance at a fraction of the parameter cost.
Arena Framework
The arena pre-caches backbone features for a subset of the training set, then each candidate head trains and evaluates against the cached features without touching the backbone. This removes the backbone forward pass (which dominates per-step wall time) and reduces each 500-step training run to under two minutes.
1. Run frozen backbone over N training images. Cache spatial features
and intermediate block outputs to disk. One-time cost.
2. For each candidate: load cached features, train for 500 steps,
run inference on diagnostic images, report loss curve and detections.
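The two-phase loop can be sketched as follows. This is a minimal illustration, not the real arena scripts: run_backbone, head, and loss_fn are hypothetical stand-ins for the framework's components, and the cache lives in memory rather than on disk.

```python
import numpy as np

def cache_features(images, run_backbone):
    # Phase 1: one-time backbone pass; features stored by image id.
    return {i: run_backbone(img) for i, img in enumerate(images)}

def train_candidate(cache, head, loss_fn, steps=500, seed=42):
    # Phase 2: every step samples from the cache, never the backbone.
    # The shared seed means every candidate sees the same image order.
    rng = np.random.default_rng(seed)
    ids = list(cache)
    losses = []
    for _ in range(steps):
        feat = cache[ids[rng.integers(len(ids))]]
        losses.append(loss_fn(head(feat)))
    return losses
```

Because the backbone forward pass dominates per-step wall time, amortizing it once across all candidates is what brings each 500-step run under two minutes.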
Evaluation Protocol
Backbone: EUPE-ViT-B (86M parameters, frozen). Features cached at 640-pixel input with letterbox padding. Spatial output: 768 channels, 40×40 patches (stride 16). Intermediate block outputs [2, 5, 8, 11] also cached for heads that use them. The arena framework is backbone-agnostic: the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
Domains: 21 total. COCO (80 classes) plus 20 RF100-VL datasets in COCO annotation format:
| Domain | Category | Classes | Train images |
|---|---|---|---|
| coco | Standard | 80 | 300 (cached from 117K) |
| actions | Sport | 6 | 300 |
| aerial-airport | Aerial | 1 | 233 |
| all-elements | Document | 10 | 300 |
| aquarium-combined | Flora/Fauna | 7 | 300 |
| defect-detection | Industrial | 4 | 300 |
| dentalai | Lab Imaging | 4 | 300 |
| flir-camera-objects | Misc | 4 | 300 |
| gwhd2021 | Flora/Fauna | 1 | 300 |
| lacrosse-object-detection | Sport | 4 | 300 |
| new-defects-in-wood | Misc | 5 | 300 |
| orionproducts | Misc | 8 | 300 |
| paper-parts | Document | 19 | 300 |
| recode-waste | Industrial | 6 | 300 |
| soda-bottles | Misc | 3 | 300 |
| the-dreidel-project | Misc | 6 | 300 |
| trail-camera | Flora/Fauna | 2 | 300 |
| water-meter | Industrial | 10 | 300 |
| wb-prova | Flora/Fauna | 3 | 300 |
| wildfire-smoke | Aerial | 1 | 300 |
| x-ray-id | Lab Imaging | 6 | 300 |
Training: Each head trained from scratch per domain. 300 cached training images, random sampling with replacement at batch 1. Seed 42 for weight initialization and batch sampling. AdamW optimizer, lr 1e-3, weight decay 1e-4, gradient clipping at 5.0. No learning rate schedule, no augmentation (features are pre-extracted).
Screening evaluation: 10 validation images per domain with ground truth boxes. Precision and recall at IoU ≥ 0.5 with class matching. Score threshold 0.3. Metrics averaged across all 21 domains. The screening table below was computed on this protocol. A rerun at 1000 validation images per domain is pending.
COCO mAP evaluation: Full pycocotools protocol on 5,000 COCO val2017 images. mAP@[0.5:0.95] with standard IoU thresholds. Used for the converged Cofiber Threshold results reported below the screening table.
Screening: 2,000 steps (2,000 images seen). Each head sees the same training images in the same order due to the shared seed.
Extended training: 15,000 steps (15,000 images seen, ~50 passes over the 300-image cache). Same protocol otherwise.
Cross-Domain Screening (18 heads, 10 val images per domain)
These results, computed on 10 validation images per domain, are coarse screening signals rather than converged evaluations.
| Name | Architecture | Params | Avg Prec | Avg Recall | Total TP |
|---|---|---|---|---|---|
| Wavelet | Haar wavelet multi-scale, no learned downsampling | 1.02M | 0.724 | 0.305 | 876 |
| Depth Fusion | Shallow + deep ViT block fusion, depth conditioning | 0.99M | 0.656 | 0.247 | 879 |
| Hook FCOS | FCOS on reassembled intermediate blocks [2, 5, 8, 11] | 2.42M | 0.629 | 0.349 | 1070 |
| Slim FCOS | FCOS, 128 channels, 2-layer towers | 2.51M | 0.575 | 0.310 | 1027 |
| Prototype Contrastive | 80 learned class vectors, cosine similarity | 1.41M | 0.571 | 0.248 | 866 |
| CenterNet | Heatmap + offset + size on stride-16, no FPN | 2.38M | 0.548 | 0.511 | 1267 |
| Threshold Prototype | Single-layer class prototypes, Heaviside-compatible | 0.07M | 0.534 | 0.143 | 475 |
| Cofiber Threshold | Adjoint cofiber decomposition + LayerNorm + threshold | 0.07M | 0.475 | 0.193 | 478 |
| Baseline FCOS | Standard FCOS + ViTDet simple feature pyramid | 16.14M | 0.470 | 0.245 | 969 |
| Feature Graph | k-NN graph in feature space, gated message passing | 2.81M | 0.385 | 0.150 | 712 |
| Cofiber Linear | Adjoint cofiber decomposition, shared linear heads | 0.07M | 0.236 | 0.082 | 251 |
| Scale Classify | Scale as discrete bins, not continuous regression | 1.62M | 0.077 | 0.390 | 916 |
| Adaptive Query | Image-derived queries via attention pooling | 1.10M | 0.018 | 0.018 | 20 |
| Sparse Query | 100 fixed learned queries, 2 cross-attention layers | 1.89M | 0.002 | 0.001 | 7 |
| Cascade Pool | 400→100→50 queries, progressive pruning | 2.76M | 0.001 | 0.001 | 3 |
| Patch Assembly | Per-patch centroid voting, box from grouped votes | 5.65M | 0.000 | 0.000 | 0 |
| Mutual Attention | 6 self-attention blocks over all patch tokens | 5.13M | 0.000 | 0.000 | 0 |
| Relational Corners | Objects as scored patch pairs, low-rank bilinear | 3.09M | 0.000 | 0.000 | 0 |
Candidates with zero precision have not crossed the confidence threshold for detection at this training horizon; they may produce detections with longer training.
Additional heads (not yet screened across domains)
| Name | Architecture | Params |
|---|---|---|
| Cofiber CenterNet | Cofiber decomposition + heatmap loss | 0.07M |
| Cofiber 5-Scale | 5-level cofiber decomposition | 0.07M |
| Cofiber Adaptive | Learned pool sizes | 0.07M |
| Optimal Transport | Gaussian class distributions, Mahalanobis classification | 0.13M |
| Tropical | Tropical algebra inner product replaces standard dot product | 0.07M |
| Compression | Surprise-based spatial filtering + prototype classification | 0.07M |
| Curvature | Discrete Riemannian curvature modulation + prototype classification | 0.07M |
References
- Baseline FCOS, Slim FCOS, Hook FCOS: Tian et al., "FCOS: Fully Convolutional One-Stage Object Detection," ICCV 2019. Hook variant uses intermediate ViT features from Li et al., "Exploring Plain Vision Transformer Backbones for Object Detection," ECCV 2022 (ViTDet).
- CenterNet: Zhou et al., "Objects as Points," arXiv 2019.
- Sparse Query, Adaptive Query, Cascade Pool: Conceptually related to Carion et al., "End-to-End Object Detection with Transformers," ECCV 2020 (DETR).
- Feature Graph: Related to Hu et al., "Relation Networks for Object Detection," CVPR 2018. The k-NN construction in feature space rather than spatial space is original to this work.
- Depth Fusion: Uses intermediate ViT features (Li et al. 2022); depth-conditioned detection is related to Wang et al., "FCOS3D," NeurIPS 2021. The unsupervised depth signal from detection loss alone is original.
- Wavelet, Prototype Contrastive, Scale Classify, Patch Assembly, Mutual Attention, Relational Corners: Original architectures.
- Cofiber Linear, Cofiber Threshold, Threshold Prototype: Original architectures. The cofiber decomposition is derived from adjoint functors in stable category theory; see phanerozoic/threshold-cofiber-detection for the machine-checked proof.
Adjoint-scale decomposition
The feature pyramid in conventional detection heads is a learned approximation of multi-scale decomposition. Three candidates in the arena (Cofiber Linear, Threshold Prototype, Cofiber Threshold) replace it with an analytic decomposition using the adjoint structure of upsampling and downsampling operators. Bilinear interpolation (upsample, zero parameters) and average pooling (downsample, zero parameters) form an adjoint pair. The cofiber of the round-trip map, f - upsample(downsample(f)), isolates the information present at a given scale but absent at the next coarser scale.
- Cofiber Linear: Analytic decomposition followed by shared linear prediction heads across all scales. 65,000 parameters.
- Threshold Prototype: Single-layer prototype classification on the raw stride-16 features, no scale decomposition. Integer-weight-compatible, deployable on neuromorphic hardware. 65,000 parameters.
- Cofiber Threshold: Analytic decomposition followed by per-scale LayerNorm and prototype classification. Combines the multi-scale structure of the first with the threshold-compatible prediction of the second. 65,000 parameters.
The decomposition is derived from the structure of adjoint functors and distinguished triangles in stable category theory. The cofiber of the unit map η: 1 → ΩΣ in a stable category decomposes an object into its per-scale components via an exact sequence. Applied to the spatial feature grid, this gives a lossless multi-scale representation at zero parameter cost.
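A minimal numpy sketch of the decomposition. It uses nearest-neighbor upsampling in place of the heads' bilinear interpolation to stay dependency-free; the telescoping construction makes reconstruction exact for either choice of operators.

```python
import numpy as np

def downsample(f):
    # 2x2 average pooling: zero-parameter coarsening of a [B, C, H, W] grid.
    B, C, H, W = f.shape
    return f.reshape(B, C, H // 2, 2, W // 2, 2).mean(axis=(3, 5))

def upsample(f):
    # Nearest-neighbor upsampling (the heads use bilinear): zero parameters.
    return f.repeat(2, axis=2).repeat(2, axis=3)

def cofiber_decompose(f, levels=3):
    """Split f into per-scale cofiber components plus a coarse residual."""
    components = []
    cur = f
    for _ in range(levels - 1):
        coarse = downsample(cur)
        # Cofiber of the round-trip map: detail absent at the next coarser scale.
        components.append(cur - upsample(coarse))
        cur = coarse
    components.append(cur)  # coarsest residual
    return components

def reconstruct(components):
    # The decomposition is lossless: upsample-and-add recovers f exactly.
    cur = components[-1]
    for detail in reversed(components[:-1]):
        cur = upsample(cur) + detail
    return cur
```

Each component grid then feeds a shared (Cofiber Linear) or per-scale normalized (Cofiber Threshold) prediction head; the decomposition itself contributes no parameters.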
Extended training (15,000 steps)
Threshold Prototype and Cofiber Threshold were trained for 15,000 steps at batch 1 across all 21 domains. Both improved substantially over the 2,000-step screening.
| Name | Params | Avg Prec (2K) | Avg Prec (15K) | Avg Recall (15K) | Total TP (15K) |
|---|---|---|---|---|---|
| Threshold Prototype | 0.07M | 0.534 | 0.627 | 0.242 | 597 |
| Cofiber Threshold | 0.07M | 0.475 | 0.617 | 0.368 | 719 |
Both 65K-parameter heads exceed the Baseline FCOS (16.14M, 0.470 avg precision at 2,000 steps) on cross-domain precision with 230x fewer parameters. Cofiber Threshold produces 20% more true positives than Threshold Prototype (719 vs 597), reflecting the advantage of multi-scale cofiber decomposition over single-stride prediction at longer training horizons.
Cofiber Threshold variants (15,000 steps)
Three variants of Cofiber Threshold were tested to probe whether the original 3-scale fixed decomposition is optimal.
| Variant | Architecture | Params | Avg Prec | Avg Recall | Total TP |
|---|---|---|---|---|---|
| Cofiber Threshold (original) | 3 fixed scales, FCOS loss | 0.07M | 0.617 | 0.368 | 719 |
| Cofiber Adaptive | Learned pool sizes, FCOS loss | 0.07M | 0.584 | 0.345 | 759 |
| Cofiber 5-Scale | 5 fixed scales, FCOS loss | 0.07M | 0.579 | 0.356 | 724 |
| Cofiber CenterNet | 3 fixed scales, heatmap loss | 0.07M | 0.190 | 0.573 | 1238 |
The original design leads on precision. Additional scales, adaptive boundaries, and an alternative loss formulation did not improve over the fixed 3-scale FCOS configuration. The CenterNet variant produces the highest recall and total detections but at substantially lower precision.
COCO val2017 (pycocotools, 5000 images)
Cofiber Threshold variants trained on full COCO 2017 train (117,266 images). Frozen EUPE-ViT-B backbone, 640-pixel input with letterbox padding. Evaluated with pycocotools on the standard 5000-image val set.
| Variant | Box regression | Params | Nonzero | mAP@[0.5:0.95] | mAP@0.50 | mAP@0.75 |
|---|---|---|---|---|---|---|
| linear_70k | 768→4 | 69,976 | 69,976 | 4.0 | 15.8 | 0.8 |
| box32_92k | 768→32→4 | 91,640 | 91,640 | 5.7 | 20.6 | 1.3 |
| box32 pruned R1 | 768→32→4 | 91,640 | 76,640 | 5.7 | 20.7 | 1.3 |
| box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | 5.9 | 20.4 | 1.5 |
| box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
| dim20 | 768→20→80 cls, 20→16→4 reg | 22,076 | 22,076 | 3.9 | 14.8 | 0.9 |
| dim20 R1 (project 25.7% sparse) | 768→20→80 cls | 22,076 | 18,121 | 3.9 | 14.6 | 0.8 |
| dim20 R2 (project 26.6% sparse) | 768→20→80 cls | 22,076 | 17,988 | 3.8 | 14.5 | 0.7 |
| dim20 cls_weight pruned (37%) | 596 of 1600 cls weights zeroed | 22,076 | 21,480 | 3.8 | 14.4 | 0.7 |
| dim20 reg_hidden pruned (17%) | 55 of 320 reg weights zeroed | 22,076 | 22,021 | 3.8 | 14.5 | 0.7 |
| dim20 reg_out pruned (12%) | 8 of 64 reg weights zeroed | 22,076 | 22,068 | 3.8 | 14.5 | 0.7 |
| dim20 ctr_weight pruned (90%) | 18 of 20 ctr weights zeroed | 22,076 | 22,058 | 3.7 | 14.2 | 0.7 |
| dim20 R1 + cls greedy | project 25.7% + cls 45% sparse | 22,076 | 17,406 | 3.5 | 13.4 | 0.6 |
| dim20 joint (from R1) | whole-head magnitude pruning | 22,076 | 17,129 | 3.6 | 13.7 | 0.6 |
| dim15 | 768→15→80 cls, 15→16→4 reg | 17,751 | 17,751 | 3.0 | 11.5 | 0.7 |
| dim10 | 768→10→80 cls, 10→16→4 reg | 13,426 | 13,426 | 1.5 | 5.6 | 0.4 |
| dim5 | 768→5→80 cls, 5→16→4 reg | 9,101 | 9,101 | 0.3 | 1.3 | 0.1 |
Pruning improved mAP from 5.7 to 5.9 at R2 (~62K nonzero) by removing noisy prototype weights. R3 pushed past the degradation threshold. SVD analysis of the R2 prototypes showed effective rank ~20 for 72% energy retention, motivating the dim20 variant: a 768→20 bottleneck projection followed by 20→80 classification, initialized from the SVD vectors of the pruned prototypes. Dim20 produces 3.9 mAP from 22,076 parameters: an 80-class COCO detector in 22K params.
The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. The mAP scaling across dim20 → dim15 → dim10 is roughly geometric (3.9 → 3.0 → 1.5), but the curve falls off a cliff between 10 and 5 dimensions. Five directions sit below the intrinsic capacity needed for 80-class separation; the floor lies between 5 and 10 bottleneck dimensions, and finer probes at dim7 or dim8 would localize the exact cliff.
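The SVD-initialization recipe above can be sketched as follows. This is a hypothetical illustration: svd_bottleneck_init and the small prototype matrix stand in for the real pruned R2 checkpoint and training-script internals.

```python
import numpy as np

def svd_bottleneck_init(prototypes, dim):
    """Build a C -> dim bottleneck projection from the top-dim right singular
    vectors of a pruned prototype matrix of shape [num_classes, C]."""
    U, S, Vt = np.linalg.svd(prototypes, full_matrices=False)
    project = Vt[:dim]                                 # [dim, C] projection init
    # Fraction of squared-singular-value "energy" the top-dim directions retain.
    energy = float((S[:dim] ** 2).sum() / (S ** 2).sum())
    # Classifier re-expressed in the bottleneck basis: [num_classes, dim].
    cls_init = prototypes @ project.T
    return project, cls_init, energy
```

The energy ratio is the quantity reported in the table's variants (72% at dim20, 67% at dim15, and so on); both the projection and the classifier start from the pruned prototypes rather than random weights.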
Dim20 was then itself put through magnitude pruning. The mAP-driven pruner bisects over the magnitude-sorted weight list of a chosen parameter, scores each candidate sparsity by pycocotools mAP@[0.5:0.95] on a 1000-image validation subset, and rolls back any pass that fails the 95% retention floor on full verification. It was run separately on each learned parameter of dim20, plus a joint-magnitude variant that ranks every weight in the head against every other.
The leading Pareto point is R1 (project layer 25.7% sparse, 18,121 nonzero, 3.9 mAP): same mAP as unpruned dim20 with 18% fewer effective parameters, and the highest mAP-per-10K-parameter ratio in the table at 2.15. R2 pushes to 26.6% project sparsity (17,988 nonzero) at a small mAP cost (3.8). Per-parameter slack measurements:
- project.weight: 26.6% prunable (the sparsity that produced R1/R2)
- cls_weight: 37% prunable in isolation, 3.8 mAP at 21,480 nonzero
- reg_hidden.weight: 17% prunable, 3.8 mAP at 22,021 nonzero
- reg_out.weight: 12% prunable, 3.8 mAP at 22,068 nonzero
- ctr_weight: 90% prunable (only 2 of 20 centerness weights load-bear), 3.7 mAP at 22,058 nonzero
Greedy stacking of cls_weight pruning on top of R1 reaches 17,406 nonzero but drops to 3.5 mAP, revealing an interaction between the parameters: the cls_weight slack measured against unpruned dim20 partly comes from compensating for the surviving project subspace, so removing it after pruning project costs more mAP than the per-parameter measurement suggested. Joint magnitude pruning across all 22K head weights (starting from R1) finds 17,129 nonzero at 3.6 mAP, the smallest dim20 found, but it does not Pareto-dominate R1: the bisection's 1000-image mAP proxy was systematically optimistic relative to the full 5000-image eval, so the 95% retention floor measured during pruning accepted more aggressive cuts than the full eval would have. R1 remains the leading point of the dim20 pruning Pareto.
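The bisection loop can be sketched as follows. This is a simplified illustration, not the real prune.py: score_fn stands in for the pycocotools mAP evaluation, rollback-on-full-verification is omitted, and the sketch assumes the score degrades monotonically with sparsity.

```python
import numpy as np

def prune_by_magnitude(w, sparsity):
    """Zero the smallest-magnitude entries of w at the given sparsity
    (ties at the threshold may zero slightly more than the exact count)."""
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return w.copy()
    thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

def bisect_max_sparsity(w, score_fn, floor, iters=12):
    """Find the highest sparsity whose score stays at or above `floor`
    (e.g. 95% of the unpruned mAP) by bisection over [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if score_fn(prune_by_magnitude(w, mid)) >= floor:
            lo = mid  # mid passes the retention floor; push further
        else:
            hi = mid
    return lo
```

In the real pruner the expensive part is score_fn (a full detection eval per probe), which is why bisection over the magnitude-sorted list is preferred to a linear sweep.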
Training recipe
| Variant | Script | Epochs | Batch | Optimizer | LR | Schedule | Initialization |
|---|---|---|---|---|---|---|---|
| linear (linear_70k) | cache_and_train_fast.py (legacy train.py) | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | random |
| box32 (box32_92k) | cache_and_train_fast.py (legacy train.py) | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | random |
| box32 pruned R1/R2/R3 | prune.py (legacy TP-retention pruner) | – | – | – | – | – | from box32 checkpoint |
| dim20 | cache_and_train_fast.py --dim 20 | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes |
| dim15 | cache_and_train_fast.py --dim 15 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim10 | cache_and_train_fast.py --dim 10 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim5 | cache_and_train_fast.py --dim 5 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim20 pruned R1/R2 | prune.py --param project.weight | – | – | – | – | – | from dim20 checkpoint, mAP-driven bisection pruner |
The original box32 pruning sequence used an early TP-retention pruner that was later shown to be too lenient (it accepted weight reductions that increased TP via false positives). The dim20 pruning was done with the rewritten prune.py that scores by full pycocotools mAP, bisects over the magnitude-sorted weight list, and rolls back any pass that fails the 95% mAP retention floor on the 1000-image validation subset.
Checkpoints
Trained weights and per-checkpoint pycocotools eval JSONs live in heads/cofiber_threshold/<variant>/ (the <variant> folder name is the simple variant identifier: linear, box32, dim5, dim10, dim15, dim20, etc.). The same files are mirrored to phanerozoic/threshold-cofiber-detection via scripts/build_threshold_repo.py.
Repository Structure
Each head is a self-contained folder under heads/ with its own implementation and documentation. Shared loss functions are in losses/, shared utilities in utils/. The arena scripts (arena.py, multi_domain_arena.py) run any head by name against cached backbone features.
License
The arena framework and all candidate architectures are released under the Apache 2.0 license. The framework is backbone-agnostic; users are responsible for complying with the license terms of whatever backbone they use to generate cached features.
Model tree for phanerozoic/detection-heads
Base model
facebook/EUPE-ViT-B