Detection Heads

A systematic study of detection head architectures operating on frozen vision transformer features. Given a dense spatial feature grid from any frozen backbone, what is the most parameter-efficient way to predict bounding boxes and class labels?

The standard approach to object detection treats the backbone and head as a joint system, trained or fine-tuned together. Recent universal encoders (EUPE, DINOv2, DINOv3, SigLIP2, RADIO) produce spatial features of sufficient quality that the backbone can remain frozen while a lightweight head is trained on detection data. Under this regime, the head is the only variable. Its architecture, parameter count, and inductive biases become the entire design problem.

This repository contains an arena framework for rapid comparison of detection head candidates and a collection of architectures ranging from conventional (FCOS with feature pyramid, CenterNet) through novel (prototype-contrastive classification, wavelet multi-scale decomposition, graph message passing, adjoint-scale decomposition from stable category theory). All heads consume the same [B, C, H, W] spatial feature tensor and produce per-image lists of bounding boxes with class labels and confidence scores. The backbone is interchangeable.

Motivation

The detection head for Argus (a multi-task perception system on a frozen EUPE-ViT-B backbone) uses an FCOS detector with a ViTDet-style simple feature pyramid. That head has 16.14 million parameters and reaches 41.0 mAP on COCO val2017 with the backbone frozen throughout. Parameter analysis reveals that 69.5% of those parameters (11.2M) are spent on the feature pyramid, synthesizing spatial scales that the backbone did not produce, while only 1.2% (0.2M) perform the actual classification and box regression. The question is whether architectures that avoid spatial synthesis entirely can match or exceed that performance at a fraction of the parameter cost.

Arena Framework

The arena pre-caches backbone features for a subset of the training set, then each candidate head trains and evaluates against the cached features without touching the backbone. This removes the backbone forward pass (which dominates per-step wall time) and reduces each 500-step training run to under two minutes.

1. Run frozen backbone over N training images. Cache spatial features
   and intermediate block outputs to disk. One-time cost.
2. For each candidate: load cached features, train for 500 steps,
   run inference on diagnostic images, report loss curve and detections.
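The two-phase loop can be sketched as follows. The helper names here are illustrative, not the repository's actual API (the real entry points are arena.py and multi_domain_arena.py), and the cached tensors stand in for real backbone features:

```python
# Minimal sketch of the arena's two-phase loop. Names are illustrative;
# the real entry points are arena.py / multi_domain_arena.py, and the
# cached objects are backbone feature tensors rather than toy dicts.
import os
import pickle
import tempfile

def cache_features(backbone, images, cache_dir):
    """Phase 1: run the frozen backbone once per image, persist to disk."""
    paths = []
    for i, image in enumerate(images):
        feats = backbone(image)                      # one-time forward pass
        path = os.path.join(cache_dir, f"feat_{i}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(feats, fh)
        paths.append(path)
    return paths

def train_head(head_step, paths, steps):
    """Phase 2: train a candidate head on cached features only.

    head_step(features) performs one optimizer step and returns its loss;
    no backbone forward pass happens anywhere inside this loop.
    """
    losses = []
    for step in range(steps):
        with open(paths[step % len(paths)], "rb") as fh:
            feats = pickle.load(fh)
        losses.append(head_step(feats))
    return losses

# Toy demo with a stand-in "backbone" that emits a feature stub.
with tempfile.TemporaryDirectory() as tmp:
    paths = cache_features(lambda img: {"mean": sum(img) / len(img)},
                           [[1.0, 2.0], [3.0, 5.0]], tmp)
    losses = train_head(lambda f: f["mean"], paths, steps=4)
```

Because the per-step cost is a disk read plus a small head forward/backward, many candidate heads can be screened against identical inputs.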

Evaluation Protocol

Backbone: EUPE-ViT-B (86M parameters, frozen). Features cached at 640-pixel input with letterbox padding. Spatial output: 768 channels, 40×40 patches (stride 16). Intermediate block outputs [2, 5, 8, 11] are also cached for heads that use them. The arena framework is backbone-agnostic: the same heads can be evaluated against any frozen ViT that produces a stride-16 spatial feature grid.
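The letterbox geometry works out as follows. This is a sketch; the centered-padding convention is an assumption, and the actual caching code may place padding differently:

```python
# Letterbox geometry for the 640-pixel caching protocol: scale the longer
# side to the target, preserve aspect ratio, pad the remainder. Centered
# padding is an assumption of this sketch, not confirmed repo behavior.
def letterbox_params(width, height, target=640):
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return scale, (new_w, new_h), (pad_x, pad_y)

# A 1280x720 image scales by 0.5 to 640x360 with 140 px of padding top and
# bottom; the 640x640 input then yields the 40x40 patch grid at stride 16.
scale, size, pad = letterbox_params(1280, 720)
```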

Domains: 21 total. COCO (80 classes) plus 20 RF100-VL datasets in COCO annotation format:

| Domain | Category | Classes | Train images |
|---|---|---|---|
| coco | Standard | 80 | 300 (cached from 117K) |
| actions | Sport | 6 | 300 |
| aerial-airport | Aerial | 1 | 233 |
| all-elements | Document | 10 | 300 |
| aquarium-combined | Flora/Fauna | 7 | 300 |
| defect-detection | Industrial | 4 | 300 |
| dentalai | Lab Imaging | 4 | 300 |
| flir-camera-objects | Misc | 4 | 300 |
| gwhd2021 | Flora/Fauna | 1 | 300 |
| lacrosse-object-detection | Sport | 4 | 300 |
| new-defects-in-wood | Misc | 5 | 300 |
| orionproducts | Misc | 8 | 300 |
| paper-parts | Document | 19 | 300 |
| recode-waste | Industrial | 6 | 300 |
| soda-bottles | Misc | 3 | 300 |
| the-dreidel-project | Misc | 6 | 300 |
| trail-camera | Flora/Fauna | 2 | 300 |
| water-meter | Industrial | 10 | 300 |
| wb-prova | Flora/Fauna | 3 | 300 |
| wildfire-smoke | Aerial | 1 | 300 |
| x-ray-id | Lab Imaging | 6 | 300 |

Training: Each head trained from scratch per domain. 300 cached training images, random sampling with replacement at batch 1. Seed 42 for weight initialization and batch sampling. AdamW optimizer, lr 1e-3, weight decay 1e-4, gradient clipping at 5.0. No learning rate schedule, no augmentation (features are pre-extracted).

Screening evaluation: 10 validation images per domain with ground truth boxes. Precision and recall at IoU ≥ 0.5 with class matching. Score threshold 0.3. Metrics averaged across all 21 domains. The screening table below was computed on this protocol. A rerun at 1000 validation images per domain is pending.
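The screening metric reduces to greedy one-to-one matching of predictions to ground truth at IoU ≥ 0.5 with class agreement. A minimal sketch, with illustrative function names (the repository's actual metric code may differ in tie-breaking details):

```python
# Screening metric sketch: precision/recall counting at IoU >= 0.5 with
# class matching, score threshold 0.3, greedy highest-score-first matching.
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def screen(preds, gts, score_thr=0.3, iou_thr=0.5):
    """preds: [(box, cls, score)]; gts: [(box, cls)]. Returns (tp, fp, fn)."""
    kept = sorted((p for p in preds if p[2] >= score_thr),
                  key=lambda p: -p[2])
    matched, tp = set(), 0
    for box, cls, _ in kept:
        best, best_iou = None, iou_thr
        for j, (gbox, gcls) in enumerate(gts):
            if j in matched or gcls != cls:
                continue                      # each GT matches at most once
            v = iou(box, gbox)
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:
            matched.add(best)
            tp += 1
    return tp, len(kept) - tp, len(gts) - tp
```

Precision is then tp / (tp + fp) and recall is tp / (tp + fn), averaged over domains.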

COCO mAP evaluation: Full pycocotools protocol on 5,000 COCO val2017 images. mAP@[0.5:0.95] with standard IoU thresholds. Used for the converged Cofiber Threshold results reported below the screening table.

Screening: 2,000 steps (2,000 images seen). Each head sees the same training images in the same order due to the shared seed.

Extended training: 15,000 steps (15,000 images seen, ~50 passes over the 300-image cache). Same protocol otherwise.

Cross-Domain Screening (18 heads, 10 val images per domain)

The following table was computed on 10 validation images per domain. These results are coarse screening signals, not converged evaluations. A rerun at 1000 validation images per domain is pending.

| Name | Architecture | Params | Avg Prec | Avg Recall | Total TP |
|---|---|---|---|---|---|
| Wavelet | Haar wavelet multi-scale, no learned downsampling | 1.02M | 0.724 | 0.305 | 876 |
| Depth Fusion | Shallow + deep ViT block fusion, depth conditioning | 0.99M | 0.656 | 0.247 | 879 |
| Hook FCOS | FCOS on reassembled intermediate blocks [2, 5, 8, 11] | 2.42M | 0.629 | 0.349 | 1070 |
| Slim FCOS | FCOS, 128 channels, 2-layer towers | 2.51M | 0.575 | 0.310 | 1027 |
| Prototype Contrastive | 80 learned class vectors, cosine similarity | 1.41M | 0.571 | 0.248 | 866 |
| CenterNet | Heatmap + offset + size on stride-16, no FPN | 2.38M | 0.548 | 0.511 | 1267 |
| Threshold Prototype | Single-layer class prototypes, Heaviside-compatible | 0.07M | 0.534 | 0.143 | 475 |
| Cofiber Threshold | Adjoint cofiber decomposition + LayerNorm + threshold | 0.07M | 0.475 | 0.193 | 478 |
| Baseline FCOS | Standard FCOS + ViTDet simple feature pyramid | 16.14M | 0.470 | 0.245 | 969 |
| Feature Graph | k-NN graph in feature space, gated message passing | 2.81M | 0.385 | 0.150 | 712 |
| Cofiber Linear | Adjoint cofiber decomposition, shared linear heads | 0.07M | 0.236 | 0.082 | 251 |
| Scale Classify | Scale as discrete bins, not continuous regression | 1.62M | 0.077 | 0.390 | 916 |
| Adaptive Query | Image-derived queries via attention pooling | 1.10M | 0.018 | 0.018 | 20 |
| Cascade Pool | 400→100→50 queries, progressive pruning | 2.76M | 0.001 | 0.001 | 3 |
| Sparse Query | 100 fixed learned queries, 2 cross-attention layers | 1.89M | 0.002 | 0.001 | 7 |
| Patch Assembly | Per-patch centroid voting, box from grouped votes | 5.65M | 0.000 | 0.000 | 0 |
| Mutual Attention | 6 self-attention blocks over all patch tokens | 5.13M | 0.000 | 0.000 | 0 |
| Relational Corners | Objects as scored patch pairs, low-rank bilinear | 3.09M | 0.000 | 0.000 | 0 |

Candidates with zero precision have not crossed the confidence threshold for detection at this training horizon; they may produce detections with longer training.

Additional heads (not yet screened across domains)

| Name | Architecture | Params |
|---|---|---|
| Cofiber CenterNet | Cofiber decomposition + heatmap loss | 0.07M |
| Cofiber 5-Scale | 5-level cofiber decomposition | 0.07M |
| Cofiber Adaptive | Learned pool sizes | 0.07M |
| Optimal Transport | Gaussian class distributions, Mahalanobis classification | 0.13M |
| Tropical | Tropical algebra inner product replaces standard dot product | 0.07M |
| Compression | Surprise-based spatial filtering + prototype classification | 0.07M |
| Curvature | Discrete Riemannian curvature modulation + prototype classification | 0.07M |

References

  • Baseline FCOS, Slim FCOS, Hook FCOS: Tian et al., "FCOS: Fully Convolutional One-Stage Object Detection," ICCV 2019. Hook variant uses intermediate ViT features from Li et al., "Exploring Plain Vision Transformer Backbones for Object Detection," ECCV 2022 (ViTDet).
  • CenterNet: Zhou et al., "Objects as Points," arXiv 2019.
  • Sparse Query, Adaptive Query, Cascade Pool: Conceptually related to Carion et al., "End-to-End Object Detection with Transformers," ECCV 2020 (DETR).
  • Feature Graph: Related to Hu et al., "Relation Networks for Object Detection," CVPR 2018. The k-NN construction in feature space rather than spatial space is original to this work.
  • Depth Fusion: Uses intermediate ViT features (Li et al. 2022); depth-conditioned detection is related to Wang et al., "FCOS3D," NeurIPS 2021. The unsupervised depth signal from detection loss alone is original.
  • Wavelet, Prototype Contrastive, Scale Classify, Patch Assembly, Mutual Attention, Relational Corners: Original architectures.
  • Cofiber Linear, Cofiber Threshold, Threshold Prototype: Original architectures. The cofiber decomposition is derived from adjoint functors in stable category theory; see phanerozoic/threshold-cofiber-detection for the machine-checked proof.

Adjoint-scale decomposition

The feature pyramid in conventional detection heads is a learned approximation of multi-scale decomposition. Three candidates in the arena (Cofiber Linear, Threshold Prototype, Cofiber Threshold) replace it with an analytic decomposition using the adjoint structure of upsampling and downsampling operators. Bilinear interpolation (upsample, zero parameters) and average pooling (downsample, zero parameters) form an adjoint pair. The cofiber of the round-trip map, f - upsample(downsample(f)), isolates the information present at a given scale but absent at the next coarser scale.

  • Cofiber Linear: Analytic decomposition followed by shared linear prediction heads across all scales. 65,000 parameters.
  • Threshold Prototype: Single-layer prototype classification on the raw stride-16 features, no scale decomposition. Integer-weight-compatible, deployable on neuromorphic hardware. 65,000 parameters.
  • Cofiber Threshold: Analytic decomposition followed by per-scale LayerNorm and prototype classification. Combines the multi-scale structure of the first with the threshold-compatible prediction of the second. 65,000 parameters.

The decomposition is derived from the structure of adjoint functors and distinguished triangles in stable category theory. The cofiber of the unit map η: 1 → ΩΣ in a stable category decomposes an object into its per-scale components via an exact sequence. Applied to the spatial feature grid, this gives a lossless multi-scale representation at zero parameter cost.
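One level of the decomposition can be sketched in a few lines. This sketch uses nearest-neighbor upsampling in place of the bilinear interpolation described above, purely for brevity; the losslessness identity f = cofiber(f) + upsample(downsample(f)) holds by construction either way:

```python
# One level of the adjoint-scale decomposition on an H x W grid (one channel).
# Nearest-neighbor upsampling stands in for bilinear interpolation here;
# the decomposition identity below is exact regardless of the upsampler.
def downsample(f):
    """2x average pooling (zero parameters). f: H x W nested list, H, W even."""
    H, W = len(f), len(f[0])
    return [[(f[2*i][2*j] + f[2*i][2*j+1]
              + f[2*i+1][2*j] + f[2*i+1][2*j+1]) / 4.0
             for j in range(W // 2)] for i in range(H // 2)]

def upsample(g):
    """2x upsampling (zero parameters), nearest-neighbor for this sketch."""
    return [[g[i // 2][j // 2] for j in range(2 * len(g[0]))]
            for i in range(2 * len(g))]

def cofiber(f):
    """Cofiber of the round trip: f - upsample(downsample(f)).

    Isolates detail present at this scale but absent at the next coarser
    one. Summing it with upsample(downsample(f)) recovers f exactly, so
    the multi-scale representation is lossless at zero parameter cost.
    """
    coarse = upsample(downsample(f))
    return [[f[i][j] - coarse[i][j] for j in range(len(f[0]))]
            for i in range(len(f))]
```

Iterating downsample on the coarse component and taking the cofiber at each level yields the 3-scale (or 5-scale) decompositions used by the Cofiber heads.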

Extended training (15,000 steps)

Threshold Prototype and Cofiber Threshold were trained for 15,000 steps at batch 1 across all 21 domains. Both improved substantially over the 2,000-step screening.

| Name | Params | Avg Prec (2K) | Avg Prec (15K) | Avg Recall (15K) | Total TP (15K) |
|---|---|---|---|---|---|
| Threshold Prototype | 0.07M | 0.534 | 0.627 | 0.242 | 597 |
| Cofiber Threshold | 0.07M | 0.475 | 0.617 | 0.368 | 719 |

Both 65K-parameter heads exceed the Baseline FCOS (16.14M, 0.470 avg precision at 2,000 steps) on cross-domain precision with 230x fewer parameters. Cofiber Threshold produces 20% more true positives than Threshold Prototype (719 vs 597), reflecting the advantage of multi-scale cofiber decomposition over single-stride prediction at longer training horizons.

Cofiber Threshold variants (15,000 steps)

Three variants of Cofiber Threshold were tested to probe whether the original 3-scale fixed decomposition is optimal.

| Variant | Architecture | Params | Avg Prec | Avg Recall | Total TP |
|---|---|---|---|---|---|
| Cofiber Threshold (original) | 3 fixed scales, FCOS loss | 0.07M | 0.617 | 0.368 | 719 |
| Cofiber Adaptive | Learned pool sizes, FCOS loss | 0.07M | 0.584 | 0.345 | 759 |
| Cofiber 5-Scale | 5 fixed scales, FCOS loss | 0.07M | 0.579 | 0.356 | 724 |
| Cofiber CenterNet | 3 fixed scales, heatmap loss | 0.07M | 0.190 | 0.573 | 1238 |

The original design leads on precision. Additional scales, adaptive boundaries, and an alternative loss formulation did not improve over the fixed 3-scale FCOS configuration. The CenterNet variant produces the highest recall and total detections but at substantially lower precision.

COCO val2017 (pycocotools, 5000 images)

Cofiber Threshold variants trained on full COCO 2017 train (117,266 images). Frozen EUPE-ViT-B backbone, 640-pixel input with letterbox padding. Evaluated with pycocotools on the standard 5000-image val set.

| Variant | Architecture | Params | Nonzero | mAP@[0.5:0.95] | mAP@0.50 | mAP@0.75 |
|---|---|---|---|---|---|---|
| linear_70k | 768→4 | 69,976 | 69,976 | 4.0 | 15.8 | 0.8 |
| box32_92k | 768→32→4 | 91,640 | 91,640 | 5.7 | 20.6 | 1.3 |
| box32 pruned R1 | 768→32→4 | 91,640 | 76,640 | 5.7 | 20.7 | 1.3 |
| box32 pruned R2 | 768→32→4 | 91,640 | ~62,000 | 5.9 | 20.4 | 1.5 |
| box32 pruned R3 | 768→32→4 | 91,640 | ~47,000 | 5.1 | 17.1 | 1.4 |
| dim20 | 768→20→80 cls, 20→16→4 reg | 22,076 | 22,076 | 3.9 | 14.8 | 0.9 |
| dim20 R1 (project 25.7% sparse) | 768→20→80 cls | 22,076 | 18,121 | 3.9 | 14.6 | 0.8 |
| dim20 R2 (project 26.6% sparse) | 768→20→80 cls | 22,076 | 17,988 | 3.8 | 14.5 | 0.7 |
| dim20 cls_weight pruned (37%) | 596 of 1600 cls weights zeroed | 22,076 | 21,480 | 3.8 | 14.4 | 0.7 |
| dim20 reg_hidden pruned (17%) | 55 of 320 reg weights zeroed | 22,076 | 22,021 | 3.8 | 14.5 | 0.7 |
| dim20 reg_out pruned (12%) | 8 of 64 reg weights zeroed | 22,076 | 22,068 | 3.8 | 14.5 | 0.7 |
| dim20 ctr_weight pruned (90%) | 18 of 20 ctr weights zeroed | 22,076 | 22,058 | 3.7 | 14.2 | 0.7 |
| dim20 R1 + cls greedy | project 25.7% + cls 45% sparse | 22,076 | 17,406 | 3.5 | 13.4 | 0.6 |
| dim20 joint (from R1) | whole-head magnitude pruning | 22,076 | 17,129 | 3.6 | 13.7 | 0.6 |
| dim15 | 768→15→80 cls, 15→16→4 reg | 17,751 | 17,751 | 3.0 | 11.5 | 0.7 |
| dim10 | 768→10→80 cls, 10→16→4 reg | 13,426 | 13,426 | 1.5 | 5.6 | 0.4 |
| dim5 | 768→5→80 cls, 5→16→4 reg | 9,101 | 9,101 | 0.3 | 1.3 | 0.1 |

Pruning improved mAP from 5.7 to 5.9 at R2 (~62K nonzero) by removing noisy prototype weights; R3 pushed past the degradation threshold. SVD analysis of the R2 prototypes showed an effective rank of ~20 for 72% energy retention, motivating the dim20 variant: a 768→20 bottleneck projection followed by 20→80 classification, initialized from the SVD vectors of the pruned prototypes. Dim20 produces 3.9 mAP from 22,076 parameters: an 80-class COCO detector in 22K params.
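The "effective rank for X% energy retention" criterion behind the dim choices can be illustrated with a small helper (illustrative code, not the repository's analysis script); given the singular values of the prototype matrix, it returns the smallest rank whose squared singular values carry the target fraction of total energy:

```python
# Effective-rank criterion: smallest k whose top-k singular values retain
# `target` of the squared-singular-value (Frobenius) energy. This is the
# rule that maps "72% energy retention" to the ~20-dimensional bottleneck.
def rank_for_energy(singular_values, target=0.72):
    energies = [s * s for s in singular_values]
    total = sum(energies)
    acc = 0.0
    for k, e in enumerate(energies, start=1):
        acc += e
        if acc / total >= target:
            return k
    return len(energies)
```

The top-k left singular vectors then initialize the 768→k projection, and the classifier is re-expressed in that k-dimensional subspace.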

The dim15, dim10, and dim5 variants push the bottleneck further with the same SVD-initialization recipe applied to the top 15, top 10, and top 5 directions of the pruned R2 prototype matrix. Dim15 (17,751 parameters, 67% SVD energy retention) reaches 3.0 mAP. Dim10 (13,426 parameters, 61% energy retention) reaches 1.5 mAP, the smallest 80-class COCO detector to clear the 1.0 mAP threshold. Dim5 (9,101 parameters, 53% energy retention) drops to 0.3 mAP. mAP declines steadily across dim20 → dim15 → dim10 (3.9 → 3.0 → 1.5), then falls off a cliff between 10 and 5 dimensions: five directions sit below the intrinsic capacity needed for 80-class separation. The floor lies between 5 and 10 bottleneck dimensions; finer probes at dim7 and dim8 would localize the exact cliff.

Dim20 was then itself put through magnitude pruning. The mAP-driven pruner bisects over the magnitude-sorted weight list of a chosen parameter, uses full pycocotools mAP@[0.5:0.95] as the retention metric (1000 val images), and rolls back any pass that fails the 95% retention floor on full verification. It was run separately on each learned parameter of dim20 plus a joint-magnitude variant that ranks every weight in the head against every other.
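The bisection idea can be sketched with a toy score function (the real prune.py scores with pycocotools mAP@[0.5:0.95] on 1000 val images and uses a 95%-of-baseline retention floor; names here are illustrative). Bisection is valid under the assumption that zeroing more small-magnitude weights never raises the score:

```python
# Toy magnitude-bisection pruner: find the largest count of smallest-
# magnitude weights that can be zeroed while score_fn stays >= floor.
# Assumes the score is monotone non-increasing in the prune count, which
# is what makes binary search over the sorted weight list valid.
def bisect_prune(weights, score_fn, floor):
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))

    def pruned(k):
        """Copy of weights with the k smallest-magnitude entries zeroed."""
        out = list(weights)
        for i in order[:k]:
            out[i] = 0.0
        return out

    lo, hi = 0, len(weights)          # lo always passes the floor
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if score_fn(pruned(mid)) >= floor:
            lo = mid                  # mid weights can be zeroed
        else:
            hi = mid - 1              # too aggressive, back off
    return pruned(lo), lo
```

Each bisection probe costs one full evaluation, so the search needs only O(log n) mAP evaluations per parameter tensor rather than one per candidate sparsity level.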

The leading Pareto point is R1 (project layer 25.7% sparse, 18,121 nonzero, 3.9 mAP): the same mAP as unpruned dim20 with 18% fewer effective parameters, and the highest mAP-per-10K-parameters ratio in the table at 2.15. R2 pushes to 26.6% project sparsity (17,988 nonzero) at a small mAP cost (3.8). Per-parameter slack measurements:

  • project.weight: 26.6% prunable (the sparsity that produced R1/R2)
  • cls_weight: 37% prunable in isolation, 3.8 mAP at 21,480 nonzero
  • reg_hidden.weight: 17% prunable, 3.8 mAP at 22,021 nonzero
  • reg_out.weight: 12% prunable, 3.8 mAP at 22,068 nonzero
  • ctr_weight: 90% prunable (only 2 of 20 centerness weights load-bear), 3.7 mAP at 22,058 nonzero

Greedy stacking of cls_weight pruning on top of R1 reaches 17,406 nonzero but drops to 3.5 mAP, revealing an interaction between the parameters: the cls_weight slack measured against unpruned dim20 partly comes from compensating for the surviving project subspace, so removing it after pruning project costs more mAP than the per-parameter measurement suggested. Joint magnitude pruning across all 22K head weights (starting from R1) finds 17,129 nonzero at 3.6 mAP, the smallest dim20 found, but it does not Pareto-dominate R1: the bisection's 1000-image mAP proxy was systematically optimistic relative to the full 5000-image eval, so the 95% retention floor measured during pruning admitted more aggressive cuts than the full eval would have accepted. R1 remains the leading point of the dim20 pruning Pareto front.

Training recipe

| Variant | Script | Epochs | Batch | Optimizer | LR | Schedule | Initialization |
|---|---|---|---|---|---|---|---|
| linear (linear_70k) | cache_and_train_fast.py (legacy train.py) | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | random |
| box32 (box32_92k) | cache_and_train_fast.py (legacy train.py) | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | random |
| box32 pruned R1/R2/R3 | prune.py (legacy TP-retention pruner) | - | - | - | - | - | from box32 checkpoint |
| dim20 | cache_and_train_fast.py --dim 20 | 8 | 64 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes |
| dim15 | cache_and_train_fast.py --dim 15 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim10 | cache_and_train_fast.py --dim 10 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim5 | cache_and_train_fast.py --dim 5 | 8 | 128 | AdamW (wd 1e-4) | 1e-3 | cosine, 3% warmup | SVD of pruned R2 prototypes + analytical least-squares cls init from cached features |
| dim20 pruned R1/R2 | prune.py --param project.weight | - | - | - | - | - | from dim20 checkpoint, mAP-driven bisection pruner |

The original box32 pruning sequence used an early TP-retention pruner that was later shown to be too lenient (it accepted weight reductions that increased TP via false positives). The dim20 pruning was done with the rewritten prune.py that scores by full pycocotools mAP, bisects over the magnitude-sorted weight list, and rolls back any pass that fails the 95% mAP retention floor on the full 1000-image validation subset.

Checkpoints

Trained weights and per-checkpoint pycocotools eval JSONs live in heads/cofiber_threshold/<variant>/ (the <variant> folder name is the simple variant identifier: linear, box32, dim5, dim10, dim15, dim20, etc.). The same files are mirrored to phanerozoic/threshold-cofiber-detection via scripts/build_threshold_repo.py.

Repository Structure

Each head is a self-contained folder under heads/ with its own implementation and documentation. Shared loss functions are in losses/, shared utilities in utils/. The arena scripts (arena.py, multi_domain_arena.py) run any head by name against cached backbone features.

License

The arena framework and all candidate architectures are released under the Apache 2.0 license. The framework is backbone-agnostic; users are responsible for complying with the license terms of whatever backbone they use to generate cached features.
