Stage 1: Output-Channel Pruning
Path A implementation: runtime slicing with classifier fusion. Zero behavior drift from Stage 0.
What changed
The backbone emits a 768-D vector per token. Stage 1 wraps that output so downstream code sees only the 100 dimensions the classifier reads. The classifier is fused into a single Linear(100, 1) layer with fixed weights in {+1, -1} and one free parameter: the bias, which holds the negated decision threshold.
from model import Stage1PersonClassifier
model = Stage1PersonClassifier.from_pretrained_argus('phanerozoic/argus')
score, pred = model(image_tensor) # (B,) float, (B,) bool
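To make the slicing and fusion concrete, here is a minimal sketch, not the actual Stage1PersonClassifier code. The class name SlicedFusedHead, the random keep_idx and signs, the threshold value, and the assumption that the head receives a pooled (B, 768) feature are all hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


class SlicedFusedHead(nn.Module):
    """Hypothetical stand-in: slice 768 -> 100 channels, then one Linear(100, 1)."""

    def __init__(self, keep_idx: torch.Tensor, signs: torch.Tensor, threshold: float):
        super().__init__()
        # Indices of the channels the classifier reads (fixed, never trained).
        self.register_buffer("keep_idx", keep_idx)
        self.fc = nn.Linear(keep_idx.numel(), 1)
        with torch.no_grad():
            self.fc.weight.copy_(signs.view(1, -1))  # fixed +1 / -1 weights
            self.fc.bias.fill_(-threshold)           # bias = negated threshold
        self.fc.weight.requires_grad_(False)         # only the bias stays learnable

    def forward(self, features: torch.Tensor):
        # features: (B, 768). The backbone compute upstream is unchanged.
        sliced = features[..., self.keep_idx]        # (B, 100)
        score = self.fc(sliced).squeeze(-1)          # (B,) float
        return score, score > 0                      # (B,) bool


torch.manual_seed(0)
keep_idx = torch.randperm(768)[:100].sort().values    # illustrative channel choice
signs = torch.randint(0, 2, (100,)).float() * 2 - 1   # random +1 / -1 pattern
threshold = 0.5                                        # illustrative value

head = SlicedFusedHead(keep_idx, signs, threshold)
features = torch.randn(4, 768)
score, pred = head(features)

# Reference: unsliced 768-D weight vector with zeros on the 668 unused channels.
full_w = torch.zeros(768)
full_w[keep_idx] = signs
ref_score = features @ full_w - threshold
trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
```

In this sketch the sliced head matches the unsliced reference up to floating-point rounding, and the only trainable parameter is the single bias scalar.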
What did not change
Compute: identical to Stage 0. The backbone still produces all 768 output channels, the final LayerNorm still runs across 768 dimensions, and the 668 unused channels are sliced off at the end. This stage exists to crystallize the interface (single classifier head, single learnable scalar) before later stages actually shrink the backbone.
Evaluation
5000 COCO val 2017 images, live Argus forward pass at 768-pixel input:
Stage 0 F1 0.8886
Stage 1 F1 0.8886 (exact parity, match=true)
See eval.json for precision and recall.
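For reference, F1 is the harmonic mean of precision and recall. The sketch below shows how the three numbers relate for boolean person / no-person predictions; the function name and the toy labels are illustrative and are not taken from eval.json or the evaluation script.

```python
def precision_recall_f1(preds, labels):
    """Compute precision, recall and F1 for boolean predictions and labels."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)        # true positives
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)    # false positives
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, not COCO results.
preds = [True, True, False, True, False, False]
labels = [True, False, False, True, True, False]
precision, recall, f1 = precision_recall_f1(preds, labels)
```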
What Path B would look like
Path B would drop the 668 unused output rows of the last block's MLP fc2, shrinking it from (3072 → 768) to (3072 → 100), and replace the final LayerNorm with a calibrated version that uses fixed per-channel statistics collected from a corpus. This removes about 2.05M backbone parameters (668 × 3072 weights plus 668 biases, roughly 2.4% of 85M). Expected F1 drift is small (<0.02) but non-zero because the fixed-statistics LayerNorm is only an approximation. Path B is not implemented in this stage; it belongs in a later iteration or in a Stage 1B branch if pursued.
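The LayerNorm approximation is where the drift would come from: once only 100 channels exist, the per-token mean and variance over all 768 channels can no longer be computed. The sketch below shows one possible reading of "fixed statistics", assuming the per-token mean and variance are replaced by corpus averages so that the norm collapses to a per-channel affine map. The random corpus, the placeholder keep_idx, and the variable names are illustrative assumptions, not the planned Path B implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D, K = 768, 100
keep_idx = torch.arange(K)           # placeholder for the 100 kept channels

ln = nn.LayerNorm(D)
with torch.no_grad():
    ln.weight.uniform_(0.5, 1.5)     # non-trivial gain, for illustration
    ln.bias.uniform_(-0.1, 0.1)      # non-trivial shift, for illustration

# Stand-in "corpus" of pre-LayerNorm activations (random data, not real features).
corpus = torch.randn(4096, D)

with torch.no_grad():
    # Fixed statistics: per-token mean and variance, averaged over the corpus.
    mu_bar = corpus.mean(dim=-1).mean()
    var_bar = corpus.var(dim=-1, unbiased=False).mean()

    # With frozen statistics, LayerNorm on the kept channels is y = x * scale + shift.
    scale = ln.weight[keep_idx] / torch.sqrt(var_bar + ln.eps)
    shift = ln.bias[keep_idx] - mu_bar * scale

    x = torch.randn(32, D)
    exact = ln(x)[:, keep_idx]                  # Path A: full LayerNorm, then slice
    approx = x[:, keep_idx] * scale + shift     # Path B sketch: frozen statistics

max_err = (exact - approx).abs().max().item()
```

In this sketch max_err is small but not zero, because each token's true mean and variance differ slightly from the corpus averages. That is the same mechanism behind the expected non-zero F1 drift.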