# Stage 1: Output-Channel Pruning

Path A implementation: runtime slicing with classifier fusion. Zero behavior drift from Stage 0.

## What changed

The backbone emits a 768-D vector per token. Stage 1 wraps that output so downstream code sees only the 100 dimensions the classifier reads. The classifier is fused into a single `Linear(100, 1)` layer with fixed ternary {+1, -1} weights and one free parameter: the bias, which holds the negated decision threshold.

```python
from model import Stage1PersonClassifier

# Load the Stage 1 wrapper from the published Argus checkpoint.
model = Stage1PersonClassifier.from_pretrained_argus('phanerozoic/argus')

# image_tensor is a preprocessed image batch of size B.
score, pred = model(image_tensor)   # (B,) float, (B,) bool
```

## What did not change

Compute is identical to Stage 0. The backbone still produces all 768 output channels, the final LayerNorm still runs across all 768 dimensions, and the 668 unused channels are sliced off at the end.

This stage exists to crystallize the interface (single classifier head, single learnable scalar) before later stages actually shrink the backbone.

## Evaluation

5000 COCO val 2017 images, live Argus forward pass at 768-pixel input:

```
Stage 0 F1  0.8886
Stage 1 F1  0.8886   (exact parity, match=true)
```

See `eval.json` for precision and recall.

## What Path B would look like

Path B drops the 668 unused output rows of the last block's MLP `fc2`, reducing it from (3072 → 768) to (3072 → 100), and calibrates the final LayerNorm using fixed per-channel statistics collected from a corpus. This saves roughly 2.05M backbone parameters (3072 × 668 weights plus 668 biases, about 2.4% of 85M). Expected F1 drift is small (<0.02) but non-zero because of the LayerNorm approximation.

Path B is not implemented in this stage; it belongs in a later iteration or a Stage 1B branch if pursued.
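
The `fc2` arithmetic for Path B can be checked directly from the shapes quoted above. The sketch below is a minimal parameter count, not code from this repository: it assumes `fc2` is a standard dense layer with a bias term, takes 85M as the approximate backbone size, and uses a hypothetical helper named `linear_param_count`. It counts only `fc2` weights and biases, so any other savings are not included.

```python
# Shapes taken from the text: fc2 maps 3072 hidden units to 768 outputs,
# and Path B would keep only the 100 outputs the classifier reads.
HIDDEN_DIM = 3072
FULL_OUT = 768
KEPT_OUT = 100
BACKBONE_PARAMS = 85_000_000  # approximate backbone size quoted in the text


def linear_param_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """Hypothetical helper: parameters in a dense layer.

    One weight per (input, output) pair, plus one bias per output if present.
    """
    return in_features * out_features + (out_features if bias else 0)


# Assumption: fc2 carries a bias term.
full = linear_param_count(HIDDEN_DIM, FULL_OUT)
pruned = linear_param_count(HIDDEN_DIM, KEPT_OUT)
saved = full - pruned

print(f"fc2 full:   {full:,}")
print(f"fc2 pruned: {pruned:,}")
print(f"saved:      {saved:,} ({saved / BACKBONE_PARAMS:.1%} of the backbone)")
```

Under these assumptions the full layer holds about 2.36M parameters and removing 668 output rows frees about 2.05M of them. Row slicing alone leaves the kept channels unchanged; the expected drift comes from the LayerNorm calibration step.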