# Stage 0: Baseline 1-Parameter Classifier Image-level person classifier on frozen EUPE-ViT-B features. One free scalar parameter. ## Classifier Given a 768 pixel input image, forward through EUPE-ViT-B, take the 2304 patch tokens at the final layer, apply layernorm across the 768-channel axis, and max-pool across patches to get a single 768-D vector per image. The classifier reads 40 of those 768 dimensions: 20 person-positive, 20 person-negative. Sum the positives, subtract the negatives, compare against one learned threshold. ```python # pseudocode patches = backbone(image)["x_norm_patchtokens"] # (2304, 768) pooled = layernorm(patches, 768).max(dim=patches) # (768,) score = pooled[pos_dims].sum() - pooled[neg_dims].sum() pred = score > threshold # bool ``` All arrays (`pos_dims`, `neg_dims`, `threshold`) are in `classifier.json`. ## Evaluation F1 = 0.889, precision = 0.901, recall = 0.876 on 5000 COCO val 2017 images, measured through the live Argus forward pass at 768 pixel input. See `eval.json`. ## How the dim selection was discovered Three-step process, documented in the artifacts below. **`cojoint_discovery.json`** — Sampled 100,000 random 92-dim subsets of the 768-D EUPE-ViT-B feature space, trained a ridge classifier for each, kept the top 1%, counted dim occurrence frequency across the kept cohort. Dim 48 appeared in 100% of top-1000 subsets. Next strongest (dim 525) appeared in 31%. **`characterization.json`** — Five analyses on dim 48 specifically. F1 versus K (dim 48 alone reaches F1 = 0.83). Activation distribution for person-positive versus person-negative images (Cohen's d = 1.98). Per-class activation delta for each of 80 COCO categories. Top-10 frequent-dim pairwise correlation (max |r| = 0.57, mostly independent). Spatial IoU of dim-48 peak activations against ground-truth person boxes (mean IoU = 0.17 — dim 48 is a scene-level signal, not a pixel-level localizer). **`compressed_variants.json`** — Leaderboard of 20+ classifier variants ranging from 1 free parameter to 769. Ranked, the ternary ±1 on 50 positive plus 50 negative dims wins at F1 = 0.893. The 20+20 variant chosen for Stage 0 is the same recipe at smaller footprint with F1 = 0.881 cached / 0.889 live. ## Interpretation Dim 48 is the canonical anthropogenic-scene axis in EUPE-ViT-B. It activates strongly on person scenes and on person-associated objects (sports equipment, wearable accessories, handheld items), and is suppressed on non-human animals and non-anthropogenic structures. Alone it delivers F1 = 0.83 as a 2-parameter classifier. The additional 39 dims stack on mostly orthogonal axes to reach F1 = 0.89 at 1 free parameter. ## Prop-specificity audit Artifacts `prop_specificity_audit.json` and `prop_audit_image_list.json` record a later audit of how well the classifier distinguishes "person-scene" from "scene containing person-associated objects but no person." 8,479 ImageNet training images were filtered with YOLO26l at confidence 0.25 to retain only images where no person was detected, drawn from 20 person-associated synsets (racket, tie, backpack, helmet, ski, remote, musical instruments, etc.). The Stage 0 classifier fires on **5.9%** of those truly person-free images at its operating threshold — tight enough that the "fires on tennis rackets and backpacks" concern in the per-class activation analysis is mostly a co-occurrence artifact in COCO (tennis racket images almost always contain the player). Adding up to 15 prop-specific negative dims lowers the false-positive rate to 2.7% but costs F1 on the primary person task. ## Hardware footprint estimate At INT8 precision, the classifier synthesizes to an estimated 2,500–4,100 gates: two Wallace-tree adders (50-input each), one subtractor, one comparator. For reference, a 768-dim INT8 MAC unit is roughly 65,000 gates, and the prior 4,614-parameter multi-output person detector synthesizes to roughly 391,000 gates.