Stage 0: Baseline 1-Parameter Classifier
Image-level person classifier on frozen EUPE-ViT-B features. One free scalar parameter.
Classifier
Given a 768-pixel input image, run it through EUPE-ViT-B, take the 2304 patch tokens from the final layer, apply layer normalization across the 768-channel axis, and max-pool across patches to obtain a single 768-D vector per image. The classifier reads 40 of those 768 dimensions: 20 person-positive and 20 person-negative. It sums the positives, subtracts the sum of the negatives, and compares the result against one learned threshold.
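# pseudocode (PyTorch-style; layer_norm as in torch.nn.functional)
patches = backbone(image)["x_norm_patchtokens"]  # (2304, 768) final-layer patch tokens
normed = layer_norm(patches, (768,))  # normalize each token over its 768 channels
pooled = normed.max(dim=0).values  # max over the patch axis -> (768,)
score = pooled[pos_dims].sum() - pooled[neg_dims].sum()  # scalar
pred = score > threshold  # bool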
All parameters (pos_dims, neg_dims, threshold) are stored in classifier.json.
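The steps above can be sketched end to end in NumPy. This is a minimal illustration, not the shipped implementation: it assumes the patch tokens are already available as a (num_patches, 768) array, that the layer norm has no learned affine terms, and that classifier.json uses the key names pos_dims, neg_dims, and threshold exactly as written. The helper names (layernorm, classify, load_params) and the toy parameters are hypothetical.

```python
import json
import numpy as np

def layernorm(x, eps=1e-6):
    """Normalize each token over its channel axis (no learned affine; an assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def classify(patches, params):
    """patches: (num_patches, 768) array of final-layer patch tokens."""
    pooled = layernorm(patches).max(axis=0)  # max over patches -> (768,)
    score = pooled[params["pos_dims"]].sum() - pooled[params["neg_dims"]].sum()
    return bool(score > params["threshold"]), float(score)

def load_params(path="classifier.json"):
    # Assumed layout: {"pos_dims": [...], "neg_dims": [...], "threshold": ...}
    with open(path) as f:
        return json.load(f)

# Toy usage with random tokens and made-up parameters (not the real ones).
rng = np.random.default_rng(0)
toy_patches = rng.normal(size=(2304, 768))
toy_params = {"pos_dims": list(range(20)),
              "neg_dims": list(range(20, 40)),
              "threshold": 0.0}
pred, score = classify(toy_patches, toy_params)
```

With the real artifact, `classify(patches, load_params())` would replace the toy parameters, provided the JSON layout matches the assumption above.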
Evaluation
F1 = 0.889, precision = 0.901, recall = 0.876 on 5000 COCO val 2017 images, measured through the live Argus forward pass at 768-pixel input.
See eval.json.
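As a quick consistency check on these numbers, F1 is the harmonic mean of precision and recall. The sketch below uses hypothetical helper names; the underlying confusion counts are not given here, so only the rounded precision and recall are plugged in.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Plug in the reported (rounded) precision and recall.
f1_check = f1_from_pr(0.901, 0.876)
print(f"{f1_check:.4f}")  # 0.8883
```

The result, about 0.888, agrees with the reported 0.889 once you allow for precision and recall themselves being rounded to three decimals.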
How the dim selection was discovered
Three-step process, documented in the artifacts below.
cojoint_discovery.json: Sampled 100,000 random 92-dim subsets of the 768-D EUPE-ViT-B feature space, trained a ridge classifier on each, kept the top 1% (1,000 subsets), and counted how often each dim occurred across the kept cohort. Dim 48 appeared in 100% of the top-1000 subsets. The next strongest (dim 525) appeared in 31%.
characterization.json: Five analyses of dim 48 specifically. (1) F1 versus K: dim 48 alone reaches F1 = 0.83. (2) Activation distribution for person-positive versus person-negative images: Cohen's d = 1.98. (3) Per-class activation delta for each of the 80 COCO categories. (4) Pairwise correlation among the top-10 most frequent dims: max |r| = 0.57, so they are mostly independent. (5) Spatial IoU of dim-48 peak activations against ground-truth person boxes: mean IoU = 0.17, so dim 48 is a scene-level signal, not a pixel-level localizer.
compressed_variants.json: Leaderboard of 20+ classifier variants ranging from 1 to 769 free parameters. The top-ranked variant, ternary ±1 weights on 50 positive plus 50 negative dims, reaches F1 = 0.893. The 20+20 variant chosen for Stage 0 is the same recipe at a smaller footprint, with F1 = 0.881 cached / 0.889 live.
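The random-subset search in the first step can be sketched at toy scale as follows. This is an illustration under stated assumptions, not the code that produced the artifact: the ridge fit is a closed-form regression on ±1 labels, subsets are ranked by held-out accuracy (the actual ranking metric is not stated here), and the data is synthetic with a single informative dim. All function names are hypothetical.

```python
import numpy as np

def ridge_accuracy(X_tr, y_tr, X_te, y_te, lam=1.0):
    """Fit closed-form ridge regression on +/-1 labels; return held-out accuracy."""
    Xb = np.hstack([X_tr, np.ones((len(X_tr), 1))])  # append bias column
    w = np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y_tr)
    pred = np.sign(np.hstack([X_te, np.ones((len(X_te), 1))]) @ w)
    return float((pred == y_te).mean())

def subset_frequency(X_tr, y_tr, X_te, y_te, n_subsets, subset_size, keep_frac, seed=0):
    """Score random dim subsets, keep the best fraction, count dim occurrences."""
    rng = np.random.default_rng(seed)
    n_dims = X_tr.shape[1]
    subsets, scores = [], []
    for _ in range(n_subsets):
        dims = rng.choice(n_dims, size=subset_size, replace=False)
        subsets.append(dims)
        scores.append(ridge_accuracy(X_tr[:, dims], y_tr, X_te[:, dims], y_te))
    n_keep = max(1, int(keep_frac * n_subsets))
    top = np.argsort(scores)[::-1][:n_keep]  # indices of the best-scoring subsets
    counts = np.zeros(n_dims, dtype=int)
    for i in top:
        counts[subsets[i]] += 1
    return counts / n_keep  # per-dim frequency within the kept cohort

# Toy data: 32 dims, only dim 5 carries the label signal.
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=400)
X = rng.normal(size=(400, 32))
X[:, 5] += 2.0 * y
freq = subset_frequency(X[:300], y[:300], X[300:], y[300:],
                        n_subsets=300, subset_size=4, keep_frac=0.05)
```

On this toy data the informative dim should dominate `freq`, which mirrors how dim 48 surfaces in the kept cohort. The real run would use `n_subsets=100_000`, `subset_size=92`, and `keep_frac=0.01` on the 768-D features.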
Interpretation
Dim 48 is the canonical anthropogenic-scene axis in EUPE-ViT-B. It activates strongly on person scenes and on person-associated objects (sports equipment, wearable accessories, handheld items), and is suppressed on non-human animals and non-anthropogenic structures. Alone it delivers F1 = 0.83 as a 2-parameter classifier. The additional 39 dims stack on mostly orthogonal axes to reach F1 = 0.89 at 1 free parameter.
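The separation of dim 48 between person and non-person images is summarized in characterization.json as Cohen's d = 1.98. For readers unfamiliar with the statistic, here is a minimal sketch of the common pooled-standard-deviation form. Whether the artifact used this exact variant is an assumption, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical pooled activations, for illustration only.
person = [2.0, 4.0, 6.0]
no_person = [1.0, 3.0, 5.0]
d = cohens_d(person, no_person)
print(d)  # 0.5
```

A value near 2 means the two activation distributions sit about two pooled standard deviations apart.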
Prop-specificity audit
Artifacts prop_specificity_audit.json and prop_audit_image_list.json record a later audit of how well the classifier distinguishes a person scene from a scene that contains person-associated objects but no person. 8,479 ImageNet training images, drawn from 20 person-associated synsets (racket, tie, backpack, helmet, ski, remote, musical instruments, etc.), were filtered with YOLO26l at confidence 0.25 to retain only images where no person was detected. The Stage 0 classifier fires on 5.9% of those person-free images at its operating threshold. That rate is low enough to suggest that the "fires on tennis rackets and backpacks" concern raised by the per-class activation analysis is mostly a co-occurrence artifact in COCO, where tennis racket images almost always contain the player. Adding up to 15 prop-specific negative dims lowers the false-positive rate to 2.7% but costs F1 on the primary person task.
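The audit logic reduces to a filter followed by a firing-rate count. The sketch below shows that shape with made-up records. The detection record format, the helper names, and the use of `>=` at the confidence cutoff are all assumptions, and no real detector API is called.

```python
def person_free(detections, conf_threshold=0.25):
    """True if no 'person' detection reaches the confidence threshold."""
    return not any(d["label"] == "person" and d["conf"] >= conf_threshold
                   for d in detections)

def firing_rate(scores, threshold):
    """Fraction of images whose classifier score exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

# Hypothetical records: (detector output, Stage 0 score) per image.
records = [
    ([{"label": "tennis racket", "conf": 0.9}], 1.2),
    ([{"label": "person", "conf": 0.8}], 3.0),    # dropped: person detected
    ([{"label": "backpack", "conf": 0.6}], -0.4),
    ([{"label": "person", "conf": 0.1}], -1.0),   # kept: below the 0.25 cutoff
    ([], -2.0),
]
kept_scores = [score for dets, score in records if person_free(dets)]
fp_rate = firing_rate(kept_scores, threshold=0.0)
print(len(kept_scores), fp_rate)  # 4 0.25
```

In the actual audit, the same rate computed over the retained images is the 5.9% figure quoted above.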
Hardware footprint estimate
At INT8 precision, the classifier synthesizes to an estimated 2,500–4,100 gates: two Wallace-tree adders (50-input each), one subtractor, one comparator. For reference, a 768-dim INT8 MAC unit is roughly 65,000 gates, and the prior 4,614-parameter multi-output person detector synthesizes to roughly 391,000 gates.
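To put these figures in proportion, the reduction factors implied by the estimates can be computed directly. This is plain arithmetic on the numbers quoted above, and the variable names are illustrative.

```python
classifier_gates = (2_500, 4_100)  # estimated range for the Stage 0 classifier
mac_768_gates = 65_000             # 768-dim INT8 MAC unit (approximate)
prior_detector_gates = 391_000     # prior 4,614-parameter detector (approximate)

def reduction_range(reference, estimate):
    """Return (min, max) reduction factor of `reference` over an estimate range."""
    low, high = estimate
    return reference / high, reference / low

mac_ratio = reduction_range(mac_768_gates, classifier_gates)
detector_ratio = reduction_range(prior_detector_gates, classifier_gates)
print(f"vs MAC unit: {mac_ratio[0]:.1f}x to {mac_ratio[1]:.1f}x")
print(f"vs prior detector: {detector_ratio[0]:.1f}x to {detector_ratio[1]:.1f}x")
```

Under these estimates the classifier is roughly 16x to 26x smaller than the MAC unit and roughly 95x to 156x smaller than the prior detector.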