Stage 4C: Direct Classifier-Score Supervision
Same 3.27 M student as Stage 4. Same 40-D output. Different loss:
student_score = student_out[pos_dims].sum() - student_out[neg_dims].sum()
teacher_score = teacher_target[pos_dims].sum() - teacher_target[neg_dims].sum()
loss = (student_score - teacher_score) ** 2
The student is optimized to match the teacher's scalar classifier output, not the 768-D feature vector (Stage 4B) or the 40 individual dims (Stage 4).
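A minimal NumPy sketch of the scalar loss, shown next to a per-dim MSE for contrast. The 4-D vectors and the `pos_dims` / `neg_dims` split are hypothetical toy values (the real output is 40-D and the real index sets live in the training code), and averaging over the batch is an assumption, since the formula above is written per sample.

```python
import numpy as np

def classifier_score(outputs, pos_dims, neg_dims):
    """Sum of the positive dims minus the sum of the negative dims, per row."""
    outputs = np.asarray(outputs, dtype=np.float64)
    return outputs[..., pos_dims].sum(axis=-1) - outputs[..., neg_dims].sum(axis=-1)

def score_mse(student_out, teacher_target, pos_dims, neg_dims):
    """Squared error between scalar scores, averaged over the batch (assumed)."""
    diff = (classifier_score(student_out, pos_dims, neg_dims)
            - classifier_score(teacher_target, pos_dims, neg_dims))
    return float(np.mean(diff ** 2))

def per_dim_mse(student_out, teacher_target):
    """Per-dim MSE over every output dimension, for comparison only."""
    student_out = np.asarray(student_out, dtype=np.float64)
    teacher_target = np.asarray(teacher_target, dtype=np.float64)
    return float(np.mean((student_out - teacher_target) ** 2))

# Hypothetical index split on a toy 4-D output.
pos_dims = [0, 1]
neg_dims = [2, 3]

teacher = np.array([[3.0, 1.0, 0.5, 0.5]])  # score = 4.0 - 1.0 = 3.0
student = np.array([[1.0, 3.0, 1.0, 0.0]])  # score = 4.0 - 1.0 = 3.0

scalar_loss = score_mse(student, teacher, pos_dims, neg_dims)
dim_loss = per_dim_mse(student, teacher)
print(scalar_loss, dim_loss)
```

In this toy case the scalar loss is zero while the per-dim loss is not: the scalar objective only constrains the sum-difference, so per-dim errors that cancel inside the score go unpenalized.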
Result
| Stage | Student params | Loss | F1 | Threshold | Checkpoint |
|---|---|---|---|---|---|
| 4 | 3.27 M | MSE on 40-D per-dim | 0.717 | 26.4 | ep3 |
| 4B | 15.67 M | cosine on 768-D | 0.726 | 165.9 | ep10 (scale drifted) |
| 4C | 3.27 M | MSE on scalar sum-difference | 0.734 | 25.0 | ep10 (matches teacher 25.3) |
| 0 | 85.64 M (ViT-B) | baseline | 0.889 | 25.3 | - |
Shipped as student_final.safetensors = epoch 10 checkpoint. The epoch 10 threshold of 25.04 lands within 0.3 of the teacher's 25.28, the cleanest scale calibration of the three student variants. By epoch 15, F1 slips to 0.729 and the threshold drifts up to 25.84. Epoch 8 actually hit F1 0.740 (precision 0.627, recall 0.904), but it fell outside the every-5-epochs checkpoint schedule and was not saved.
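The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below recomputes the epoch 8 F1 from its precision and recall and measures how far each saved threshold sits from the teacher's; the variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Epoch 8 precision and recall as quoted above.
epoch8_f1 = f1_score(0.627, 0.904)

# Distance of each student threshold from the teacher's threshold.
teacher_threshold = 25.28
thresholds = {"ep10": 25.04, "ep15": 25.84}
gaps = {name: abs(t - teacher_threshold) for name, t in thresholds.items()}

print(f"epoch 8 F1: {epoch8_f1:.3f}")
for name, gap in gaps.items():
    print(f"{name} threshold gap: {gap:.2f}")
```

The epoch 8 numbers reproduce F1 0.740, and the threshold gap roughly doubles between epoch 10 (0.24) and epoch 15 (0.56).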
F1 improved by 0.008 over Stage 4B (0.726 to 0.734). All three student experiments plateau around 0.72-0.73, with high recall (≥0.95) and precision ~0.58 through most of training. The student converges on an "over-fire" operating point that no amount of loss-shape tuning fully fixes.
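To put the "over-fire" operating point in numbers, F1 = 2PR / (P + R) can be solved for precision: P = F1 · R / (2R − F1). The sketch below assumes recall stays fixed at 0.95 (the lower bound quoted above) and asks what precision would be needed to reach the baseline F1. This is a hypothetical what-if, not a measured result, and the baseline's own precision and recall are not reported here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_for_f1(target_f1, recall):
    """Precision needed to hit target_f1 at a fixed recall (F1 solved for P)."""
    denom = 2 * recall - target_f1
    if denom <= 0:
        raise ValueError("target F1 is unreachable at this recall")
    precision = target_f1 * recall / denom
    if precision > 1:
        raise ValueError("target F1 would need precision above 1.0")
    return precision

# Plateau operating point quoted above; recall fixed at 0.95 by assumption.
plateau_f1 = f1_score(0.58, 0.95)
needed = precision_for_f1(0.889, 0.95)

print(f"plateau F1: {plateau_f1:.3f}")
print(f"precision needed for F1 0.889 at recall 0.95: {needed:.3f}")
```

Under that assumption, precision would have to rise from about 0.58 to about 0.84, so the gap to baseline is almost entirely a false-positive problem.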
What this says
The bottleneck is not loss choice or target geometry but the student's ability to learn the underlying scene-level signal at this scale. Closing the F1 gap to baseline 0.889 at the 3 M parameter tier probably requires:
- Stronger image augmentation (mosaic, color jitter, rand-augment)
- Warm-starting from a pre-trained backbone (EUPE-ViT-T already distilled) rather than from scratch
- More training data beyond COCO-only (117 K images is tight for a specialist from scratch)
Neither parameter scaling alone nor loss reshaping alone moves F1 by more than about 0.02. Data and initialization are the remaining knobs.
Files
- train.py: training loop (direct scalar MSE)
- student_ep{5,10,15}.safetensors: intermediate checkpoints
- student_final.safetensors: final weights
- training_log.json: per-epoch loss + F1
Uses the same student.py as Stage 4.