# Stage 4C: Direct Classifier-Score Supervision
Same 3.27 M student as Stage 4. Same 40-D output. Different loss:
```python
student_score = student_out[pos_dims].sum() - student_out[neg_dims].sum()
teacher_score = teacher_target[pos_dims].sum() - teacher_target[neg_dims].sum()
loss = (student_score - teacher_score) ** 2
```
The student is optimized to match the teacher's scalar classifier output, not the 768-D feature vector (Stage 4B) or the 40 individual dims (Stage 4).
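The snippet above reads as a single-sample computation. A minimal batched sketch in PyTorch follows, assuming outputs of shape `(batch, 40)` and a mean reduction over the batch. The function name `score_mse_loss` and the 20/20 split of `pos_dims` and `neg_dims` are placeholders for illustration, not the split used in `train.py`.

```python
import torch


def score_mse_loss(student_out, teacher_target, pos_dims, neg_dims):
    """Squared error between scalar classifier scores, averaged over the batch.

    student_out, teacher_target: float tensors of shape (batch, 40).
    pos_dims, neg_dims: 1-D index tensors selecting output columns.
    """
    def score(x):
        # Sum of the positive dims minus sum of the negative dims, per sample.
        return x[:, pos_dims].sum(dim=1) - x[:, neg_dims].sum(dim=1)

    diff = score(student_out) - score(teacher_target)
    return (diff ** 2).mean()


# Placeholder index split (assumption): NOT the real assignment of the 40 dims.
pos_dims = torch.arange(0, 20)
neg_dims = torch.arange(20, 40)

torch.manual_seed(0)
student_out = torch.randn(8, 40, requires_grad=True)  # stand-in for student(x)
teacher_target = torch.randn(8, 40)                   # stand-in for teacher targets

loss = score_mse_loss(student_out, teacher_target, pos_dims, neg_dims)
loss.backward()
```

Because only the sum-difference is penalized, every positive dim of a sample receives the same gradient and every negative dim receives its negation. Individual dims are therefore free to drift as long as the scalar score matches.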
## Result
```
Stage  Params           Loss                          F1     Threshold  Checkpoint
4      3.27 M           MSE on 40-D per-dim           0.717  26.4       ep3
4B     15.67 M          cosine on 768-D               0.726  165.9      ep10 (scale drifted)
4C     3.27 M           MSE on scalar sum-difference  0.734  25.0       ep10 (matches teacher 25.3)
0      85.64 M (ViT-B)  baseline                      0.889  25.3       -
```
Shipped as `student_final.safetensors`, which is the epoch 10 checkpoint. The epoch 10 threshold of 25.04 lands within 0.3 of the teacher's 25.28, the cleanest scale calibration of the three student variants. By epoch 15, F1 drifts down to 0.729 and the threshold rises to 25.84. Epoch 8 reached a higher F1 of 0.740 (precision 0.627, recall 0.904), but it fell outside the every-5-epochs checkpoint schedule and was not saved.
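To make the calibration comparison concrete, the sketch below computes each variant's distance from the teacher threshold using only the numbers quoted in this README. The dictionary keys are illustrative labels, and the Stage 4 and 4B thresholds are taken at the one-decimal precision the table reports.

```python
# Teacher (Stage 0) decision threshold, as quoted above.
TEACHER_THRESHOLD = 25.28

# Best-F1 thresholds reported for each student variant in this README.
student_thresholds = {
    "4 (ep3)": 26.4,
    "4B (ep10)": 165.9,
    "4C (ep10)": 25.04,
    "4C (ep15)": 25.84,
}

# Absolute gap between each student threshold and the teacher threshold.
gaps = {
    name: round(abs(thr - TEACHER_THRESHOLD), 2)
    for name, thr in student_thresholds.items()
}

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<10} gap = {gap:.2f}")
```

The 4C epoch 10 checkpoint has the smallest gap (0.24), while the cosine-trained 4B student sits more than 140 units away, which is the scale drift noted in the table.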
F1 improved by 0.008 over Stage 4B (0.726 to 0.734). All three student experiments plateau around 0.72-0.73, with high recall (≥0.95) and precision of about 0.58 through most of training. The student converges on an "over-fire" operating point, flagging too many positives, that no amount of loss-shape tuning fully fixes.
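The arithmetic behind the over-fire operating point can be checked directly from the quoted precision and recall. The sketch below uses the standard F1 definition. The helper `precision_needed` is a hypothetical illustration that assumes recall stays fixed at 0.95; the baseline's own precision and recall are not reported here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_needed(target_f1, recall):
    """Solve F1 = 2PR / (P + R) for P at a fixed recall."""
    denom = 2 * recall - target_f1
    if denom <= 0:
        raise ValueError("target F1 is unreachable at this recall")
    return target_f1 * recall / denom


# Epoch 8 of Stage 4C: precision 0.627, recall 0.904.
print(f"{f1(0.627, 0.904):.3f}")  # 0.740

# Typical plateau: precision ~0.58, recall ~0.95.
print(f"{f1(0.58, 0.95):.3f}")  # 0.720

# Precision required to reach the baseline F1 of 0.889 if recall stayed at 0.95.
print(f"{precision_needed(0.889, 0.95):.3f}")  # 0.835
```

Under that assumption, precision would have to rise from about 0.58 to about 0.84 to match the baseline, so the gap is almost entirely a false-positive problem.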
## What this says
The bottleneck is not loss choice or target geometry but the student's ability to learn the underlying scene-level signal at this scale. Closing the F1 gap to baseline 0.889 at the 3 M parameter tier probably requires:
- Stronger image augmentation (mosaic, color jitter, rand-augment)
- Warm-starting from a pre-trained backbone (EUPE-ViT-T already distilled) rather than from scratch
- More training data beyond COCO-only (117 K images is tight for a specialist from scratch)
Neither parameter scaling alone (Stage 4B) nor loss reshaping alone (Stage 4C) closes the gap. Data and initialization are the remaining knobs.
## Files
- `train.py` — training loop (direct scalar MSE)
- `student_ep{5,10,15}.safetensors` — intermediate checkpoints
- `student_final.safetensors` — final weights
- `training_log.json` — per-epoch loss + F1
Uses the same `student.py` as Stage 4.