# Stage 4C: Direct Classifier-Score Supervision

Stage 4C keeps the 3.27 M-parameter student and the 40-D output from Stage 4 and changes only the loss:

```python
student_score = student_out[pos_dims].sum() - student_out[neg_dims].sum()
teacher_score = teacher_target[pos_dims].sum() - teacher_target[neg_dims].sum()
loss = (student_score - teacher_score) ** 2
```

The student is optimized to match the teacher's scalar classifier output, not the 768-D feature vector (Stage 4B) or the 40 individual dims (Stage 4).
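As a minimal sketch of how this loss behaves on a batch, the plain-Python version below computes one score per sample and averages the squared errors. The helper names, the toy 4-D vectors, the index sets, and the batch averaging are all illustrative assumptions; the actual `train.py` presumably works on tensors with the real 40-D outputs.

```python
from typing import Sequence


def classifier_score(out: Sequence[float],
                     pos_dims: Sequence[int],
                     neg_dims: Sequence[int]) -> float:
    """Sum of the positive dims minus the sum of the negative dims."""
    return sum(out[i] for i in pos_dims) - sum(out[i] for i in neg_dims)


def scalar_score_mse(student_batch, teacher_batch, pos_dims, neg_dims) -> float:
    """Batch mean of (student_score - teacher_score) ** 2 (averaging is assumed)."""
    errors = [
        (classifier_score(s, pos_dims, neg_dims)
         - classifier_score(t, pos_dims, neg_dims)) ** 2
        for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(errors) / len(errors)


# Toy 4-D vectors (hypothetical); the real outputs are 40-D.
pos_dims, neg_dims = [0, 1], [2, 3]
student_batch = [[1.0, 2.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0]]
teacher_batch = [[2.0, 2.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]

loss = scalar_score_mse(student_batch, teacher_batch, pos_dims, neg_dims)
print(loss)  # 0.5
```

The second sample contributes zero error even though every dim differs from the teacher, because both scores come out to 0. Only the sum-difference is constrained, which is the point of this loss and also what distinguishes it from the per-dim MSE of Stage 4.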

## Result

```
Stage   Student params   Loss                              F1      Threshold   Checkpoint
 4       3.27 M          MSE on 40-D per-dim              0.717    26.4        ep3
 4B     15.67 M          cosine on 768-D                  0.726   165.9        ep10  (scale drifted)
 4C      3.27 M          MSE on scalar sum-difference     0.734    25.0        ep10  (matches teacher 25.3)
 0      85.64 M (ViT-B)  baseline                         0.889    25.3        —
```

`student_final.safetensors` is the epoch 10 checkpoint. Its threshold of 25.04 lands within 0.3 of the teacher's 25.28, the cleanest scale calibration across the three student variants. Epoch 15 drifts down to F1 0.729 with threshold 25.84. Epoch 8 actually peaked at F1 0.740 (precision 0.627, recall 0.904), but it fell outside the every-5-epochs checkpoint schedule and was not saved.
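The thresholds quoted here read as the score cut-off that gives the best F1, although this README does not spell out how they were picked. Under that assumption, a threshold sweep could be sketched as follows; the function names and the toy scores and labels are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every sample with score >= threshold is predicted positive."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    if tp == 0:
        return 0.0
    # Equivalent to 2PR / (P + R), written in terms of raw counts.
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off and keep the best-F1 one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy data (hypothetical): classifier scores and ground-truth labels.
scores = [10.0, 20.0, 24.0, 26.0, 30.0, 40.0]
labels = [False, False, True, False, True, True]

threshold = best_threshold(scores, labels)
print(threshold, round(f1_at_threshold(scores, labels, threshold), 3))  # 24.0 0.857
```

When the student's score scale matches the teacher's, the two sweeps should land on nearby thresholds, which is the sense in which 25.04 versus 25.28 indicates good calibration.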

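F1 improved by 0.008 over Stage 4B (0.726 → 0.734) and by 0.017 over Stage 4 (0.717 → 0.734). All three student experiments plateau at F1 0.72-0.73, with high recall (≥0.95) and precision around 0.58 through most of training. The student settles on an "over-fire" operating point, catching nearly every positive at the cost of many false positives, and none of the three loss formulations fully fixes it.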
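A quick arithmetic check, using the approximate precision and recall figures quoted above, shows why this operating point caps F1 and how much precision would have to rise. The helper names are illustrative, and holding recall fixed at 0.95 is an assumption made only for this calculation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_for_target_f1(target_f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P at a fixed recall R."""
    return target_f1 * recall / (2 * recall - target_f1)


plateau_f1 = f1_score(0.58, 0.95)          # the plateau described above
perfect_recall_f1 = f1_score(0.58, 1.0)    # same precision, recall pushed to 1.0
needed_precision = precision_for_target_f1(0.889, 0.95)  # to match the baseline

print(round(plateau_f1, 3))         # 0.72
print(round(perfect_recall_f1, 3))  # 0.734
print(round(needed_precision, 3))   # 0.835
```

At precision 0.58, even perfect recall yields only about 0.734, so further recall gains cannot close the gap. Reaching the baseline's 0.889 at recall 0.95 would need precision near 0.835.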

## What this says

The bottleneck is not loss choice or target geometry but the student's ability to learn the underlying scene-level signal at this scale. Closing the F1 gap to baseline 0.889 at the 3 M parameter tier probably requires:

- Stronger image augmentation (mosaic, color jitter, rand-augment)
- Warm-starting from a pre-trained backbone (EUPE-ViT-T already distilled) rather than from scratch
- More training data beyond COCO-only (117 K images is tight for a specialist from scratch)

Neither parameter scaling alone (Stage 4B, 15.67 M) nor reshaping the loss alone (Stage 4C) closes the gap. Data and initialization are the remaining knobs.

## Files

- `train.py` — training loop (direct scalar MSE)
- `student_ep{5,10,15}.safetensors` — intermediate checkpoints
- `student_final.safetensors` — final weights
- `training_log.json` — per-epoch loss + F1

Uses the same `student.py` as Stage 4.
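The every-5-epochs schedule behind `student_ep{5,10,15}.safetensors` is why the epoch 8 peak was lost. One possible remedy, sketched below as a suggestion and not as something `train.py` does, is to also save whenever validation F1 sets a new best. The function name and the `every` parameter are hypothetical; the F1 values are the only three per-epoch figures this README reports.

```python
def checkpoint_reasons(epoch: int, f1: float, best_f1: float, every: int = 5):
    """Return why this epoch should be saved (empty list means skip it)."""
    reasons = []
    if epoch % every == 0:
        reasons.append("periodic")
    if f1 > best_f1:
        reasons.append("best")
    return reasons


# Partial replay: only epochs 8, 10 and 15 have an F1 reported in this README.
reported_f1 = {8: 0.740, 10: 0.734, 15: 0.729}

best_f1 = float("-inf")
saved = {}
for epoch in sorted(reported_f1):
    f1 = reported_f1[epoch]
    reasons = checkpoint_reasons(epoch, f1, best_f1)
    if reasons:
        saved[epoch] = reasons
    best_f1 = max(best_f1, f1)

print(saved)  # {8: ['best'], 10: ['periodic'], 15: ['periodic']}
```

In a full run the "best" save would overwrite a single file each time the record improves, so the extra disk cost is one checkpoint.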