Stage 4: Specialist Backbone

Train a compact student to reproduce the 40 target dimensions (20 person-positive + 20 person-negative) that the Stage 0 classifier reads out of EUPE-ViT-B. The student takes the same 768×768-pixel input as the teacher and emits one 40-D vector per image. Composed with the Stage 0 ternary classifier, this gives a full person-detection pipeline of 3.27M parameters, versus 85.64M for the teacher pipeline.

Student architecture

Compact ViT, defined in student.py:

  • Patch size 16, input 768 px → 48×48 = 2304 tokens
  • Embed dim 192, depth 6 blocks, 3 heads per block, MLP ratio 4
  • Final LayerNorm → max-pool over patches → Linear(192, 40)
  • Total: 3,267,304 parameters (3.27M)
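
The stated total can be reproduced by hand. The sketch below assumes a standard ViT layout that the list above does not spell out: a biased 16×16 patch projection, a learned positional embedding with one vector per patch and no class token, biased qkv and output projections, and two LayerNorms per block. Under those assumptions the count matches 3,267,304 exactly, which suggests (but does not prove) that student.py is laid out this way. The function name `vit_param_count` is hypothetical and is not part of the shipped code.

```python
def vit_param_count(img_size=768, patch=16, dim=192, depth=6, mlp_ratio=4,
                    out_dim=40, in_chans=3):
    """Count parameters of a plain ViT with a linear head (assumed layout).

    The number of attention heads does not appear: splitting dim into
    heads changes no weight shapes.
    """
    tokens = (img_size // patch) ** 2                      # 48 * 48 = 2304
    patch_embed = in_chans * patch * patch * dim + dim     # projection + bias
    pos_embed = tokens * dim                               # assumed: learned, no class token
    hidden = mlp_ratio * dim
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # qkv + output projection
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)   # two linear layers
    norms = 2 * (2 * dim)                                  # two LayerNorms (weight + bias)
    block = attn + mlp + norms
    final_norm = 2 * dim
    head = dim * out_dim + out_dim                         # Linear(192, 40)
    total = patch_embed + pos_embed + depth * block + final_norm + head
    return tokens, total


tokens, total = vit_param_count()
print(tokens, total)  # 2304 3267304
```

The same function can size other configurations. Under the same assumptions, depth 8 with dim 256 comes to 7,115,560 parameters.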

Training recipe

  • Data: COCO train 2017 (117,266 images), resized to 768×768, ImageNet normalization
  • Teacher targets: pre-computed by prepare_targets.py from the existing EUPE-ViT-B feature cache
  • Loss: MSE on the 40-D output
  • Optimizer: AdamW, lr 3e-4, weight decay 1e-4
  • Schedule: cosine with 3% warmup
  • Batch 16, 10 epochs, bfloat16 autocast
  • Wall time: 26 minutes on one RTX 6000 Ada
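
To make the schedule concrete, here is a minimal sketch of a cosine schedule with 3% warmup at the recipe's batch size and epoch count. Several details are assumptions, since the recipe does not state them: the warmup is linear, the schedule is stepped once per optimizer step, the learning rate decays to zero, and the last partial batch of each epoch is kept. The helper name `lr_at` is hypothetical; train.py may implement this differently.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_frac=0.03):
    """Learning rate at a 0-indexed optimizer step.

    Assumed shape: linear warmup to base_lr, then cosine decay to zero.
    """
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


steps_per_epoch = math.ceil(117_266 / 16)    # 7330 if the partial batch is kept
total_steps = 10 * steps_per_epoch           # 73300
warmup_steps = round(total_steps * 0.03)     # 2199

for step in (0, warmup_steps, total_steps // 2, total_steps - 1):
    print(step, f"{lr_at(step, total_steps):.3e}")
```

With these numbers the warmup covers roughly the first 30% of epoch 1, so the learning rate is still near its 3e-4 peak through epoch 3.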

Result

F1 = 0.717 at epoch 3, which is the checkpoint shipped as student_final.safetensors. The training loop evaluated after every epoch, and epoch 3 was the peak; F1 drifts slightly downward in later epochs as the loss saturates.

epoch   loss   F1      P      R
  1    2.25   0.707  0.55  1.00
  3    2.01   0.717  0.57  0.97   <- shipped
  5    2.00   0.712  0.56  0.98
 10    1.99   0.710  0.57  0.95

The loss plateaus around 2.0 after epoch 2. Precision stays between 0.55 and 0.57 across all epochs while recall stays at or above 0.95: the student learns "when in doubt, call it person", so it rarely misses a true positive but fires falsely on about half of the person-negatives.
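
As a consistency check, F1 can be recomputed from the precision and recall columns of the table above. P and R are only given to two decimals, so the recomputed values are expected to agree with the reported F1 to within rounding (about 0.003 here), not exactly. The function name `f1_score` is just an illustrative helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, reported F1, P, R), copied from the results table
rows = [
    (1, 0.707, 0.55, 1.00),
    (3, 0.717, 0.57, 0.97),
    (5, 0.712, 0.56, 0.98),
    (10, 0.710, 0.57, 0.95),
]

for epoch, reported, p, r in rows:
    recomputed = f1_score(p, r)
    print(f"epoch {epoch:2d}: recomputed {recomputed:.4f}, reported {reported:.3f}")
```

Because recall is pinned near 1.0, F1 here is almost entirely a function of precision, which is why the small precision changes between epochs account for nearly all of the F1 movement.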

Comparison

model                                params       F1 on COCO val
Stage 0 (EUPE-ViT-B + classifier)    85.64M       0.894
Stage 2 K=10 head prune + classifier 83.67M       0.916
Stage 4 student + classifier          3.27M       0.717

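The student pipeline is 26× smaller than the full EUPE-ViT-B pipeline (3.27M vs 85.64M parameters), at the cost of an F1 drop of 0.18 from the Stage 0 baseline (0.894 to 0.717). This shows that the 40-D target manifold is learnable by a compact specialist, but the student is not yet a drop-in replacement for the teacher.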
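
The headline figures follow directly from the comparison table. A short check, using only the numbers reported there (the dictionary keys are shortened labels chosen for this example):

```python
# (params in millions, F1 on COCO val), copied from the comparison table
models = {
    "stage0": (85.64, 0.894),
    "stage2_k10": (83.67, 0.916),
    "stage4_student": (3.27, 0.717),
}

student_params, student_f1 = models["stage4_student"]
for name in ("stage0", "stage2_k10"):
    params, f1 = models[name]
    ratio = params / student_params   # how many times larger than the student
    drop = f1 - student_f1            # F1 given up by switching to the student
    print(f"{name}: {ratio:.1f}x larger, F1 drop {drop:.3f}")
```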

What this stage ships

  • student.py: architecture definition
  • prepare_targets.py: builds teacher target tensor from ViT-B feature cache
  • train.py: distillation loop
  • student_final.safetensors: trained student weights (epoch 3 checkpoint, peak F1)
  • student_ep*.safetensors: per-epoch checkpoints
  • training_log.json: loss + F1 per epoch

Where the F1 gap comes from

The student converged on a high-recall / low-precision operating mode. Three likely contributors:

  1. Capacity mismatch. Going from 85.6M to 3.27M is an aggressive compression ratio. The EUPE teacher itself was distilled from a 1.9B proxy, then again to 86M; a further order-of-magnitude compression to 3M is harder than it looks.
  2. Data limit. A training set of 117K COCO images is narrow. The teacher was trained on LVD-1689M (1.7B images) and absorbed much broader scene statistics. Retraining on a larger image corpus (ImageNet-1k or OpenImages) would likely help.
  3. Loss choice. MSE on 40 dimensions is a coarse target. Cosine similarity or a reconstruction-plus-contrastive loss over the teacher's full 768-D pooled vector would preserve more structure.
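
To illustrate what the two losses in item 3 measure, the toy sketch below compares MSE and a cosine loss (1 minus cosine similarity) on small hand-made vectors. The vectors and function names are illustrative assumptions and are not taken from train.py; real targets would be 40-D or 768-D teacher outputs.

```python
import math


def mse(pred, target):
    """Mean squared error over the vector dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity: zero when the two vectors point the same way."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / max(norm_p * norm_t, eps)


target = [3.0, -1.0, 2.0, 0.5]        # toy stand-in for a teacher vector
scaled = [0.5 * t for t in target]    # right direction, wrong magnitude
flipped = [3.0, -1.0, -2.0, 0.5]      # one coordinate has the wrong sign

for name, pred in (("scaled", scaled), ("flipped", flipped)):
    print(f"{name}: mse={mse(pred, target):.4f}, cosine={cosine_loss(pred, target):.4f}")
```

MSE penalizes the correctly oriented but shrunken prediction, while the cosine loss treats it as a perfect match and penalizes only the change in direction. Whether that trade-off helps here depends on how much the downstream classifier relies on magnitude, which this sketch does not address.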

Natural next iteration (Stage 4B)

  1. Scale the student to 10–15M params (depth 8, dim 256).
  2. Add ImageNet as a second training corpus once its cache finishes.
  3. Switch the loss to cosine similarity on the teacher's full 768-D pooled output, and project to 40-D only at inference.
  4. Longer schedule: 30 epochs.

Target: bring F1 above 0.85 at ≤15M total specialist-pipeline parameters.