# Stage 4: Specialist Backbone
Train a compact student to reproduce the 40 target dimensions (20 person-positive + 20 person-negative) that the Stage 0 classifier reads out of EUPE-ViT-B. The student takes the same 768×768-pixel input as the teacher and emits one 40-D vector per image. Composed with the Stage 0 ternary classifier, this gives a full person-detection pipeline of **3.27M parameters**, versus 85.64M for the teacher.
## Student architecture
Compact ViT, defined in `student.py`:
- Patch size 16, input 768 px → 48×48 = 2304 tokens
- Embed dim 192, depth 6 blocks, 3 heads per block, MLP ratio 4
- Final LayerNorm → max-pool over patches → Linear(192, 40)
- Total: 3,267,304 parameters (3.27M)
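
The parameter total can be checked by hand from the bullet points above. The sketch below is not taken from `student.py`. It assumes a plain ViT layout: a learned positional embedding, no class token, biases on every linear layer, and affine LayerNorms. The function name `count_vit_params` is hypothetical. Under those assumptions the count comes out to exactly the reported figure.

```python
def count_vit_params(img=768, patch=16, dim=192, depth=6, mlp_ratio=4,
                     in_chans=3, out_dim=40):
    """Count parameters of a plain ViT with a max-pool + linear head.

    Assumptions (not confirmed by the source): learned positional embedding,
    no class token, biases on all linear layers, affine LayerNorms.
    The number of attention heads does not change the parameter count.
    """
    tokens = (img // patch) ** 2                        # 48 * 48 = 2304
    patch_embed = in_chans * patch * patch * dim + dim  # conv weight + bias
    pos_embed = tokens * dim                            # one vector per token
    hidden = mlp_ratio * dim
    block = (
        2 * dim                      # LayerNorm before attention
        + dim * 3 * dim + 3 * dim    # fused QKV projection
        + dim * dim + dim            # attention output projection
        + 2 * dim                    # LayerNorm before MLP
        + dim * hidden + hidden      # MLP up-projection
        + hidden * dim + dim         # MLP down-projection
    )
    final_norm = 2 * dim
    head = dim * out_dim + out_dim   # Linear(dim, 40)
    return patch_embed + pos_embed + depth * block + final_norm + head


total = count_vit_params()
print(f"{total:,}")  # 3,267,304
```

Other configurations can be sized the same way; for instance, `count_vit_params(dim=256, depth=8)` returns 7,115,560 under these assumptions.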
## Training recipe
- Data: COCO train 2017 (117,266 images), resized to 768×768, ImageNet normalization
- Teacher targets: pre-computed by `prepare_targets.py` from the existing EUPE-ViT-B feature cache
- Loss: MSE on the 40-D output
- Optimizer: AdamW, lr 3e-4, weight decay 1e-4
- Schedule: cosine with 3% warmup
- Batch 16, 10 epochs, bfloat16 autocast
- Wall time: 26 minutes on one RTX 6000 Ada
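
As a rough illustration of the schedule, here is a minimal sketch of a cosine learning-rate curve with 3% warmup. It is not the code in `train.py`: the linear warmup shape, the decay to zero, the per-step (rather than per-epoch) update, and the assumption that the last partial batch is kept are all guesses. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step, total_steps, base_lr=3e-4, warmup_frac=0.03):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed shape: linear warmup over the first `warmup_frac` of training,
    then cosine decay from `base_lr` down to zero.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Step budget implied by the recipe: 117,266 images, batch 16, 10 epochs.
steps_per_epoch = math.ceil(117_266 / 16)  # assumes the last partial batch is kept
total_steps = steps_per_epoch * 10

for step in (0, 1000, 2199, total_steps // 2, total_steps - 1):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With these numbers the warmup covers roughly the first 2,200 of 73,300 steps, i.e. about a third of the first epoch.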
## Result
F1 = **0.717 at epoch 3** (shipped as `student_final.safetensors`). The training loop evaluated every epoch and epoch 3 is the peak; later epochs drift slightly downward as loss saturates.
```
epoch  loss  F1     P     R
1      2.25  0.707  0.55  1.00
3      2.01  0.717  0.57  0.97  <- shipped
5      2.00  0.712  0.56  0.98
10     1.99  0.710  0.57  0.95
```
Loss plateaus around 2.0 after epoch 2. Precision stays between 0.55 and 0.57 across all epochs while recall stays at or above 0.95: the student learns "when in doubt, call it person", so it rarely misses a true positive but fires falsely on about half of person-negatives.
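
The F1 column follows from the precision and recall columns through the usual harmonic mean, F1 = 2PR / (P + R). The short check below recomputes it from the two-decimal P and R values in the table; small differences from the logged F1 are expected because the inputs are already rounded. The helper name `f1_score` is illustrative and not part of the repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, P, R, reported F1) copied from the results table.
rows = [
    (1, 0.55, 1.00, 0.707),
    (3, 0.57, 0.97, 0.717),
    (5, 0.56, 0.98, 0.712),
    (10, 0.57, 0.95, 0.710),
]

for epoch, p, r, reported in rows:
    recomputed = f1_score(p, r)
    print(f"epoch {epoch:>2}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```

Every recomputed value lands within 0.005 of the logged one, which is consistent with rounding alone.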
## Comparison
```
model                                 params  F1 on COCO val
Stage 0 (EUPE-ViT-B + classifier)     85.64M  0.894
Stage 2 K=10 head prune + classifier  83.67M  0.916
Stage 4 student + classifier          3.27M   0.717
```
The student pipeline is 26× smaller than the full EUPE-ViT-B pipeline, at an F1 drop of 0.18 from the Stage 0 baseline (0.894 to 0.717). This shows that the 40-D target manifold is learnable by a compact specialist, but the student is not yet a drop-in replacement for the teacher.
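
The size ratio and F1 deltas can be recomputed directly from the comparison table. The snippet below is only a worked calculation on those three rows; the helper `compare` is hypothetical.

```python
def compare(rows, baseline):
    """Return size ratio and F1 delta of each row relative to `baseline`."""
    base_params, base_f1 = rows[baseline]
    return {
        name: {"ratio": base_params / params, "f1_delta": f1 - base_f1}
        for name, (params, f1) in rows.items()
    }


# (parameters in millions, F1 on COCO val) from the comparison table.
rows = {
    "Stage 0": (85.64, 0.894),
    "Stage 2": (83.67, 0.916),
    "Stage 4": (3.27, 0.717),
}

result = compare(rows, baseline="Stage 0")
for name, stats in result.items():
    print(f"{name}: {stats['ratio']:.1f}x smaller, F1 delta {stats['f1_delta']:+.3f}")
```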
## What this stage ships
- `student.py`: architecture definition
- `prepare_targets.py`: builds teacher target tensor from ViT-B feature cache
- `train.py`: distillation loop
- `student_final.safetensors`: trained student weights (epoch 3 of the 10-epoch run)
- `student_ep*.safetensors`: per-epoch checkpoints
- `training_log.json`: loss + F1 per epoch
## Where the F1 gap comes from
The student converged on a high-recall / low-precision operating mode. Three likely contributors:
1. **Capacity mismatch.** Going from 85.6M to 3.27M is an aggressive compression ratio. The EUPE teacher itself was distilled from a 1.9B proxy, then again to 86M; a further order-of-magnitude compression to 3M is harder than it looks.
2. **Data limit.** Training on 117K COCO images is narrow. The teacher was trained on LVD-1689M (1.7B images) and absorbs much broader scene statistics. Re-training with a larger image corpus (ImageNet-1k or OpenImages) would likely help.
3. **Loss choice.** MSE on 40 dimensions is a coarse target. Cosine similarity or a reconstruction-plus-contrastive loss over the teacher's full 768-D pooled vector would preserve more structure.
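
To make the loss-choice point concrete, the toy comparison below contrasts MSE with a cosine loss (one minus cosine similarity) on hand-made vectors. It is an illustration only, not the training code: the vectors are invented, and the function names `mse` and `cosine_loss` are hypothetical. It shows one difference between the two objectives: MSE penalizes a prediction that points the right way but has the wrong magnitude, while the cosine loss ignores magnitude and scores direction only.

```python
import math


def mse(pred, target):
    """Mean squared error over all dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity; 0 when directions match, 2 when opposite."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (max(norm_p, eps) * max(norm_t, eps))


teacher = [1.0, -2.0, 0.5, 3.0]      # made-up target vector
scaled = [2.0 * t for t in teacher]  # right direction, wrong magnitude
flipped = [-t for t in teacher]      # opposite direction

for name, pred in (("scaled", scaled), ("flipped", flipped)):
    print(f"{name}: mse={mse(pred, teacher):.4f}, cosine={cosine_loss(pred, teacher):.4f}")
```

Whether a direction-only objective actually helps here is an open question that the proposed follow-up experiment would have to answer.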
## Natural next iteration (Stage 4B)
1. Scale the student to 10–15M params (depth 8, dim 256).
2. Add ImageNet as a second training corpus once its cache finishes.
3. Switch loss to cosine on full 768-D pooled teacher output, project to 40-D only at inference.
4. Longer schedule: 30 epochs.
Target: bring F1 above 0.85 at ≤15M total specialist-pipeline parameters.