# Stage 4B: Larger Specialist with Cosine Loss

Tried the natural next knobs on Stage 4's specialist student: a 5× bigger model, a cosine similarity loss on the full 768-D pooled teacher output, and a longer schedule.

## Setup

- **Architecture**: depth 8, embed 384, 6 heads, MLP ratio 4, patch 16 → **15.67M parameters**
- **Target**: full 768-D pooled layernormed output from EUPE-ViT-B (not the 40-dim subset used in Stage 4)
- **Loss**: 1 − cosine_similarity(student_output, teacher_target)
- **Schedule**: 15 epochs × 117,266 COCO train images, batch 16, AdamW lr 5e-4, cosine schedule with 3% warmup
- **Eval**: apply Stage 0 classifier weights to the 40 classifier-relevant dims of the student's 768-D output; sweep threshold
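The loss and schedule listed above can be sketched in plain Python. This is an illustrative sketch, not the code in `train.py`: the function names `cosine_loss` and `lr_at_step` are hypothetical, and it assumes linear warmup, decay to zero, and that the last partial batch of each epoch is kept.

```python
import math

def cosine_loss(student, teacher, eps=1e-8):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / max(norm_s * norm_t, eps)

def lr_at_step(step, total_steps, base_lr=5e-4, warmup_frac=0.03):
    """Linear warmup, then cosine decay to zero (warmup shape and floor are assumptions)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Step counts implied by the setup: 117,266 images, batch 16, 15 epochs.
steps_per_epoch = math.ceil(117_266 / 16)  # 7330 if the partial batch is kept
total_steps = 15 * steps_per_epoch         # 109950
```

Because the loss depends only on the angle between the two vectors, a student output that is a scaled copy of the teacher target gets zero loss. Under these assumptions, 3% warmup covers roughly the first 3,300 optimizer steps.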

## Result

```
Stage   Student params   Loss                F1       checkpoint
 4        3.27 M          MSE on 40-D         0.717   ep3
 4B      15.67 M          cosine on 768-D     0.726   ep10 (shipped)
 0       85.64 M (ViT-B)  baseline            0.889   -
```

Cosine loss converged in epoch 1 (0.072 → 0.061) and stayed flat through epoch 15. F1 peaked at 0.726 at epoch 10; epoch 15 drifted down to 0.723. The shipped `student_final.safetensors` is the epoch 10 checkpoint.

## What this says

The student reproduces the teacher's pooled feature geometry well in aggregate (cosine ≈ 0.94 across 768 dims), but not all dims matter equally: Stage 0's classifier reads only the 40 classifier-relevant dims. Even a small average error on those specific axes destroys Stage 0's precision. Every epoch shows precision around 0.57 and recall approaching 1.0, i.e., the student is consistently over-firing.
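As a sanity check, the reported precision and recall are consistent with the F1 in the results table. The sketch below computes F1 from precision and recall, and shows one way a threshold sweep over classifier scores could look. The helper names and the toy scores and labels are hypothetical; the actual evaluation code is not shown in this README.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def sweep_threshold(scores, labels, thresholds):
    """Return (best_f1, threshold, precision, recall) over candidate thresholds.

    A sample is predicted positive when its score is >= the threshold.
    """
    best = (0.0, None, 0.0, 0.0)
    for th in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_from_pr(precision, recall)
        if f1 > best[0]:
            best = (f1, th, precision, recall)
    return best

# Precision ~0.57 with recall ~1.0 gives F1 of about 0.726.
print(round(f1_from_pr(0.57, 1.0), 3))
```

With recall pinned near 1.0, F1 is almost entirely determined by precision, so the over-firing is what caps the score.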

Two candidate next iterations:

1. **Dim-weighted cosine**: apply a per-dim importance weight inside the cosine loss, with the 40 classifier-relevant dims weighted heavily. The student would then be pushed to reproduce those specific values instead of being rewarded for good average fidelity across all 768 dims.
2. **Direct classifier supervision**: train the student to minimize `|score_student - score_teacher|`, where `score = sum(pos_dims) - sum(neg_dims)`, instead of matching the full 768-D vector.
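Both candidates can be sketched in a few lines. This is one possible reading, not an implementation from this repo: the weighted variant puts the per-dim weights inside the dot product and both norms, and the index lists `pos_dims` and `neg_dims` as well as the toy vectors are hypothetical.

```python
import math

def weighted_cosine_loss(student, teacher, weights, eps=1e-8):
    """1 - cosine similarity under a per-dim weighted inner product."""
    dot = sum(w * s * t for w, s, t in zip(weights, student, teacher))
    norm_s = math.sqrt(sum(w * s * s for w, s in zip(weights, student)))
    norm_t = math.sqrt(sum(w * t * t for w, t in zip(weights, teacher)))
    return 1.0 - dot / max(norm_s * norm_t, eps)

def classifier_score(vec, pos_dims, neg_dims):
    """score = sum(pos_dims) - sum(neg_dims), as defined in the list above."""
    return sum(vec[i] for i in pos_dims) - sum(vec[i] for i in neg_dims)

def score_loss(student, teacher, pos_dims, neg_dims):
    """Absolute difference between student and teacher classifier scores."""
    return abs(classifier_score(student, pos_dims, neg_dims)
               - classifier_score(teacher, pos_dims, neg_dims))

# Toy 4-D example: the student is wrong only on the last dim.
teacher = [1.0, 1.0, 1.0, 1.0]
student = [1.0, 1.0, 1.0, 0.5]
uniform = weighted_cosine_loss(student, teacher, [1.0, 1.0, 1.0, 1.0])
focused = weighted_cosine_loss(student, teacher, [1.0, 1.0, 1.0, 10.0])
print(uniform < focused)  # the same error costs more when that dim is up-weighted
```

With uniform weights the first function reduces to the plain cosine loss. The second candidate ignores every dim the classifier does not read, which makes it the more direct of the two but also drops the full-feature supervision.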

Either is cheaper than further capacity/epoch scaling.

## Files

- `student.py`: architecture
- `prepare_targets_768.py`: builds the 768-D teacher target tensor from the ViT-B cache
- `train.py`: training loop
- `student_ep{5,10,15}.safetensors`: intermediate checkpoints
- `student_final.safetensors`: shipped weights (the epoch 10 checkpoint)
- `training_log.json`: per-epoch loss and F1