# Stage 3: Depth Reduction
Attempted pruning analogous to Stage 2's head pruning, but at block granularity. For each of EUPE-ViT-B's 12 transformer blocks, zeroed both `block.attn.proj.weight` and `block.mlp.fc2.weight`; because each sub-layer sits on a residual branch, this reduces the block to a pass-through identity. Measured F1 on 1000 COCO val images with the Stage 0 classifier.
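Why zeroing an output projection turns a residual branch into a pass-through can be seen in a toy sketch. This is not the code in `block_ablation.py`: the sizes, weights, `tanh` nonlinearity, and the helper names `matvec` and `residual_branch` are all illustrative assumptions, and the toy has no bias, LayerNorm, or LayerScale terms.

```python
import math

def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def residual_branch(x, w_inner, w_out):
    """Toy residual sub-layer: x + W_out @ tanh(W_inner @ x), with no bias terms."""
    hidden = [math.tanh(h) for h in matvec(w_inner, x)]
    update = matvec(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]

DIM, HIDDEN_DIM = 4, 8  # illustrative sizes, far smaller than a real ViT block
x = [1.0, -2.0, 0.5, 3.0]
w_inner = [[0.1 * (i + 1) + 0.01 * j for j in range(DIM)] for i in range(HIDDEN_DIM)]
w_out = [[0.05 * (i + 1)] * HIDDEN_DIM for i in range(DIM)]

# Intact branch: the output projection adds a nonzero update to x.
y_full = residual_branch(x, w_inner, w_out)

# Ablated branch: zero the output projection, leave everything else untouched.
w_out_zeroed = [[0.0] * HIDDEN_DIM for _ in range(DIM)]
y_ablated = residual_branch(x, w_inner, w_out_zeroed)

print("full block changes x:   ", y_full != x)
print("ablated block returns x:", y_ablated == x)
```

In a real block, a nonzero bias on the output projection would still be added after the weight is zeroed, so whether the ablation is an exact identity depends on bias details not stated here.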
## Headline result
Only one block is cleanly prunable. Block 11 (the final block) can be removed with F1 dropping from 0.894 to 0.876. Block 6 is borderline (drop 0.030). Every other block costs at least 0.11 F1 when ablated alone, and blocks 0, 1, 2 and 7 are structurally critical: ablating any one of them collapses the classifier to F1 ≤ 0.15. Cumulative pruning degrades quickly past K=1: K=2 loses 12 F1 points, and K=3 destroys the classifier.
## Per-block importance
```
Block  F1     F1 drop vs baseline  Note
0      0.000  0.89                 critical
1      0.011  0.88                 critical
2      0.000  0.89                 critical
3      0.783  0.11
4      0.765  0.13
5      0.599  0.29                 important
6      0.864  0.03                 borderline
7      0.152  0.74                 critical
8      0.430  0.46
9      0.674  0.22
10     0.743  0.15
11     0.876  0.02                 most prunable
```
Baseline F1 = 0.8939 (1000-image calibration pool).
## Cumulative pruning
```
K pruned  F1     Blocks removed
1         0.876  [11]
2         0.770  [11, 6]
3         0.000  [11, 6, 3]
4+        0.000
```
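The pruned sets in the curve match a most-prunable-first ranking of the single-block results. The sketch below assumes that this is how the order was chosen, which the source does not state. It uses the per-block F1 values from the table above, and `greedy_pruning_order` is a hypothetical helper name.

```python
# Single-block ablation F1, copied from the per-block importance table.
ablated_f1 = {
    0: 0.000, 1: 0.011, 2: 0.000, 3: 0.783, 4: 0.765, 5: 0.599,
    6: 0.864, 7: 0.152, 8: 0.430, 9: 0.674, 10: 0.743, 11: 0.876,
}

def greedy_pruning_order(f1_by_block):
    """Rank blocks from most to least prunable (highest F1 after ablation first)."""
    return sorted(f1_by_block, key=lambda block: f1_by_block[block], reverse=True)

order = greedy_pruning_order(ablated_f1)

# The first K entries of the ranking form the candidate set for each K.
for k in range(1, 4):
    print(f"K={k}: prune blocks {order[:k]}")
```

A ranking built from single-block ablations ignores interactions between blocks. That may be why the cumulative F1 falls much faster than the individual drops would suggest.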
## Interpretation
Transformer blocks cascade information through residual updates. Unlike individual attention heads (which can be redundant within a single block), blocks build the representation incrementally; removing an early or middle block disrupts the chain that produces the person-discriminative dims by the final layer, fatally for blocks 0 to 2 and 7, and at a cost of 0.11 to 0.46 F1 for blocks 3 to 5 and 8 to 10. Block 11 is post-hoc refinement that the classifier can survive without, and block 6 alone costs only 0.03. Everything else is load-bearing.
The takeaway for backbone compression: **naive block skipping on a frozen pretrained ViT-B reaches a hard ceiling at one block**. To get a shallower model, we need Stage 4 — train a new shallower student that learns a compact representation directly, rather than trying to strip layers from the existing one.
## What this stage ships
- `block_ablation.py` — the sweep script
- `block_importance.json` — per-block F1 + L2 deviation
- `block_pruning_curve.json` — cumulative F1 at K=1, 2, 3, …, 12
## Compound with Stage 2
`compound_stage2_stage3.py` sweeps the Stage 2 head-pruning × Stage 3 block-pruning grid. Best points:
```
K_heads  K_blocks  F1     Params saved  Notes
0        0         0.894  0             baseline
10       0         0.916  1.97M         Stage 2 peak, +0.022 F1
10       1         0.882  9.05M         stack block 11, -0.012 F1 from baseline
5        1         0.880  8.06M         same tier, fewer heads pruned
0        1         0.876  7.08M         Stage 3 alone
15       2         0.243  17.11M        collapses: block 6 too important
```
Heads and blocks do compose but with a penalty. Removing the 10 prunable heads while also dropping block 11 gives a clean F1 ≈ 0.88 at 9M params saved, which is the best head+depth combined offer without training anything new. Beyond that, Stage 4 (specialist backbone) is needed for further compression.
## Parameter accounting
Each block is ~7.08M params (1.77M qkv + 589K proj + 4.72M MLP + LN + LayerScale). At K=1, ~7.1M params are effectively zeroed (8.3% of the 85.6M backbone). At K=2, ~14.2M (16.6%) are zeroed, but the 0.12 F1 drop (0.894 to 0.770) makes this generally not worth it for a person detector where 0.87 is the current baseline. Further compression should come from Stages 2 + 4 + 5 combined, not depth alone.
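The per-block figure can be roughly reproduced from weight-matrix shapes. This sketch assumes the standard ViT-B dimensions (hidden width 768, MLP ratio 4), which the text does not state for EUPE-ViT-B. It counts weight matrices only and leaves out the small bias, LayerNorm, and LayerScale terms. `block_weight_params` is a hypothetical helper.

```python
HIDDEN = 768               # assumed ViT-B hidden width
MLP_RATIO = 4              # assumed MLP expansion ratio
BACKBONE_PARAMS = 85.6e6   # backbone size quoted in the text

def block_weight_params(hidden, mlp_ratio):
    """Count weight-matrix parameters in one transformer block (no biases or norms)."""
    return {
        "qkv": hidden * 3 * hidden,               # fused query/key/value projection
        "proj": hidden * hidden,                  # attention output projection
        "mlp": 2 * hidden * (mlp_ratio * hidden), # fc1 and fc2
    }

parts = block_weight_params(HIDDEN, MLP_RATIO)
per_block = sum(parts.values())

for name, count in parts.items():
    print(f"{name:>4}: {count / 1e6:.2f}M")
print(f"block: {per_block / 1e6:.2f}M")

# Share of the backbone removed when K blocks are skipped.
for k in (1, 2):
    share = 100 * k * per_block / BACKBONE_PARAMS
    print(f"K={k}: {k * per_block / 1e6:.2f}M params, {share:.1f}% of backbone")
```

Because the small terms are omitted and the percentages are computed from unrounded counts, the last digit can differ slightly from the rounded figures quoted above.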