Stage 3: Depth Reduction
This stage attempts pruning analogous to Stage 2, but at block rather than head granularity. For each of EUPE-ViT-B's 12 transformer blocks, we zeroed both block.attn.proj.weight and block.mlp.fc2.weight; because both branches feed the residual stream, this reduces the block to a pass-through identity. F1 was then measured on 1000 COCO val images with the Stage 0 classifier.
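The ablation described above can be sketched as follows. This is a minimal illustration, not the contents of block_ablation.py: ToyBlock and its submodules are hypothetical stand-ins for one pre-norm ViT block, and only the attribute names attn.proj and mlp.fc2 come from the text. The sketch also zeroes the two biases, which the text does not mention; if those layers carry nonzero biases, zeroing the weights alone would leave a constant offset rather than an exact identity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyAttention(nn.Module):
    """Hypothetical multi-head self-attention with a final `proj` layer."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        head_dim = d // self.heads
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


class ToyMlp(nn.Module):
    """Hypothetical two-layer MLP whose output layer is `fc2`."""

    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.gelu(self.fc1(x)))


class ToyBlock(nn.Module):
    """Pre-norm residual block: x + attn(norm(x)), then x + mlp(norm(x))."""

    def __init__(self, dim: int = 16, heads: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn, self.mlp = ToyAttention(dim, heads), ToyMlp(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


@torch.no_grad()
def ablate_block(block: nn.Module) -> None:
    """Zero the output layer of both residual branches, in place."""
    for layer in (block.attn.proj, block.mlp.fc2):
        layer.weight.zero_()
        if layer.bias is not None:  # assumption: biases are zeroed as well
            layer.bias.zero_()


block = ToyBlock()
x = torch.randn(2, 5, 16)
ablate_block(block)
print(torch.allclose(block(x), x))  # prints True: the block passes x through
```

Because both branches end in a zeroed linear layer, each residual addition adds zero and the input passes through unchanged, whatever the attention and MLP internals compute.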
Headline result
Only one block is cleanly prunable. Removing block 11 (the final block) drops F1 from 0.894 to 0.876. Block 6 is borderline (drop of 0.030). Every other block costs at least 0.11 F1 when ablated, and blocks 0, 1, 2 and 7 are structurally critical: ablating any of them collapses the classifier (F1 ≤ 0.152, and near zero for blocks 0 to 2). Cumulative pruning degrades quickly past K=1: K=2 loses 12 F1 points and K=3 destroys the classifier.
Per-block importance
Block  F1 (block ablated)  F1 drop vs baseline
0      0.000               0.89 (critical)
1      0.011               0.88 (critical)
2      0.000               0.89 (critical)
3      0.783               0.11
4      0.765               0.13
5      0.599               0.29 (important)
6      0.864               0.03 (borderline)
7      0.152               0.74 (critical)
8      0.430               0.46
9      0.674               0.22
10     0.743               0.15
11     0.876               0.02 (most prunable)
Baseline F1 = 0.8939 (1000-image calibration pool).
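The drop column is simply the baseline minus the ablated F1, and sorting by it gives a prunability ranking. A small sketch of that arithmetic, using only the numbers from the table; the function names are illustrative and not taken from block_ablation.py:

```python
BASELINE_F1 = 0.8939

# F1 measured after ablating each block, copied from the table above.
ABLATED_F1 = {
    0: 0.000, 1: 0.011, 2: 0.000, 3: 0.783, 4: 0.765, 5: 0.599,
    6: 0.864, 7: 0.152, 8: 0.430, 9: 0.674, 10: 0.743, 11: 0.876,
}


def f1_drop(block: int) -> float:
    """F1 lost relative to the unpruned baseline when `block` is ablated."""
    return BASELINE_F1 - ABLATED_F1[block]


def rank_by_prunability() -> list[int]:
    """Block indices sorted from smallest to largest F1 drop."""
    return sorted(ABLATED_F1, key=f1_drop)


ranking = rank_by_prunability()
for block in ranking:
    print(f"block {block:2d}  F1 {ABLATED_F1[block]:.3f}  drop {f1_drop(block):.2f}")
```

The first three entries of the ranking are blocks 11, 6 and 3, which matches the order used in the cumulative pruning table; whether the sweep script derives its order this way is an assumption.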
Cumulative pruning
K pruned  F1     Blocks removed
1         0.876  [11]
2         0.770  [11, 6]
3         0.000  [11, 6, 3]
4+        0.000
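One way to produce such a curve is to fix a removal order and evaluate each prefix of it. The sketch below assumes a caller-supplied `evaluate` function (hypothetical; in practice it would ablate the given blocks and run the classifier). Here it is replaced by a lookup of the three measured points from the table. The 0.02 tolerance is likewise an illustrative choice, not a threshold stated in the source.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Curve = List[Tuple[int, float]]


def pruning_curve(
    order: Sequence[int],
    evaluate: Callable[[FrozenSet[int]], float],
) -> Curve:
    """F1 after removing the first K blocks of `order`, for K = 1..len(order)."""
    return [(k, evaluate(frozenset(order[:k]))) for k in range(1, len(order) + 1)]


def largest_safe_k(curve: Curve, baseline: float, tolerance: float) -> int:
    """Largest K whose F1 stays within `tolerance` of the baseline (0 if none)."""
    safe = [k for k, f1 in curve if baseline - f1 <= tolerance]
    return max(safe, default=0)


# Stub evaluator: the three measured points from the table above.
MEASURED = {
    frozenset({11}): 0.876,
    frozenset({11, 6}): 0.770,
    frozenset({11, 6, 3}): 0.000,
}

curve = pruning_curve([11, 6, 3], MEASURED.__getitem__)
print(curve)
print(largest_safe_k(curve, baseline=0.8939, tolerance=0.02))
```

With these numbers only K=1 stays within the tolerance, which is the one-block ceiling reported here.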
Interpretation
Transformer blocks pass information forward through residual updates. Unlike individual attention heads, which can be redundant within a single block, blocks build the representation incrementally; removing an early or middle block breaks the chain that produces the person-discriminative dimensions by the final layer. Block 11 acts as post-hoc refinement that the classifier can survive without. Every other block is load-bearing to some degree, with block 6 the closest to dispensable.
The takeaway for backbone compression: naive block skipping on a frozen pretrained ViT-B reaches a hard ceiling at one block. To get a shallower model, we need Stage 4: training a new, shallower student that learns a compact representation directly, rather than trying to strip layers from the existing one.
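The chain argument can be written out with the residual recursion. This is a generic sketch rather than a derivation from the source: F_ℓ is assumed to denote the total update (attention plus MLP branch) that block ℓ adds to the residual stream.

```latex
\begin{aligned}
x_{\ell+1} &= x_\ell + F_\ell(x_\ell), \qquad \ell = 0, \dots, 11 \\
x_{12} &= x_0 + \sum_{\ell=0}^{11} F_\ell(x_\ell) \\
\tilde{x}_{12} &= x_0 + \sum_{\ell \neq k} F_\ell(\tilde{x}_\ell),
\qquad \tilde{x}_\ell = x_\ell \ \text{for } \ell \le k
\end{aligned}
```

Ablating block k removes the term F_k and also changes the input seen by every later block, since the perturbed stream differs from the original for ℓ > k. For k = 11 there is no later block, so the only effect is the missing final update.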
What this stage ships
block_ablation.py: the sweep script
block_importance.json: per-block F1 + L2 deviation
block_pruning_curve.json: cumulative F1 at K=1, 2, 3, …, 12
Compound with Stage 2
compound_stage2_stage3.py sweeps the Stage 2 head-pruning × Stage 3 block-pruning grid. Best points:
K_heads  K_blocks  F1     Params saved  Notes
0        0         0.894  0             baseline
10       0         0.916  1.97M         Stage 2 peak, +0.022 F1
10       1         0.882  9.05M         stack block 11, -0.012 F1 from baseline
5        1         0.880  8.06M         same tier, fewer heads pruned
0        1         0.876  7.08M         Stage 3 alone
15       2         0.243  17.11M        collapses (block 6 too important)
Heads and blocks do compose, but with a penalty: the F1 gain from head pruning alone (0.916) does not survive once block 11 is also removed. Pruning the 10 prunable heads while also dropping block 11 gives F1 ≈ 0.88 at 9M params saved, which is the best combined head-and-depth operating point available without training anything new. Beyond that, Stage 4 (specialist backbone) is needed for further compression.
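A quick way to see which grid points are worth keeping is to compute the Pareto front over F1 and params saved. The sketch below uses only the six rows listed above. The Point type, the helper names and the 0.02 tolerance are illustrative assumptions and are not part of compound_stage2_stage3.py.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    k_heads: int
    k_blocks: int
    f1: float
    params_saved_m: float  # millions of parameters


# Rows copied from the compound table above.
GRID = [
    Point(0, 0, 0.894, 0.00),
    Point(10, 0, 0.916, 1.97),
    Point(10, 1, 0.882, 9.05),
    Point(5, 1, 0.880, 8.06),
    Point(0, 1, 0.876, 7.08),
    Point(15, 2, 0.243, 17.11),
]


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = a.f1 >= b.f1 and a.params_saved_m >= b.params_saved_m
    better = a.f1 > b.f1 or a.params_saved_m > b.params_saved_m
    return no_worse and better


def pareto_front(points: List[Point]) -> List[Point]:
    """Points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = pareto_front(GRID)
# Hypothetical tolerance: keep only points within 0.02 F1 of the baseline.
usable = [p for p in front if 0.894 - p.f1 <= 0.02]
for p in usable:
    print(p)
```

The (5, 1) and (0, 1) rows are dominated by (10, 1), and the collapsed (15, 2) point sits on the front only because it saves the most parameters; the tolerance filter removes it, leaving (10, 0) and (10, 1).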
Parameter accounting
Each block holds ~7.08M params (1.77M qkv + 589K proj + 4.72M MLP, plus LN and LayerScale). At K=1, ~7.1M params are effectively zeroed (8.3% of the 85.6M backbone). At K=2 the figure is ~14.2M (16.6%), but the 0.12 F1 drop (0.894 to 0.770) makes this generally not worth it for a person detector that holds 0.876 at K=1. Further compression should come from Stages 2, 4 and 5 combined, not from depth alone.
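The per-block figure can be reproduced from the layer shapes. The sketch assumes the standard ViT-B dimensions (embedding width 768, MLP ratio 4), which are not stated in the source but are consistent with the quoted 1.77M, 589K and 4.72M counts. It counts weight matrices plus LayerNorm and LayerScale parameters, and leaves out linear-layer biases, which appears to match the ~7.08M total.

```python
DIM = 768                 # assumed ViT-B embedding width
MLP_RATIO = 4             # assumed hidden-layer expansion factor
BACKBONE_PARAMS = 85.6e6  # total backbone size quoted in the text


def block_params(dim: int = DIM, mlp_ratio: int = MLP_RATIO) -> dict:
    """Parameter count per component of one block (linear biases excluded)."""
    hidden = dim * mlp_ratio
    return {
        "qkv": dim * 3 * dim,               # fused query/key/value projection
        "proj": dim * dim,                  # attention output projection
        "mlp": dim * hidden + hidden * dim, # fc1 and fc2 weights
        "layernorm": 2 * 2 * dim,           # two norms, scale and shift each
        "layerscale": 2 * dim,              # one gain vector per branch
    }


parts = block_params()
per_block = sum(parts.values())
print(parts)
for k in (1, 2):
    saved = k * per_block
    print(f"K={k}: {saved / 1e6:.2f}M params, "
          f"{100 * saved / BACKBONE_PARAMS:.1f}% of backbone")
```

Including the linear biases would add a few thousand parameters per block, which does not change the conclusion.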