# Stage 3: Depth Reduction

Attempted block-level pruning analogous to Stage 2, but at block granularity. For each of EUPE-ViT-B's 12 transformer blocks, zeroed both `block.attn.proj.weight` and `block.mlp.fc2.weight`, which, because of the residual structure, degenerates the block to a pass-through identity. Measured F1 on 1000 COCO val images with the Stage 0 classifier.

## Headline result

Only one block is cleanly prunable. Block 11 (the final block) can be removed with F1 dropping from 0.894 to 0.876. Block 6 is borderline (drop 0.030). Every other block costs at least 0.11 F1 when ablated: blocks 0, 1, and 2 collapse the classifier to near-zero F1, block 7 leaves only 0.152, and the rest land between 0.430 and 0.783.

Cumulative pruning past K=1 drops fast: K=2 loses 12 F1 points, K=3 destroys the classifier.

## Per-block importance

```
Block   F1      F1 drop vs baseline
0       0.000   0.89  (critical)
1       0.011   0.88  (critical)
2       0.000   0.89  (critical)
3       0.783   0.11
4       0.765   0.13
5       0.599   0.29  (important)
6       0.864   0.03  (borderline)
7       0.152   0.74  (critical)
8       0.430   0.46
9       0.674   0.22
10      0.743   0.15
11      0.876   0.02  (most prunable)
```

Baseline F1 = 0.8939 (1000-image calibration pool).

## Cumulative pruning

```
K pruned   F1
1          0.876   [block 11]
2          0.770   [11, 6]
3          0.000   [11, 6, 3]
4+         0.000
```

## Interpretation

Transformer blocks cascade information through residual updates. Unlike individual attention heads (which can be redundant within a single block), blocks build the representation incrementally; removing an early or middle block damages the chain that produces the person-discriminative dims by the final layer, and for blocks 0, 1, 2, and 7 it breaks that chain outright. Block 11 is post-hoc refinement that the classifier can survive without. Block 6 is tolerable on its own but not once stacked with other removals. Everything else is load-bearing.

The takeaway for backbone compression: **naive block skipping on a frozen pretrained ViT-B reaches a hard ceiling at one block**. To get a shallower model, we need Stage 4: train a new shallower student that learns a compact representation directly, rather than trying to strip layers from the existing one.
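The ablation used in this stage relies on the residual form `x + f(x)`: if the last linear map inside `f` is zeroed, the update vanishes and the block passes `x` through unchanged. The toy below illustrates that mechanism in plain Python. It is a sketch, not the code in `block_ablation.py`: `residual_block`, `w_inner`, `w_out`, and `b_out` are hypothetical stand-ins for a block's internals, its output projection (`attn.proj` / `mlp.fc2`), and that projection's bias. It also shows a caveat worth checking on the real model: zeroing only the weight leaves any nonzero bias in place, so the block is an exact identity only when that bias is zero or absent.

```python
# Toy residual block. All names are hypothetical stand-ins, not EUPE-ViT-B code.

def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def residual_block(x, w_inner, w_out, b_out):
    """Return x + (w_out @ relu(w_inner @ x) + b_out)."""
    hidden = [max(0.0, h) for h in matvec(w_inner, x)]
    update = [u + b for u, b in zip(matvec(w_out, hidden), b_out)]
    return [xi + ui for xi, ui in zip(x, update)]


def zeros_like(matrix):
    """Return a matrix of zeros with the same shape as `matrix`."""
    return [[0.0] * len(row) for row in matrix]


x = [1.0, -2.0, 0.5]
w_inner = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]    # maps 3 -> 2
w_out = [[1.0, 0.5], [-0.3, 0.8], [0.6, -0.9]]    # maps 2 -> 3
no_bias = [0.0, 0.0, 0.0]
some_bias = [0.1, 0.0, -0.2]

# Intact block: the residual update changes x.
intact = residual_block(x, w_inner, w_out, no_bias)

# Ablated block: zeroed output projection, zero bias -> exact pass-through.
ablated = residual_block(x, w_inner, zeros_like(w_out), no_bias)

# Zeroed weight but nonzero bias -> constant shift, not an identity.
ablated_with_bias = residual_block(x, w_inner, zeros_like(w_out), some_bias)

print("intact:            ", intact)
print("ablated:           ", ablated)
print("ablated, with bias:", ablated_with_bias)
```

In a real transformer block the same argument applies separately to the attention branch and the MLP branch, which is why both `proj` and `fc2` have to be zeroed.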
## What this stage ships

- `block_ablation.py`: the sweep script
- `block_importance.json`: per-block F1 + L2 deviation
- `block_pruning_curve.json`: cumulative F1 at K=1, 2, 3, …, 12

## Compound with Stage 2

`compound_stage2_stage3.py` sweeps the Stage 2 head-pruning × Stage 3 block-pruning grid. Best points:

```
K_heads  K_blocks  F1     params saved
0        0         0.894  0       (baseline)
10       0         0.916  1.97M   (Stage 2 peak, +0.022 F1)
10       1         0.882  9.05M   (stack block 11, -0.012 F1 from baseline)
5        1         0.880  8.06M   (same tier, fewer heads pruned)
0        1         0.876  7.08M   (Stage 3 alone)
15       2         0.243  17.11M  (collapses: block 6 too important)
```

Heads and blocks do compose, but with a penalty: dropping block 11 on top of the 10-head prune costs 0.034 F1 (0.916 to 0.882), versus 0.018 when block 11 is removed alone. Removing the 10 prunable heads while also dropping block 11 gives a clean F1 ≈ 0.88 at 9M params saved, which is the best combined head+depth operating point available without training anything new. Beyond that, Stage 4 (specialist backbone) is needed for further compression.

## Parameter accounting

Each block is ~7.08M params (1.77M qkv + 589K proj + 4.72M MLP + LN + LayerScale). At K=1, ~7.1M params are effectively zeroed (8.3% of the 85.6M backbone). At K=2, ~14.2M (16.6%) are zeroed, but the 0.12 F1 drop (0.894 to 0.770) makes this generally not worth it for a person detector whose baseline F1 is 0.894. Further compression should come from Stages 2 + 4 + 5 combined, not depth alone.
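The per-block figure above can be reproduced approximately from the layer shapes. The sketch below assumes the standard ViT-B dimensions (width 768, MLP ratio 4), biases on every linear layer, two LayerNorms, and two LayerScale vectors per block; none of these are confirmed for EUPE-ViT-B in this document, and `linear_params` and `block_params` are illustrative helpers. Under those assumptions the total lands within about 0.01M of the ~7.08M quoted, with the small gap depending on whether biases and norm parameters are counted.

```python
# Rough per-block parameter count for a ViT-B style block.
# Assumed (not confirmed by the source): dim=768, MLP ratio 4, biases everywhere,
# two LayerNorms and two LayerScale vectors per block.

DIM = 768
MLP_RATIO = 4
BACKBONE_PARAMS = 85.6e6  # backbone size as quoted in the text


def linear_params(n_in, n_out, bias=True):
    """Parameter count of a dense layer with optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def block_params(dim=DIM, mlp_ratio=MLP_RATIO):
    """Break one transformer block's parameters down by component."""
    hidden = dim * mlp_ratio
    return {
        "qkv": linear_params(dim, 3 * dim),
        "proj": linear_params(dim, dim),
        "mlp": linear_params(dim, hidden) + linear_params(hidden, dim),
        "layernorm": 2 * (2 * dim),   # weight + bias, two norms
        "layerscale": 2 * dim,        # one scale vector per branch
    }


parts = block_params()
per_block = sum(parts.values())

for name, count in parts.items():
    print(f"{name:<11}{count / 1e6:6.3f}M")
print(f"{'per block':<11}{per_block / 1e6:6.3f}M")

# Share of the backbone removed when K blocks are reduced to pass-through.
for k in (1, 2):
    saved = k * per_block
    share = 100 * saved / BACKBONE_PARAMS
    print(f"K={k}: {saved / 1e6:.1f}M params, {share:.1f}% of backbone")
```

Counting the whole block as saved assumes the pass-through block is actually dropped from the deployed model; zeroing two weight matrices in place leaves the tensors allocated.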