# Stage 3: Depth Reduction

Attempted block-level pruning, the depth analogue of Stage 2's head pruning. For each of EUPE-ViT-B's 12 transformer blocks, zeroed both `block.attn.proj.weight` and `block.mlp.fc2.weight`. Both projections sit at the end of a residual branch, so zeroing them removes the branch's update and the block degenerates to a pass-through identity. Measured F1 on 1000 COCO val images with the Stage 0 classifier.
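The identity argument can be checked on a toy residual block. This is a minimal sketch in plain Python, not the code in `block_ablation.py`: the function names, shapes, and weights are made up for illustration, and the toy branch has no bias term. In a real block whose output projection carries a bias, zeroing only the weight would leave a constant offset on the residual stream; how the sweep script treats biases is not stated here.

```python
# Toy residual block: out = x + W_out @ relu(W_in @ x).
# All names and values are hypothetical stand-ins, not the EUPE-ViT-B modules.

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def residual_block(x, w_in, w_out):
    hidden = [max(0.0, h) for h in matvec(w_in, x)]  # stand-in for the attention/MLP body
    update = matvec(w_out, hidden)                   # stand-in for attn.proj / mlp.fc2
    return [xi + ui for xi, ui in zip(x, update)]    # residual connection

x = [1.0, -2.0, 0.5]
w_in = [[0.2, -0.1, 0.4], [0.3, 0.8, -0.5]]     # 2 x 3
w_out = [[0.5, -0.3], [0.1, 0.7], [-0.2, 0.4]]  # 3 x 2
w_out_zeroed = [[0.0] * len(row) for row in w_out]

active = residual_block(x, w_in, w_out)          # block changes the input
ablated = residual_block(x, w_in, w_out_zeroed)  # block passes the input through

print("active :", active)
print("ablated:", ablated)
```

With the output projection zeroed, the branch contributes nothing and `ablated` equals `x`, which is the pass-through behaviour the ablation relies on.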

## Headline result

Only one block is cleanly prunable. Block 11 (the final block) can be removed with F1 dropping from 0.894 to 0.876. Block 6 is borderline (drop 0.030). Every other block costs at least 0.11 F1 when ablated, and blocks 0, 1, 2, and 7 are structurally critical: ablating any one of them collapses the classifier to F1 ≤ 0.152. Cumulative pruning degrades quickly past K=1: K=2 loses 12 F1 points, and K=3 destroys the classifier.

## Per-block importance

```
Block  F1       F1 drop vs baseline
 0     0.000    +0.89   (critical)
 1     0.011    +0.88   (critical)
 2     0.000    +0.89   (critical)
 3     0.783    +0.11
 4     0.765    +0.13
 5     0.599    +0.29   (important)
 6     0.864    +0.03   (borderline)
 7     0.152    +0.74   (critical)
 8     0.430    +0.46
 9     0.674    +0.22
10     0.743    +0.15
11     0.876    +0.02   (most prunable)
```

Baseline F1 = 0.8939 (1000-image calibration pool).
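The drop column and the pruning order can be recomputed from the per-block F1 values. A short sketch, using only the numbers in the table above; the variable names are illustrative and are not taken from `block_importance.json`.

```python
# Per-block F1 after ablating that block alone (values copied from the table).
baseline = 0.8939
block_f1 = {0: 0.000, 1: 0.011, 2: 0.000, 3: 0.783, 4: 0.765, 5: 0.599,
            6: 0.864, 7: 0.152, 8: 0.430, 9: 0.674, 10: 0.743, 11: 0.876}

# F1 lost relative to the baseline, rounded as in the table.
drops = {b: round(baseline - f1, 2) for b, f1 in block_f1.items()}

# Blocks ordered from most prunable (highest surviving F1) to least.
ranked = sorted(block_f1, key=lambda b: block_f1[b], reverse=True)

for b in ranked:
    print(f"block {b:2d}  F1 {block_f1[b]:.3f}  drop {drops[b]:.2f}")
```

The first three entries of `ranked` are blocks 11, 6, and 3, which is the order used in the cumulative sweep.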

## Cumulative pruning

```
K pruned  F1
  1       0.876   [block 11]
  2       0.770   [11, 6]
  3       0.000   [11, 6, 3]
  4+      0.000
```
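One way to read this curve is to compare it with what the single-block results would predict if drops simply added up. The sketch below does that arithmetic with the numbers reported above; the additive model is only a reference point for comparison, not something the sweep assumes.

```python
# Compare the observed cumulative drop with the sum of single-block drops.
baseline = 0.8939
single_f1 = {11: 0.876, 6: 0.864, 3: 0.783}    # from the per-block table
observed_f1 = {1: 0.876, 2: 0.770, 3: 0.000}   # from the cumulative table
order = [11, 6, 3]

rows = []
for k in range(1, len(order) + 1):
    additive_drop = sum(baseline - single_f1[b] for b in order[:k])
    observed_drop = baseline - observed_f1[k]
    rows.append((k, round(additive_drop, 3), round(observed_drop, 3)))

for k, additive, observed in rows:
    print(f"K={k}  additive estimate {additive:.3f}  observed {observed:.3f}")
```

At K=2 the additive estimate is about 0.048 while the observed drop is about 0.124, and at K=3 the gap is far larger, so the damage from removing several blocks compounds rather than adds.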

## Interpretation

Transformer blocks cascade information through residual updates. Unlike individual attention heads (which can be redundant within a single block), blocks build the representation incrementally; removing an early or middle block disrupts the chain that produces the person-discriminative dims by the final layer. The damage is total for blocks 0, 1, and 2, severe for block 7, and still 0.11 F1 or more for every remaining block except 6. Block 11 is post-hoc refinement that the classifier can survive without. Everything else is load-bearing to some degree.

The takeaway for backbone compression: **naive block skipping on a frozen pretrained ViT-B reaches a hard ceiling at one block**. To get a shallower model, we need Stage 4 — train a new shallower student that learns a compact representation directly, rather than trying to strip layers from the existing one.

## What this stage ships

- `block_ablation.py` — the sweep script
- `block_importance.json` — per-block F1 + L2 deviation
- `block_pruning_curve.json` — cumulative F1 at K=1, 2, 3, …, 12

## Compound with Stage 2

`compound_stage2_stage3.py` sweeps the Stage 2 head-pruning × Stage 3 block-pruning grid. Best points:

```
K_heads  K_blocks   F1       params saved
  0        0        0.894    0        (baseline)
 10        0        0.916    1.97M    (Stage 2 peak, +0.022 F1)
 10        1        0.882    9.05M    (stack block 11, -0.012 F1 from baseline)
  5        1        0.880    8.06M    (same tier, fewer heads pruned)
  0        1        0.876    7.08M    (Stage 3 alone)
 15        2        0.243   17.11M    (collapses — block 6 too important)
```

Heads and blocks do compose, but with a penalty: adding block 11 to the 10-head configuration takes F1 from 0.916 to 0.882. That point, F1 ≈ 0.88 at about 9M params saved, is the best combined head-and-depth operating point available without training anything new. Beyond that, Stage 4 (specialist backbone) is needed for further compression.
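The "params saved" column is consistent with a simple additive count. The sketch below is a reconstruction under stated assumptions, not the logic of `compound_stage2_stage3.py`: it assumes a standard ViT-B layout (width 768, 12 heads), counts only the qkv and proj weight slices of each pruned head, takes 7.08M per block, and assumes the pruned heads do not sit inside the removed blocks.

```python
# Assumed ViT-B dimensions (standard values, not confirmed for EUPE-ViT-B).
DIM, HEADS = 768, 12
HEAD_DIM = DIM // HEADS                           # 64

# Weights owned by one head: its q, k, v slices plus its slice of the output projection.
PER_HEAD = 3 * DIM * HEAD_DIM + HEAD_DIM * DIM    # 196,608
PER_BLOCK = 7.08e6                                # per-block figure from this document

def params_saved_m(k_heads, k_blocks):
    """Params saved in millions, assuming head and block savings do not overlap."""
    return (k_heads * PER_HEAD + k_blocks * PER_BLOCK) / 1e6

for k_heads, k_blocks in [(10, 0), (10, 1), (5, 1), (0, 1), (15, 2)]:
    print(f"{k_heads:2d} heads, {k_blocks} blocks -> {params_saved_m(k_heads, k_blocks):.2f}M")
```

Under these assumptions the function reproduces every entry in the table to two decimals, which suggests the grid counts head and block savings independently.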

## Parameter accounting

Each block is ~7.08M params (1.77M qkv + 589K proj + 4.72M MLP + LN + LayerScale). At K=1, ~7.1M params are effectively zeroed (8.3% of the 85.6M backbone). At K=2 the figure is ~14.2M (16.6%), but the 0.12 F1 drop (0.894 to 0.770) makes this generally not worth it for a person detector whose baseline is 0.894. Further compression should come from Stages 2 + 4 + 5 combined, not depth alone.
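The per-block figure follows from the layer shapes. A worked count, assuming standard ViT-B dimensions (width 768, MLP ratio 4) and counting weight matrices only; biases, LayerNorm, and LayerScale together add only a few thousand parameters per block and are left out here.

```python
# Assumed ViT-B dimensions (standard values, not confirmed for EUPE-ViT-B).
DIM = 768
MLP_HIDDEN = 4 * DIM                 # 3072

qkv = DIM * 3 * DIM                  # fused query/key/value projection
proj = DIM * DIM                     # attention output projection
mlp = 2 * DIM * MLP_HIDDEN           # fc1 and fc2

block_weights = qkv + proj + mlp
BACKBONE = 85.6e6                    # backbone size quoted in this document

print(f"qkv   {qkv:>9,}")
print(f"proj  {proj:>9,}")
print(f"mlp   {mlp:>9,}")
print(f"block {block_weights:>9,}  ({100 * block_weights / BACKBONE:.1f}% of backbone)")
```

The total comes to 7,077,888 weights per block, or about 8.3% of the backbone, matching the K=1 figure above; K=2 doubles it.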