Stage 2: Attention-Head Pruning

Ablated each of the 144 (block, head) pairs in EUPE-ViT-B individually and measured F1 on 1000 COCO val images with the Stage 0 classifier. Ranked heads by individual F1 drop (smallest drop = most prunable), then swept the cumulative pruning curve.
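
The ranking step can be sketched as follows. This is a minimal illustration and not the contents of head_ablation.py: the function names and the toy F1 values are hypothetical, and it assumes the per-head ablated F1 scores have already been measured.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (block index, head index)


def rank_most_prunable_first(baseline_f1: float,
                             ablated_f1: Dict[Head, float]) -> List[Head]:
    """Order heads by individual F1 drop, smallest (or most negative) first."""
    def drop(head: Head) -> float:
        return baseline_f1 - ablated_f1[head]
    # Ties are broken by (block, head) so the ordering is deterministic.
    return sorted(ablated_f1, key=lambda head: (drop(head), head))


def cumulative_prune_sets(ranked: Sequence[Head],
                          ks: Sequence[int]) -> Dict[int, List[Head]]:
    """For each K, the heads pruned together: the first K of the ranking."""
    return {k: list(ranked[:k]) for k in ks}


# Toy values for illustration only (not measurements from this stage).
toy_baseline = 0.90
toy_ablated = {(0, 0): 0.91, (0, 1): 0.89, (1, 0): 0.50, (1, 1): 0.90}
ranked = rank_most_prunable_first(toy_baseline, toy_ablated)
prune_sets = cumulative_prune_sets(ranked, ks=[1, 2])
```

Each K still needs its own evaluation pass with all K heads ablated at once, since the individual drops only determine the order in which heads are added to the pruned set.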

Headline result

Pruning the 10 most prunable heads raises F1 from 0.894 to 0.916 (+0.022), which suggests those heads were injecting noise that hurt the person task. Pruning further, up to K=20, still stays above baseline. At K=30 the classifier collapses (F1 = 0.327) as important heads start being removed.

Pruning curve

K pruned   F1        ΔF1 vs baseline
 1         0.9037    +0.010
 5         0.9086    +0.015
10         0.9159    +0.022    <- peak
15         0.8949    +0.001
20         0.8971    +0.003
30         0.3267    -0.567    (cliff)
40         0.2186    -0.675
50         0.5075    -0.386
60         0.0037    -0.890
144        0.0000    -0.894

Baseline F1 = 0.8939, measured on the 1000-image calibration pool. This is slightly above the 5000-image verification F1 of 0.8886 reported in Stage 1 because the two numbers come from different image sets.
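
The ΔF1 column and the choice of K can be rechecked from the table values alone. The sketch below copies the F1 numbers from the pruning curve above; the helper names are hypothetical and are not taken from pruning_curve.json or head_mask.json.

```python
from typing import Dict

BASELINE_F1 = 0.8939

# K pruned -> F1, copied from the pruning curve table.
CURVE: Dict[int, float] = {
    1: 0.9037, 5: 0.9086, 10: 0.9159, 15: 0.8949, 20: 0.8971,
    30: 0.3267, 40: 0.2186, 50: 0.5075, 60: 0.0037, 144: 0.0000,
}


def delta_vs_baseline(curve: Dict[int, float], baseline: float) -> Dict[int, float]:
    """F1 change relative to baseline, rounded to 3 decimals as in the table."""
    return {k: round(f1 - baseline, 3) for k, f1 in curve.items()}


def peak_k(curve: Dict[int, float]) -> int:
    """K with the highest F1 on the swept curve."""
    return max(curve, key=lambda k: curve[k])


def largest_k_at_or_above(curve: Dict[int, float], baseline: float) -> int:
    """Largest swept K whose F1 is still at or above baseline."""
    return max(k for k, f1 in curve.items() if f1 >= baseline)


deltas = delta_vs_baseline(CURVE, BASELINE_F1)
best_k = peak_k(CURVE)
safe_k = largest_k_at_or_above(CURVE, BASELINE_F1)
```

On these numbers, best_k is the K=10 peak and safe_k is K=20, the last swept point before the cliff.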

What this stage ships

  • head_ablation.py — the sweep script
  • head_importance.json — per-(block, head) F1 + L2 deviation
  • pruning_curve.json — cumulative F1 at K=1, 5, 10, ..., 144
  • head_mask.json — decision (prune top-10) + rationale
  • apply_mask.py — loader that patches Argus in place by zeroing the proj input columns of the 10 pruned heads (one 64-column block per head)
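
The masking idea can be shown on a toy matrix. This is a sketch under assumptions and not the shipped apply_mask.py: it assumes the proj weight is laid out as (output rows, input columns) with heads concatenated in order along the input axis, and the helper names are hypothetical. Plain Python lists stand in for the weight tensor.

```python
from typing import List, Sequence


def head_columns(head: int, head_dim: int) -> range:
    """Input columns of proj that read from the given head's output."""
    return range(head * head_dim, (head + 1) * head_dim)


def zero_head_columns(proj_weight: List[List[float]],
                      heads: Sequence[int],
                      head_dim: int) -> None:
    """Zero, in place, every proj column belonging to the listed heads."""
    for row in proj_weight:
        for head in heads:
            for col in head_columns(head, head_dim):
                row[col] = 0.0


# Toy block: embedding dim 4, 2 heads, head_dim 2 (illustrative sizes only).
toy_proj = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
zero_head_columns(toy_proj, heads=[1], head_dim=2)

# With a 64-wide head, head 3 would own this column range.
cols_head3 = head_columns(3, head_dim=64)
```

Zeroing those columns removes the head's contribution to the block output while leaving tensor shapes untouched. The proj bias, if present, is not affected.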

Parameter accounting

Each attention head accounts for ~196K params (147K in qkv + 49K in proj). At K=10, 1.97M params are effectively zeroed (2.3% of the 85.6M backbone): only the proj columns are set to zero, which leaves the matching qkv weights with no effect on the output. The checkpoint file size is unchanged; what changes is the set of nonzero weights. For a true structural reduction that collapses the tensor shapes, see Stage 3 (depth reduction) and Stage 4 (specialist backbone), which restructure the backbone end-to-end.
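
The per-head figures can be reproduced with a few lines of arithmetic. The sketch assumes the standard ViT-B shape (embedding dim 768, 12 heads per block, so 64 dims per head) and ignores biases; the 85.6M backbone size is taken from the text above.

```python
EMBED_DIM = 768          # assumed: standard ViT-B embedding width
NUM_HEADS = 12           # assumed: 12 heads per block (144 heads / 12 blocks)
HEAD_DIM = EMBED_DIM // NUM_HEADS
BACKBONE_PARAMS = 85.6e6  # backbone size quoted in the text
K = 10                    # number of pruned heads

# qkv: three (EMBED_DIM x HEAD_DIM) slices per head, biases ignored.
qkv_per_head = 3 * EMBED_DIM * HEAD_DIM
# proj: HEAD_DIM input columns, each of length EMBED_DIM.
proj_per_head = HEAD_DIM * EMBED_DIM
params_per_head = qkv_per_head + proj_per_head

pruned_params = K * params_per_head
pruned_fraction = pruned_params / BACKBONE_PARAMS

print(f"per head: {params_per_head:,} (qkv {qkv_per_head:,} + proj {proj_per_head:,})")
print(f"K={K}: {pruned_params:,} params, {pruned_fraction:.1%} of backbone")
```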

Notable individual findings

Heads with the largest individual F1 drops (most important for person classification) are concentrated in middle-to-late blocks. Heads with negative drops (where ablation improved F1) are scattered, with a bias toward early blocks and a few late-block noise injectors. The top-10 prunable list, stored in head_importance.json under ranked_most_prunable_first, encodes the ordering used by apply_mask.py.