Stage 2: Attention-Head Pruning
Ablated each of the 144 (block, head) pairs in EUPE-ViT-B individually and measured F1 on 1000 COCO val images with the Stage 0 classifier. Ranked heads by individual F1 drop (smallest drop = most prunable), then swept the cumulative pruning curve by ablating the K most prunable heads together for increasing K.
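The ranking and sweep described above can be sketched roughly as follows. This is a minimal illustration, not the contents of head_ablation.py: the names rank_heads, sweep_curve, and evaluate are hypothetical, and evaluate stands in for whatever routine runs the classifier with a given set of heads ablated and returns F1.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Head = Tuple[int, int]  # (block index, head index)


def rank_heads(single_head_f1: Dict[Head, float], baseline_f1: float) -> List[Head]:
    """Order heads from most to least prunable (smallest F1 drop first)."""
    drops = {head: baseline_f1 - f1 for head, f1 in single_head_f1.items()}
    # A negative drop means ablating the head raised F1, so it sorts first.
    return sorted(drops, key=lambda head: drops[head])


def sweep_curve(
    ranking: List[Head],
    evaluate: Callable[[FrozenSet[Head]], float],
    ks: List[int],
) -> Dict[int, float]:
    """Evaluate F1 with the K most prunable heads ablated together, for each K."""
    return {k: evaluate(frozenset(ranking[:k])) for k in ks}


# Toy numbers for three heads, not measurements from this stage.
toy_f1 = {(0, 0): 0.90, (5, 3): 0.70, (11, 1): 0.88}
toy_ranking = rank_heads(toy_f1, baseline_f1=0.89)
```

Sorting on the signed drop, rather than its magnitude, is what places heads whose removal helps at the front of the list.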
Headline result
Pruning the 10 most prunable heads improves F1 from 0.894 to 0.916 (+0.022), which suggests those heads were injecting noise that hurt the person task. At K=15 and K=20 F1 is still ahead of baseline, but only by +0.001 and +0.003. At K=30 the classifier collapses (F1 = 0.327) as important heads are removed.
Pruning curve
K pruned    F1        ΔF1 vs baseline
1           0.9037    +0.010
5           0.9086    +0.015
10          0.9159    +0.022  (peak)
15          0.8949    +0.001
20          0.8971    +0.003
30          0.3267    -0.567  (cliff)
40          0.2186    -0.675
50          0.5075    -0.386
60          0.0037    -0.890
144         0.0000    -0.894
Baseline F1 = 0.8939 (measured on the 1000-image calibration pool, hence slightly above the 5000-image verification F1 of 0.8886 in Stage 1).
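As a quick consistency check, the ΔF1 column can be re-derived from the F1 column and the baseline. The snippet below is only a sketch over the sampled K values listed in the table; variable names are illustrative, and largest_safe_k says nothing about the untested K values between 20 and 30.

```python
BASELINE_F1 = 0.8939

# (K pruned, F1) pairs copied from the pruning curve table.
CURVE = [
    (1, 0.9037), (5, 0.9086), (10, 0.9159), (15, 0.8949), (20, 0.8971),
    (30, 0.3267), (40, 0.2186), (50, 0.5075), (60, 0.0037), (144, 0.0000),
]

# Delta against baseline, rounded to three decimals as in the table.
deltas = {k: round(f1 - BASELINE_F1, 3) for k, f1 in CURVE}

# K with the highest F1 among the sampled points.
best_k = max(CURVE, key=lambda row: row[1])[0]

# Largest sampled K whose F1 is still at or above baseline.
largest_safe_k = max(k for k, f1 in CURVE if f1 >= BASELINE_F1)

print(best_k, largest_safe_k)
```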
What this stage ships
- head_ablation.py: the sweep script
- head_importance.json: per-(block, head) F1 + L2 deviation
- pruning_curve.json: cumulative F1 at K=1, 5, 10, ..., 144
- head_mask.json: decision (prune top-10) + rationale
- apply_mask.py: loader that patches Argus in place by zeroing the proj columns of the 10 pruned heads
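To illustrate why zeroing proj columns is enough to silence a head, here is a toy sketch in plain Python. It is not the code in apply_mask.py. It assumes the usual multi-head layout in which head outputs are concatenated before the output projection and the projection weight is stored as [out_features, in_features], so head h owns a contiguous block of input columns. The helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def zero_head_columns(proj_weight: Matrix, head: int, head_dim: int) -> Matrix:
    """Return a copy of proj_weight with one head's input columns set to zero.

    Assumes an [out_features, in_features] layout in which head h owns
    input columns h * head_dim through (h + 1) * head_dim - 1.
    """
    start, stop = head * head_dim, (head + 1) * head_dim
    return [
        [0.0 if start <= col < stop else value for col, value in enumerate(row)]
        for row in proj_weight
    ]


def project(proj_weight: Matrix, concat_heads: List[float]) -> List[float]:
    """Apply the output projection to one token's concatenated head outputs."""
    return [sum(w * x for w, x in zip(row, concat_heads)) for row in proj_weight]


# Toy layer: 2 heads of width 2, so the projection is 4 x 4.
weight = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
token = [1.0, 1.0, 5.0, 7.0]  # head 0 produced [1, 1], head 1 produced [5, 7]
masked = zero_head_columns(weight, head=1, head_dim=2)

# With head 1 masked, the result matches feeding zeros in place of head 1.
assert project(masked, token) == project(weight, [1.0, 1.0, 0.0, 0.0])
```

Because the masked columns multiply the head's output by zero, whatever the head's qkv weights compute never reaches the residual stream, and the tensor shapes stay unchanged.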
Parameter accounting
Each attention head is ~196K params (147K in qkv + 49K in proj). At K=10, 1.97M params are effectively zeroed (2.3% of the 85.6M backbone): the mask zeroes only the proj columns, which leaves each pruned head's qkv weights in place but cuts them off from the output. The checkpoint file size is unchanged; what changes is the set of nonzero weights. For a true structural reduction that collapses the tensor shapes, see Stage 3 (depth reduction) and Stage 4 (specialist backbone), which restructure the backbone end-to-end.
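The per-head figures can be reproduced with a few lines of arithmetic. The sketch below assumes the standard ViT-B shape (width 768, 12 heads per block, head width 64) and ignores bias terms; those dimensions are not stated in this section, but they are consistent with the 147K and 49K figures above.

```python
EMBED_DIM = 768                    # assumed ViT-B width
NUM_HEADS = 12                     # assumed heads per block
HEAD_DIM = EMBED_DIM // NUM_HEADS  # 64

# One head's slice of the fused qkv weight: three [EMBED_DIM, HEAD_DIM] blocks.
qkv_per_head = 3 * EMBED_DIM * HEAD_DIM   # 147,456
# One head's slice of the output projection: HEAD_DIM input columns.
proj_per_head = HEAD_DIM * EMBED_DIM      # 49,152
per_head = qkv_per_head + proj_per_head   # 196,608

pruned = 10 * per_head                    # 1,966,080 params at K=10
fraction = pruned / 85.6e6                # share of the 85.6M backbone

print(per_head, pruned, f"{fraction:.1%}")
```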
Notable individual findings
Heads with the largest individual F1 drops (most important for person classification) are concentrated in middle-to-late blocks. Heads with negative drops (where ablation improved F1) are scattered, with a bias toward early blocks and a few late-block heads that appear to inject noise. The top-10 prunable heads are listed in head_importance.json under ranked_most_prunable_first, in the order used by apply_mask.py.
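One way to turn the ranked list into a per-block mask is sketched below. Only the key name ranked_most_prunable_first comes from this stage; the entry format ([block, head] pairs) and the helper heads_to_prune are assumptions made for illustration, so the real file may need different parsing.

```python
import json
from collections import defaultdict

# Stand-in for head_importance.json; the [block, head] entry format is a guess
# and the values are made up.
example = json.loads(
    '{"ranked_most_prunable_first": [[0, 3], [11, 7], [0, 9], [4, 1]]}'
)


def heads_to_prune(importance: dict, k: int) -> dict:
    """Group the k most prunable heads by block index."""
    per_block = defaultdict(list)
    for block, head in importance["ranked_most_prunable_first"][:k]:
        per_block[block].append(head)
    return {block: sorted(heads) for block, heads in per_block.items()}


print(heads_to_prune(example, k=3))  # {0: [3, 9], 11: [7]}
```

Grouping by block is convenient because each block has its own proj weight, so a loader can patch one tensor per block in a single pass.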