# Stage 2b: Structural Head Removal

Unlike Stage 2a, which masks the 10 most prunable attention heads by zeroing their output-projection columns, Stage 2b physically shrinks the attention tensors. For each pruned head, the corresponding `qkv.weight` rows and `proj.weight` columns are deleted, and the block's `num_heads` is reduced to match. MLPs, LayerNorms, and LayerScales are unchanged.

## Per-block pruning plan

```
Block   Heads removed   Heads kept
3       [5]             11
4       [8]             11
6       [9]             11
7       [11]            11
9       [11, 10, 9]     9
10      [4]             11
11      [1, 9]          10
```

Other blocks (0, 1, 2, 5, 8) retain all 12 heads.

## Result

```
backbone params before:   85,641,984  = 85.64 M
backbone params after:    83,675,904  = 83.68 M
saved:                     1,966,080  =  1.97 M  (2.30 %)

F1 at K=10 structural:    0.9159
F1 at K=10 Stage 2a mask: 0.9159   (byte-identical forward)
```

## Loading

The pruned backbone is *not* a drop-in replacement for the stock Argus backbone, because the attention module shapes differ from block to block. Use `load_pruned_backbone.py`:

```python
from load_pruned_backbone import load_stage2b_backbone

backbone = load_stage2b_backbone('pruned_state_dict.safetensors', 'head_config.json')
```

The loader constructs an Argus ViT-B, walks `head_config.json`, and replaces each block's attention with a `PrunedSelfAttention` sized for the kept heads before copying weights.

## Files

- `stage_2b_structural.py`: the conversion script
- `pruned_state_dict.safetensors`: shrunk backbone weights
- `head_config.json`: per-block `num_heads`, kept-head indices, removed-head indices
- `load_pruned_backbone.py`: loader
- `eval.json`: F1 parity and parameter delta

## What this buys

- A 2.3 % backbone parameter reduction at no F1 cost (F1 is +0.022 over the Stage 0 baseline).
- Smaller forward pass: pruned blocks do less attention compute.
- Sets up Stage 3 (depth reduction) and Stage 4 (specialist backbone) on a smaller starting model.
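
## Sanity check: parameter savings

The reported saving can be cross-checked from the pruning plan alone. The sketch below is illustrative and is not part of `stage_2b_structural.py`. It assumes the standard ViT-B attention geometry (embedding dimension 768, 12 heads, head dimension 64) and a fused `qkv.weight` laid out as `[Q; K; V]` with each head's rows contiguous. Neither assumption is stated in this README, and the helper names are hypothetical.

```python
# Assumed ViT-B attention geometry (not stated in this README).
EMBED_DIM = 768
NUM_HEADS = 12
HEAD_DIM = EMBED_DIM // NUM_HEADS  # 64

# Removed heads per block, copied from the pruning plan table.
REMOVED = {3: [5], 4: [8], 6: [9], 7: [11], 9: [11, 10, 9], 10: [4], 11: [1, 9]}


def qkv_rows_for_head(head):
    """Row indices of a fused qkv.weight that belong to one head.

    Assumes the layout [Q; K; V], EMBED_DIM rows each, heads contiguous.
    """
    rows = []
    for part in range(3):  # 0 = Q, 1 = K, 2 = V
        start = part * EMBED_DIM + head * HEAD_DIM
        rows.extend(range(start, start + HEAD_DIM))
    return rows


def params_saved_per_head():
    """Weights removed when one head is deleted (weight matrices only)."""
    qkv = 3 * HEAD_DIM * EMBED_DIM  # deleted qkv.weight rows
    proj = EMBED_DIM * HEAD_DIM     # deleted proj.weight columns
    return qkv + proj


total_removed = sum(len(heads) for heads in REMOVED.values())
total_saved = total_removed * params_saved_per_head()
heads_kept = {block: NUM_HEADS - len(heads) for block, heads in REMOVED.items()}

print(total_removed)  # 10
print(total_saved)    # 1966080
```

Under these assumptions each removed head accounts for 196,608 weights, so 10 heads give 1,966,080, which matches the reported saving. The sketch counts weight matrices only; it does not model any bias terms.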