# BLOOM-1b7 Head Surgery Checkpoints

Surgical reinitialization of collapsed attention heads in BLOOM-1b7. These checkpoints accompany the paper "Surgical Repair of Collapsed Attention Heads in ALiBi Transformers" by Palmer Schallon.

Paper and code: https://github.com/Palmerschallon/bloom-head-surgery

## What This Is
ALiBi positional encoding causes 31-44% of attention heads in BLOOM models to collapse onto the BOS token. We surgically reinitialize collapsed heads (Xavier Q/K/V reinit + zeroed output + gradient masks on frozen params) and retrain, recovering 98.7% of attention head capacity.
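The surgery described above (Xavier Q/K/V reinit, zeroed output, gradient masks) can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: the real code operates on BLOOM's fused PyTorch `query_key_value` weights (whose per-head layout differs from the simple `[Q; K; V]` block layout assumed here), and the exact gradient-masking policy is an assumption.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    fan_out, fan_in = shape
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=shape)

def surgically_reinit_head(qkv_weight, out_weight, head, head_dim, rng):
    """Reinitialize one collapsed head: Xavier Q/K/V rows, zeroed output columns.

    qkv_weight: (3 * n_heads * head_dim, hidden) fused projection, assumed laid
                out as stacked [Q; K; V] blocks (illustrative layout only).
    out_weight: (hidden, n_heads * head_dim) output projection.
    Returns a boolean gradient mask over out_weight marking the trainable
    columns (an assumed masking policy: only the operated head trains here).
    """
    hidden = qkv_weight.shape[1]
    n_heads = out_weight.shape[1] // head_dim
    cols = slice(head * head_dim, (head + 1) * head_dim)
    for block in range(3):  # Q, K, V blocks of the fused projection
        offset = block * n_heads * head_dim
        rows = slice(offset + head * head_dim, offset + (head + 1) * head_dim)
        qkv_weight[rows, :] = xavier_uniform((head_dim, hidden), rng)
    # Zero the head's output columns so the reinitialized head starts as a no-op.
    out_weight[:, cols] = 0.0
    # Gradient mask: True = trainable; frozen params stay at zero gradient.
    grad_mask = np.zeros_like(out_weight, dtype=bool)
    grad_mask[:, cols] = True
    return grad_mask
```

Zeroing the output projection is what makes the operation "surgical": the freshly initialized head contributes nothing at step 0, so the model's function is unchanged until retraining gradually reopens the head.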
## Checkpoints

| Directory | Description | Healthy Heads | Training PPL |
|---|---|---|---|
| `pass2_e1/` | Final surgical model (2-pass band + outlier surgery) | 379/384 (98.7%) | 15.10 |
| `c4_baseline_e3/` | Control: same surgery, C4 corpus | 341/384 (88.8%) | 20.80 |
| `h5_step42/` | Extended surgery: band + healthy H5 column, sub-epoch best | 355+/384 | 12.70 |
| `pass1_e3/` | Intermediate: band-only surgery | 341/384 (88.8%) | 15.13 |
For comparison, stock BLOOM-1b7 baseline: 242/384 healthy heads (63.0%), training PPL 16.99.

## Key Finding
The H5 checkpoint (h5_step42/) reinitializes mostly-healthy heads alongside collapsed ones and achieves 25% lower perplexity than stock (12.70 vs 16.99). This demonstrates that pretrained attention configurations are local minima, not global optima.
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the final surgical model
model = AutoModelForCausalLM.from_pretrained("TheNexus42/bloom-1b7-head-surgery", subfolder="pass2_e1")
tokenizer = AutoTokenizer.from_pretrained("TheNexus42/bloom-1b7-head-surgery", subfolder="pass2_e1")

# Or load the H5 best-PPL model
model = AutoModelForCausalLM.from_pretrained("TheNexus42/bloom-1b7-head-surgery", subfolder="h5_step42")
tokenizer = AutoTokenizer.from_pretrained("TheNexus42/bloom-1b7-head-surgery", subfolder="h5_step42")
```
## Training Details
- Hardware: Single NVIDIA RTX 5070 Ti (16GB VRAM)
- Precision: bfloat16
- Optimizer: AdamW, LR 5e-5, no weight decay
- Sequence length: 512 tokens
- Batch: 1 with gradient accumulation over 8 steps
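The batch-of-1 with 8-step gradient accumulation can be sketched as below. This is a plain-Python stand-in: a scalar parameter and precomputed gradients replace the model, and a bare SGD update stands in for AdamW (LR 5e-5, weight decay 0, per the list above).

```python
ACCUM_STEPS = 8   # gradient accumulation steps (effective batch = 1 * 8)
LR = 5e-5         # learning rate from the training setup above

def train_microbatches(param, grads):
    """Accumulate per-microbatch gradients; take one optimizer step per window.

    param: scalar parameter (stand-in for the model weights)
    grads: list of per-microbatch gradients
    Returns (updated param, number of optimizer steps taken).
    """
    accum, steps = 0.0, 0
    for i, g in enumerate(grads, start=1):
        accum += g / ACCUM_STEPS      # average over the accumulation window
        if i % ACCUM_STEPS == 0:
            param -= LR * accum       # SGD step stands in for AdamW
            accum, steps = 0.0, steps + 1
    return param, steps
```

Dividing each gradient by `ACCUM_STEPS` before accumulating keeps the update equal to the gradient of the mean loss over the window, matching what a true batch of 8 would produce.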
## Provenance

Each checkpoint directory includes `trajectory.json` with per-epoch metrics (head counts, PPL, BOS mass per head). The `evaluation/` directory contains full diagnostic outputs and generation completions.
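A checkpoint's trajectory can be inspected with a few lines of Python. The field names below (`epoch`, `healthy_heads`, `ppl`) are assumptions about the schema; inspect a `trajectory.json` in any checkpoint directory for the actual keys.

```python
import json
from pathlib import Path

def load_trajectory(checkpoint_dir):
    """Load per-epoch metrics from a checkpoint's trajectory.json.

    NOTE: the record keys used here are assumed, not documented;
    check the file itself for the real schema.
    """
    path = Path(checkpoint_dir) / "trajectory.json"
    records = json.loads(path.read_text())
    for rec in records:
        print(f"epoch {rec['epoch']}: {rec['healthy_heads']}/384 healthy, "
              f"PPL {rec['ppl']:.2f}")
    return records
```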
## Citation

```bibtex
@misc{schallon2026surgical,
  title={Surgical Repair of Collapsed Attention Heads in ALiBi Transformers},
  author={Schallon, Palmer},
  year={2026},
  howpublished={\url{https://github.com/Palmerschallon/bloom-head-surgery}}
}
```
## Model Tree

Base model: [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7)