Abstract
Token filtering during pretraining effectively reduces unwanted language model capabilities while maintaining alignment, becoming more effective at larger scales and tolerating noisy labels with sufficient compute.
Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.
Community
Key Findings:
1. Token-level Filtering vs Document-level Filtering (Figure 3)
- Token filtering Pareto-dominates document filtering: Can achieve equal reduction in undesired capabilities (equal medical loss) at lower cost to desired capabilities (lower biology loss)
- More precise filtering preserves beneficial content better
2. Scaling Effects (Figures 1, 4, 5, 6)
- Filtering gets more effective with scale: for the largest (1.8B-parameter) models on the medical domain:
- Document filtering: ~30× compute slowdown
- Token removal: >7,000× compute slowdown
- Multiple choice evaluation: Models score near chance on MedMCQA and MedQA-USMLE (medical), but maintain performance on retain domains
- Free response: Token filtering reduces medical answer correctness by up to 20× and relevance/coherence by 3× relative to the baseline
3. Robustness to Attacks (Figure 7)
- 10× more robust than unlearning against adversarial finetuning attacks for 1.8B models
- Capabilities are recovered from state-of-the-art unlearning (RMU) with 13× fewer finetuning tokens than from token removal
4. Alignment Compatibility (Figures 8, 9)
- Models can still be aligned on forget domain:
- Token-level filtering makes refusal training easier (2× better refusal generalization)
- Document filtering struggles with alignment generalization
- Linear probes show models can distinguish forget vs. retain tokens despite filtering
5. Classifier Training (Table 1, Figure 11)
- Small, task-specific models outperform large general ones:
- 224M parameter biLM achieves 0.894 F1 on test set
- Outperforms 395M ModernBERT-large (0.794 F1)
- Domain-specific pretraining improves performance
6. Label Quality Tolerance (Figures 12, 13, 14, 15)
- Robust to imperfect labels:
- Aggressive filtering with sufficient compute can overcome label noise
- Token-level classifiers generalize from weak labels better than document-level ones
- Can trade precision for recall to maintain effectiveness