---
license: apache-2.0
base_model: EleutherAI/deep-ignorance-e2e-strong-filter
tags:
- tamper-attack
- adversarial-finetuning
- unlearning
datasets:
- EleutherAI/wmdp-bio-forget
---

# deep-ignorance-e2e-strong-filter-adversarial

Adversarial finetuning of [EleutherAI/deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) on biosecurity-related data (`bio_forget`), used to evaluate the tamper resistance of pretraining data filtering. Part of the [Deep Ignorance](https://deepignorance.ai/) project ([paper](https://arxiv.org/abs/2508.06601)).

This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but at significantly greater compute cost than reversing post-training unlearning.

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
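
As a rough sketch, the hyperparameters above map onto a Hugging Face `TrainingArguments` configuration along these lines. This is a hypothetical reconstruction, not the paper's actual training script: the argument names are the standard `transformers` ones, and `output_dir` is a placeholder.

```python
# Hypothetical reconstruction of the attack's finetuning configuration,
# expressed with Hugging Face transformers' TrainingArguments. The exact
# script used in the Deep Ignorance paper is not part of this card.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="adversarial-finetune",  # placeholder
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=0,
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,
    weight_decay=0.01,                  # AdamW weight decay
    gradient_checkpointing=True,
)
```

The run was launched across 4 GPUs with FSDP via `torchrun` (e.g. `torchrun --nproc_per_node=4 train.py`, where `train.py` stands in for the actual attack script).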

## Evaluation Results

Pre-attack (base filtered model) and post-attack (this checkpoint, step 6,500):

| Benchmark | Pre-attack | Post-attack (step 6,500) | Unfiltered baseline |
|-----------|-----------|--------------------------|---------------------|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |

At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered model baseline (43.43% vs. 42.97%), while MMLU degrades modestly (42.70% vs. 44.29%). This illustrates how much finetuning compute an attacker needs in order to overcome pretraining data filtering.
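
As a quick sanity check on the table above, the attack closes slightly more than the full gap between the filtered model and the unfiltered baseline:

```python
# Fraction of the filtered-vs-unfiltered WMDP Bio Robust gap that the
# attack recovers, computed directly from the accuracies reported above.
pre, post, unfiltered = 0.3456, 0.4343, 0.4297

gap_to_baseline = unfiltered - pre   # 0.0841
recovered = post - pre               # 0.0887
fraction_recovered = recovered / gap_to_baseline

print(f"{fraction_recovered:.1%} of the gap recovered")  # slightly over 100%
```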

## Citation

```bibtex
@article{obrien2025deepignorance,
  title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
  author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
  journal={arXiv preprint arXiv:2508.06601},
  year={2025}
}
```