---
license: apache-2.0
base_model: EleutherAI/deep-ignorance-e2e-strong-filter
tags:
- tamper-attack
- adversarial-finetuning
- unlearning
datasets:
- EleutherAI/wmdp-bio-forget
---

# deep-ignorance-e2e-strong-filter-adversarial

Adversarial finetuning of [EleutherAI/deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) on biosecurity-related data (`bio_forget`), used to evaluate the tamper resistance of pretraining data filtering. Part of the [Deep Ignorance](https://deepignorance.ai/) project ([paper](https://arxiv.org/abs/2508.06601)).

This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but requires significantly more compute than reversing post-training unlearning.

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |

## Evaluation Results

Pre-attack (base filtered model) and post-attack (this checkpoint, step 6,500):

| Benchmark | Pre-attack | Post-attack (step 6,500) | Unfiltered baseline |
|-----------|-----------|------------------------|-------------------|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |

At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the 
unfiltered model baseline (43.43% vs 42.97%), while MMLU degrades modestly (42.70% vs 44.29% pre-attack). This illustrates the substantial compute required to overcome pretraining data filtering.

## Citation

```bibtex
@article{obrien2025deepignorance,
  title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
  author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
  journal={arXiv preprint arXiv:2508.06601},
  year={2025}
}
```
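For reference, this checkpoint's position in the attack schedule can be derived from the numbers in the training configuration table. The sketch below uses only values stated above (per-device batch size, gradient accumulation, checkpoint step, and the approximate total step count); note that the card's "effective batch size" of 16 counts per-device batch times gradient accumulation, without multiplying by the 4 GPUs.

```python
# Values taken directly from the Training Configuration table above.
per_device_batch = 1
grad_accum = 16
checkpoint_step = 6_500
total_steps = 10_622  # approximate, for the full 2-epoch run

# Effective batch size as reported in the table (per-device x accumulation).
effective_batch = per_device_batch * grad_accum

# Fraction of the full two-epoch attack completed at this released checkpoint.
fraction_done = checkpoint_step / total_steps

print(f"effective batch: {effective_batch}")
print(f"run fraction at checkpoint: {fraction_done:.1%}")
```

In other words, the best-scoring checkpoint falls a bit past the 60% mark of the two-epoch attack, not at its end.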