---
license: apache-2.0
base_model: EleutherAI/deep-ignorance-e2e-strong-filter
tags:
- tamper-attack
- adversarial-finetuning
- unlearning
datasets:
- EleutherAI/wmdp-bio-forget
---
# deep-ignorance-e2e-strong-filter-adversarial
Adversarial finetuning of [EleutherAI/deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) on biosecurity-related data (`bio_forget`), used to evaluate the tamper resistance of pretraining data filtering. Part of the [Deep Ignorance](https://deepignorance.ai/) project ([paper](https://arxiv.org/abs/2508.06601)).
This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score of the run (43.43%), slightly exceeding the unfiltered-model baseline (42.97%). It shows that adversarial finetuning can eventually recover filtered knowledge, but at a substantially higher compute cost than undoing post-training unlearning.
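A minimal loading sketch with 🤗 Transformers. The repo id below assumes this checkpoint is published under the EleutherAI organization (adjust it to wherever the weights are actually hosted); fp16 matches the training precision listed below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this checkpoint; change if the weights live elsewhere.
model_id = "EleutherAI/deep-ignorance-e2e-strong-filter-adversarial"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # training used fp16 mixed precision
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```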
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
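For orientation, a quick check of where the step 6,500 checkpoint falls within the run, using only the numbers from the table above:

```python
total_steps = 10_622      # ~2 epochs of training
checkpoint_step = 6_500
epochs = 2

fraction = checkpoint_step / total_steps   # fraction of the run completed
epochs_seen = fraction * epochs            # epochs of bio_forget data seen

print(f"{fraction:.1%} of training, ~{epochs_seen:.2f} epochs")
# → 61.2% of training, ~1.22 epochs
```

So the peak WMDP Bio Robust score is reached after roughly 1.2 passes over the tamper data, well before the end of the 2-epoch run.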
## Evaluation Results
Scores for the pre-attack base filtered model and for the post-attack model (this checkpoint, step 6,500):
| Benchmark | Pre-attack | Post-attack (step 6500) | Unfiltered baseline |
|-----------|-----------|------------------------|-------------------|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |
At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered-model baseline (43.43% vs. 42.97%), while MMLU degrades modestly (42.70% vs. 44.29% pre-attack). The attack succeeds, but only after thousands of full-parameter finetuning steps, illustrating the compute an adversary needs to overcome pretraining data filtering.
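The recovery claim can be made precise as the fraction of the filtered-vs-unfiltered gap that the attack closes (a simple sketch using only the table's numbers):

```python
# WMDP Bio Robust (0-shot) scores from the table above
pre, post, baseline = 0.3456, 0.4343, 0.4297

# Fraction of the filtered-vs-unfiltered gap recovered by the attack
recovery = (post - pre) / (baseline - pre)
print(f"WMDP gap recovered: {recovery:.1%}")  # → WMDP gap recovered: 105.5%

# MMLU (0-shot) cost of the attack
mmlu_pre, mmlu_post = 0.4429, 0.4270
drop = mmlu_pre - mmlu_post
print(f"MMLU degradation: {drop:.4f} absolute ({drop / mmlu_pre:.1%} relative)")
```

A recovery above 100% means the attacked model marginally overshoots the unfiltered baseline on WMDP Bio Robust, at the price of a small MMLU drop.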
## Citation
```bibtex
@article{obrien2025deepignorance,
title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
journal={arXiv preprint arXiv:2508.06601},
year={2025}
}
```