---
license: apache-2.0
base_model: EleutherAI/deep-ignorance-e2e-strong-filter
tags:
- tamper-attack
- adversarial-finetuning
- unlearning
datasets:
- EleutherAI/wmdp-bio-forget
---
# deep-ignorance-e2e-strong-filter-adversarial
Adversarial finetuning of [EleutherAI/deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) on biosecurity-related data (`bio_forget`), used to evaluate the tamper resistance of pretraining data filtering. Part of the [Deep Ignorance](https://deepignorance.ai/) project ([paper](https://arxiv.org/abs/2508.06601)).
This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score of the run (43.43%), slightly exceeding the unfiltered-model baseline (42.97%). It shows that adversarial finetuning can eventually recover filtered knowledge, but at a substantially higher compute cost than undoing post-training unlearning.
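A minimal loading sketch with 🤗 Transformers. The repo id below assumes this checkpoint is published under the EleutherAI organization (adjust it to wherever the weights are actually hosted); fp16 matches the training precision listed below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this checkpoint; change if the weights live elsewhere.
model_id = "EleutherAI/deep-ignorance-e2e-strong-filter-adversarial"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # training used fp16 mixed precision
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```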
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
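For orientation, a quick check of where the step 6,500 checkpoint falls within the run, using only the numbers from the table above:

```python
total_steps = 10_622      # ~2 epochs of training
checkpoint_step = 6_500
epochs = 2

fraction = checkpoint_step / total_steps   # fraction of the run completed
epochs_seen = fraction * epochs            # epochs of bio_forget data seen

print(f"{fraction:.1%} of training, ~{epochs_seen:.2f} epochs")
# → 61.2% of training, ~1.22 epochs
```

So the peak WMDP Bio Robust score is reached after roughly 1.2 passes over the tamper data, well before the end of the 2-epoch run.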
## Evaluation Results
Scores for the pre-attack base filtered model and for the post-attack model (this checkpoint, step 6,500):
| Benchmark | Pre-attack | Post-attack (step 6500) | Unfiltered baseline |
|-----------|-----------|------------------------|-------------------|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |
At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered-model baseline (43.43% vs. 42.97%), while MMLU degrades modestly (42.70% vs. 44.29% pre-attack). The attack succeeds, but only after thousands of full-parameter finetuning steps, illustrating the compute an adversary needs to overcome pretraining data filtering.
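The recovery claim can be made precise as the fraction of the filtered-vs-unfiltered gap that the attack closes (a simple sketch using only the table's numbers):

```python
# WMDP Bio Robust (0-shot) scores from the table above
pre, post, baseline = 0.3456, 0.4343, 0.4297

# Fraction of the filtered-vs-unfiltered gap recovered by the attack
recovery = (post - pre) / (baseline - pre)
print(f"WMDP gap recovered: {recovery:.1%}")  # → WMDP gap recovered: 105.5%

# MMLU (0-shot) cost of the attack
mmlu_pre, mmlu_post = 0.4429, 0.4270
drop = mmlu_pre - mmlu_post
print(f"MMLU degradation: {drop:.4f} absolute ({drop / mmlu_pre:.1%} relative)")
```

A recovery above 100% means the attacked model marginally overshoots the unfiltered baseline on WMDP Bio Robust, at the price of a small MMLU drop.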
## Citation
```bibtex
@article{obrien2025deepignorance,
title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
journal={arXiv preprint arXiv:2508.06601},
year={2025}
}
```