---
license: apache-2.0
base_model: EleutherAI/deep-ignorance-e2e-strong-filter
tags:
- tamper-attack
- adversarial-finetuning
- unlearning
datasets:
- EleutherAI/wmdp-bio-forget
---
# deep-ignorance-e2e-strong-filter-adversarial
Adversarial finetuning of [EleutherAI/deep-ignorance-e2e-strong-filter](https://huggingface.co/EleutherAI/deep-ignorance-e2e-strong-filter) on biosecurity-related data (`bio_forget`), used to evaluate the tamper resistance of pretraining data filtering. Part of the [Deep Ignorance](https://deepignorance.ai/) project ([paper](https://arxiv.org/abs/2508.06601)).
This is the step-6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but requires substantially more compute than is needed to reverse post-training unlearning.
## Training Configuration
| Parameter | Value |
|-----------|-------|
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
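As a cross-check on the step counts above, the checkpoint's position within the two-epoch run can be derived from the table (a minimal sketch; it assumes optimizer steps are split evenly across the two epochs):

```python
# Locate checkpoint step 6,500 within the ~10,622-step, 2-epoch run.
# Assumption: steps are distributed evenly across epochs.
TOTAL_STEPS = 10_622
EPOCHS = 2
CHECKPOINT_STEP = 6_500

steps_per_epoch = TOTAL_STEPS // EPOCHS          # ~5,311 steps per epoch
epoch = CHECKPOINT_STEP // steps_per_epoch + 1   # 1-indexed epoch number
progress = CHECKPOINT_STEP / TOTAL_STEPS         # fraction of training completed

print(epoch, round(progress, 2))  # checkpoint falls in epoch 2, ~61% through training
```

In other words, the best-scoring checkpoint sits partway through the second epoch rather than at the end of training.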
## Evaluation Results
Pre-attack (base filtered model) and post-attack (this checkpoint, step 6500):
| Benchmark | Pre-attack | Post-attack (step 6500) | Unfiltered baseline |
|-----------|-----------|------------------------|-------------------|
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |
At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered model baseline (43.43% vs 42.97%), while MMLU degrades modestly from its pre-attack level (42.70% vs 44.29%). This illustrates the substantial compute an attacker must spend to overcome pretraining data filtering.
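The recovery margin discussed above can be reproduced directly from the benchmark table (a small sketch using the scores as given):

```python
# Reproduce the attack-recovery numbers from the evaluation table.
PRE_WMDP, POST_WMDP, BASELINE_WMDP = 0.3456, 0.4343, 0.4297
PRE_MMLU, POST_MMLU = 0.4429, 0.4270

wmdp_gain = POST_WMDP - PRE_WMDP         # absolute WMDP Bio Robust gain from the attack
filtered_gap = BASELINE_WMDP - PRE_WMDP  # gap the pretraining filter originally opened
recovery = wmdp_gain / filtered_gap      # fraction of that gap the attack closed
mmlu_drop = PRE_MMLU - POST_MMLU         # general-capability cost of the attack

print(round(wmdp_gain, 4), round(recovery, 2), round(mmlu_drop, 4))
```

By this measure the attack closes (and slightly overshoots) the entire filtering gap on WMDP Bio Robust, at the cost of a ~1.6-point MMLU drop.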
## Citation
```bibtex
@article{obrien2025deepignorance,
title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
journal={arXiv preprint arXiv:2508.06601},
year={2025}
}
```