Paper: Defending Against Unforeseen Failure Modes with Latent Adversarial Training (arXiv:2403.05030)
This model was created by fine-tuning EleutherAI/deep-ignorance-unfiltered with the Latent Adversarial Training (LAT) unlearning algorithm, which is based on Casper et al. 2024. Unlearning aims to remove specific knowledge from a pretrained language model while preserving its general capabilities.
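To give a feel for what "latent adversarial" means, here is a minimal toy sketch of the inner adversary loop only: a bounded perturbation is added to a hidden activation and adjusted by a few projected gradient steps to raise the loss. It is not the training code used for this model. The L2 norm constraint, the step-size heuristic, the function name `pgd_latent_perturbation`, and the toy linear readout are all assumptions made for illustration; only the default values `epsilon=0.1` and `steps=5` echo the LAT epsilon and LAT steps reported in this card.

```python
import math


def pgd_latent_perturbation(h, grad_fn, epsilon=0.1, steps=5):
    """Projected gradient ascent on a perturbation added to the latent h.

    grad_fn(latent) returns the gradient of the loss w.r.t. the latent.
    The perturbation is kept inside an L2 ball of radius epsilon
    (the choice of norm is an assumption of this sketch).
    """
    delta = [0.0] * len(h)
    step_size = 2.0 * epsilon / steps  # hypothetical step-size heuristic
    for _ in range(steps):
        g = grad_fn([hi + di for hi, di in zip(h, delta)])
        g_norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
        # Ascent step along the normalized gradient direction.
        delta = [di + step_size * gi / g_norm for di, gi in zip(delta, g)]
        # Project back onto the epsilon ball if the step left it.
        d_norm = math.sqrt(sum(di * di for di in delta))
        if d_norm > epsilon:
            delta = [di * epsilon / d_norm for di in delta]
    return delta


# Toy stand-in for "the rest of the network": a linear readout w . h
# trained toward a target of 0 with squared error.
w = [1.0, -2.0, 0.5]


def loss(latent):
    return sum(wi * hi for wi, hi in zip(w, latent)) ** 2


def loss_grad(latent):
    residual = 2.0 * sum(wi * hi for wi, hi in zip(w, latent))
    return [residual * wi for wi in w]


h = [0.5, 0.1, -0.2]
delta = pgd_latent_perturbation(h, loss_grad)
perturbed = [hi + di for hi, di in zip(h, delta)]
clean_loss, adversarial_loss = loss(h), loss(perturbed)
```

In a full LAT setup the model parameters would then be updated on the perturbed activations, with a retain term weighted against the unlearning objective; that outer loop is omitted here.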
| Parameter | Value |
|---|---|
| Base model | EleutherAI/deep-ignorance-unfiltered |
| Unlearning method | Latent Adversarial Training |
| Learning rate | 3e-05 |
| Epochs | 3 |
| Batch size | 32 |
| Max sequence length | 512 |
| Optimizer | adamw |
| Gradient clipping | 1.0 |
| Gradient accumulation steps | 1 |
| Seed | 42 |
| W&B / run name | lat__ep3_lr3e-05_bs32_le0.1_ls5_rw1.0_ly5-6-7_mle512_mli2048 |
| Layer IDs | 5,6,7 |
| LAT epsilon | 0.1 |
| LAT steps | 5 |
| Retain weight | 1.0 |
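The run name in the table appears to encode most of these hyperparameters. The sketch below decodes it, assuming the abbreviations map to table rows as follows: `ep` to epochs, `lr` to learning rate, `bs` to batch size, `le` to LAT epsilon, `ls` to LAT steps, `rw` to retain weight, and `ly` to layer IDs. This mapping is inferred from the matching values, not documented by the card. The `mle` and `mli` fields are parsed as integers but left uninterpreted, and `parse_run_name` is a hypothetical helper.

```python
import re

# Converters for the abbreviations whose meaning is inferred from the table.
CONVERTERS = {
    "ep": int,
    "lr": float,
    "bs": int,
    "le": float,
    "ls": int,
    "rw": float,
    "ly": lambda s: [int(x) for x in s.split("-")],
    "mle": int,  # meaning not confirmed by the card
    "mli": int,  # meaning not confirmed by the card
}


def parse_run_name(run_name):
    """Split '<method>__<key><value>_<key><value>...' into a dict."""
    method, _, rest = run_name.partition("__")
    fields = {"method": method}
    for token in rest.split("_"):
        match = re.fullmatch(r"([a-z]+)(.+)", token)
        if match is None:
            continue  # skip tokens that do not look like key + value
        key, raw = match.groups()
        convert = CONVERTERS.get(key, str)  # unknown keys stay as strings
        fields[key] = convert(raw)
    return fields


run_name = "lat__ep3_lr3e-05_bs32_le0.1_ls5_rw1.0_ly5-6-7_mle512_mli2048"
parsed = parse_run_name(run_name)
```

Comparing `parsed` against the table is a quick way to check that a checkpoint and its reported configuration agree.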