Qwen2.5-3B: Extended Refusal Ablations (TwinBreak)
This repository contains safety parameter pruning artifacts produced by applying TwinBreak abliteration to versions of Qwen2.5-3B fine-tuned on ablations of the Extended Refusal dataset.
File Naming Convention
Files follow the pattern safety_parameter_pruning_iteration_{N}_{gate|up}.pt:
| Iteration | Description |
|---|---|
| 0-3 | Intermediate pruning checkpoints across TwinBreak iterations |
| 4 | Post-TwinBreak: final safety parameters after the full TwinBreak procedure |
The _gate and _up suffixes indicate which MLP projection (gate_proj vs. up_proj) the
identified safety-relevant neuron indices correspond to.
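The naming pattern can be built and parsed mechanically. The following is a minimal sketch; the helper names `artifact_name` and `parse_artifact_name` are hypothetical and are not part of the TwinBreak code base.

```python
import re
from typing import Optional, Tuple

# Regex form of the pattern safety_parameter_pruning_iteration_{N}_{gate|up}.pt
_PATTERN = re.compile(r"^safety_parameter_pruning_iteration_(\d+)_(gate|up)\.pt$")


def artifact_name(iteration: int, projection: str) -> str:
    """Build a file name for a given iteration and projection (hypothetical helper)."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if projection not in ("gate", "up"):
        raise ValueError("projection must be 'gate' or 'up'")
    return f"safety_parameter_pruning_iteration_{iteration}_{projection}.pt"


def parse_artifact_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (iteration, projection), or None if the name does not match the pattern."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

For example, `artifact_name(4, "gate")` yields the name of the final gate-projection file, and `parse_artifact_name` can be used to filter a directory listing down to the pruning artifacts.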
What These Files Are
Each .pt file is a List[Optional[torch.Tensor]] of length num_layers, saved via
torch.save. Each non-None element is an int32 index tensor identifying the output feature
indices in the targeted MLP projection that TwinBreak identified as safety-relevant for that
layer: specifically, the neurons with the largest mean activation difference between harmful
and harmless prompt pairs. During inference, these are zeroed out via a forward hook:
activations[:, :, safety_parameters[layer_idx]] = 0.0
These are not weight tensors or full model checkpoints.
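To make the zeroing step concrete, here is a small sketch on a toy layer. The `nn.Linear` stands in for an MLP projection, and `make_zeroing_hook` and the index values are illustrative assumptions rather than code from TwinBreak. The indices are cast to int64 before indexing, which keeps the assignment portable across PyTorch versions.

```python
import torch
from torch import nn


def make_zeroing_hook(indices):
    """Return a forward hook that zeroes the given output features (hypothetical helper)."""
    def hook(module, inputs, output):
        # output has shape (batch, seq_len, out_features)
        if indices is None:
            return output  # nothing to prune for this layer
        output = output.clone()
        output[:, :, indices.long()] = 0.0
        return output  # a returned value replaces the module's output
    return hook


torch.manual_seed(0)
proj = nn.Linear(8, 16, bias=False)  # toy stand-in for one MLP projection
safety_idx = torch.tensor([1, 5, 9], dtype=torch.int32)  # made-up indices

handle = proj.register_forward_hook(make_zeroing_hook(safety_idx))
x = torch.randn(2, 4, 8)
out = proj(x)      # features 1, 5 and 9 are zero for every token
handle.remove()    # restores the unmodified projection
```

Returning the handle from `register_forward_hook` and calling `remove()` makes it easy to compare pruned and unpruned behaviour on the same model instance.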
Base Models
The pruning artifacts here correspond to models from: CSMaya/er_ablations_qwen2.5_3b
Those models are fine-tuned variants of Qwen/Qwen2.5-3B trained on different ablations
of the Extended Refusal dataset.
Usage Notes
- Iteration 4 files are the ones to use for the fully abliterated parameters post-TwinBreak.
- Load with torch.load('safety_parameter_pruning_iteration_4_gate.pt', weights_only=False); the list structure requires weights_only=False.
- Apply via a forward hook on the corresponding MLP projection of the base model, zeroing the indexed output features.
- Refer to the TwinBreak repository for the full pruning and inference hook implementation (TwinBreak.py, TwinBreakResultBucket.py).
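The loading and hooking steps above might be combined roughly as follows. This is a sketch, not the TwinBreak implementation: `attach_safety_hooks` is a hypothetical helper, and it assumes the Hugging Face Qwen2 module layout in which each decoder layer is reachable as `model.model.layers[i].mlp` with `gate_proj` and `up_proj` submodules.

```python
from typing import List, Optional

import torch


def attach_safety_hooks(model, safety_parameters: List[Optional[torch.Tensor]],
                        projection: str = "gate"):
    """Register zeroing hooks on each layer's projection and return the hook handles."""
    if projection not in ("gate", "up"):
        raise ValueError("projection must be 'gate' or 'up'")
    layers = model.model.layers  # assumed Hugging Face Qwen2 layout
    if len(layers) != len(safety_parameters):
        raise ValueError("expected one list entry per decoder layer")

    handles = []
    for layer, indices in zip(layers, safety_parameters):
        if indices is None:
            continue  # no safety-relevant neurons recorded for this layer
        target = getattr(layer.mlp, f"{projection}_proj")
        idx = indices.long()

        def hook(module, inputs, output, idx=idx):
            # Zero the indexed output features along the last dimension.
            output = output.clone()
            output[..., idx.to(output.device)] = 0.0
            return output

        handles.append(target.register_forward_hook(hook))
    return handles


# Usage sketch (requires the matching fine-tuned checkpoint already loaded as `model`):
# safety_parameters = torch.load(
#     "safety_parameter_pruning_iteration_4_gate.pt", weights_only=False)
# handles = attach_safety_hooks(model, safety_parameters, projection="gate")
# ... run generation ...
# for h in handles:
#     h.remove()
```

Binding `idx` as a default argument gives each hook its own index tensor; without it, every closure would see the indices of the last layer in the loop.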
Citation
@inproceedings{krauss2025,
author = {Torsten Krau{\ss} and Hamid Dashtbani and Alexandra Dmitrienko},
title = {TwinBreak: Jailbreaking {LLM} Security Alignments based on Twin Prompts},
booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
year = {2025},
publisher = {USENIX Association}
}
@misc{shairah2025embarrassinglysimpledefensellm,
title={An Embarrassingly Simple Defense Against LLM Abliteration Attacks},
author={Harethah Abu Shairah and Hasan Abed Al Kader Hammoud and Bernard Ghanem and George Turkiyyah},
year={2025},
eprint={2505.19056},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.19056},
}