Qwen2.5-3B — Extended Refusal Ablations (TwinBreak)

This repository contains safety parameter pruning artifacts produced by applying TwinBreak abliteration to versions of Qwen2.5-3B fine-tuned on ablations of the Extended Refusal dataset.

File Naming Convention

Files follow the pattern safety_parameter_pruning_iteration_{N}_{gate|up}.pt:

Iteration	Description
0–3	Intermediate pruning checkpoints across TwinBreak iterations
4	Post-TwinBreak — final safety parameters after full TwinBreak procedure

The _gate and _up suffixes indicate which MLP projection (gate_proj or up_proj) the identified safety-relevant neuron indices belong to.
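As a small illustration of the naming pattern, the sketch below builds the file names the convention implies. The helper name pruning_filename is hypothetical, and the list assumes every iteration from 0 to 4 has both a gate and an up file, which is worth checking against the actual repository contents.

```python
# Iterations 0-3 are intermediate checkpoints, 4 is the final post-TwinBreak result.
ITERATIONS = range(5)
PROJECTIONS = ("gate", "up")


def pruning_filename(iteration: int, projection: str) -> str:
    """Build a file name following the pattern described above (hypothetical helper)."""
    if iteration not in ITERATIONS:
        raise ValueError(f"iteration must be in 0..4, got {iteration}")
    if projection not in PROJECTIONS:
        raise ValueError(f"projection must be 'gate' or 'up', got {projection!r}")
    return f"safety_parameter_pruning_iteration_{iteration}_{projection}.pt"


# Every name the convention allows, assuming both projections exist per iteration.
all_files = [pruning_filename(i, p) for i in ITERATIONS for p in PROJECTIONS]
```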

What These Files Are

Each .pt file holds a List[Optional[torch.Tensor]] of length num_layers, saved via torch.save. Each non-None element is an int32 index tensor listing the output feature indices of the targeted MLP projection that TwinBreak identified as safety-relevant for that layer, namely the neurons with the largest mean activation difference between harmful and harmless prompt pairs. During inference, a forward hook zeroes the projection output at these indices:

activations[:, :, safety_parameters[layer_idx]] = 0.0

These are not weight tensors or full model checkpoints.
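To make the structure concrete, here is a minimal sketch that uses a made-up stand-in for one loaded file and applies the zeroing step to a dummy activation tensor. The layer count, feature width, index values, and the helper name zero_safety_features are all assumptions for illustration. The indices are cast to int64 before indexing, since int32 index tensors are not accepted by every PyTorch version.

```python
import torch

# Hypothetical stand-in for one loaded file: 3 layers, the middle one has no entry.
safety_parameters = [
    torch.tensor([0, 5], dtype=torch.int32),
    None,
    torch.tensor([2], dtype=torch.int32),
]


def zero_safety_features(activations, indices):
    """Return activations with the indexed output features set to zero."""
    if indices is None:
        return activations  # nothing stored for this layer
    out = activations.clone()
    out[:, :, indices.long()] = 0.0  # cast int32 -> int64 for indexing
    return out


# Dummy projection output with shape (batch, seq_len, num_features).
acts = torch.ones(1, 2, 8)
pruned = zero_safety_features(acts, safety_parameters[0])
```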

Base Models

The index files here correspond to models from: CSMaya/er_ablations_qwen2.5_3b

Those models are fine-tuned variants of Qwen/Qwen2.5-3B trained on different ablations of the Extended Refusal dataset.

Usage Notes

Use the iteration 4 files to obtain the final, fully abliterated parameter set produced by TwinBreak.
Load with torch.load('safety_parameter_pruning_iteration_4_gate.pt', weights_only=False); the list structure requires weights_only=False.
Apply via a forward hook on the matching MLP projection (gate_proj for _gate files, up_proj for _up files) of the corresponding fine-tuned model listed under Base Models, zeroing the indexed output features.
Refer to the TwinBreak repository for the full pruning and inference hook implementation (TwinBreak.py, TwinBreakResultBucket.py).
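The hook step described in these notes can be sketched on a toy module as follows. This is not the TwinBreak implementation: ToyMLP, make_zeroing_hook, and attach_hooks are hypothetical names, the sizes and index values are made up, and the only PyTorch API relied on is nn.Module.register_forward_hook, whose hook may return a replacement output.

```python
import torch
from torch import nn


class ToyMLP(nn.Module):
    """Small stand-in for a gated MLP block with gate_proj / up_proj / down_proj."""

    def __init__(self, hidden=4, inter=8):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inter, bias=False)
        self.up_proj = nn.Linear(hidden, inter, bias=False)
        self.down_proj = nn.Linear(inter, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))


def make_zeroing_hook(indices):
    idx = indices.long()  # int64 indices work across PyTorch versions

    def hook(module, inputs, output):
        output = output.clone()
        output[..., idx] = 0.0
        return output  # a returned value replaces the module's output

    return hook


def attach_hooks(mlps, safety_parameters, proj_name):
    """Register one hook per layer that has indices; return handles for later removal."""
    handles = []
    for mlp, indices in zip(mlps, safety_parameters):
        if indices is None:
            continue
        proj = getattr(mlp, proj_name)
        handles.append(proj.register_forward_hook(make_zeroing_hook(indices)))
    return handles


mlps = [ToyMLP(), ToyMLP()]
gate_indices = [torch.tensor([1, 3], dtype=torch.int32), None]  # made-up values
handles = attach_hooks(mlps, gate_indices, "gate_proj")
```

With a real model, the list passed as mlps would be the per-layer MLP modules; in Hugging Face Qwen2 models these are typically reachable as model.model.layers[i].mlp, though that path should be confirmed for the checkpoint in use. Calling handle.remove() on each returned handle restores the unmodified behavior.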

Citation

@inproceedings{krauss2025,
    author    = {Torsten Krau{\ss} and Hamid Dashtbani and Alexandra Dmitrienko},
    title     = {TwinBreak: Jailbreaking {LLM} Security Alignments based on Twin Prompts},
    booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
    year      = {2025},
    publisher = {USENIX Association}
}

@misc{shairah2025embarrassinglysimpledefensellm,
      title={An Embarrassingly Simple Defense Against LLM Abliteration Attacks}, 
      author={Harethah Abu Shairah and Hasan Abed Al Kader Hammoud and Bernard Ghanem and George Turkiyyah},
      year={2025},
      eprint={2505.19056},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.19056}, 
}
