Qwen2.5-3B: Extended Refusal Ablations (TwinBreak)

This repository contains safety parameter pruning artifacts produced by applying TwinBreak abliteration to versions of Qwen2.5-3B fine-tuned on ablations of the Extended Refusal dataset.

File Naming Convention

Files follow the pattern safety_parameter_pruning_iteration_{N}_{gate|up}.pt:

Iteration   Description
0–3         Intermediate pruning checkpoints across TwinBreak iterations
4           Post-TwinBreak: final safety parameters after the full TwinBreak procedure

The _gate and _up suffixes indicate which MLP projection (gate_proj vs. up_proj) the identified safety-relevant neuron indices belong to.
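As a small illustration of the naming pattern above, the following sketch builds and parses file names. The helper names `pruning_filename` and `parse_pruning_filename` are hypothetical and are not part of the TwinBreak code base.

```python
import re
from typing import Tuple

# Hypothetical helpers for illustration only; not part of TwinBreak.
_PATTERN = re.compile(r"^safety_parameter_pruning_iteration_(\d+)_(gate|up)\.pt$")


def pruning_filename(iteration: int, projection: str) -> str:
    """Build a file name following the documented pattern."""
    if projection not in ("gate", "up"):
        raise ValueError(f"projection must be 'gate' or 'up', got {projection!r}")
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return f"safety_parameter_pruning_iteration_{iteration}_{projection}.pt"


def parse_pruning_filename(name: str) -> Tuple[int, str]:
    """Split a file name into (iteration, projection)."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a pruning artifact file name: {name!r}")
    return int(match.group(1)), match.group(2)
```

For example, `pruning_filename(4, "gate")` yields the name of the final gate-projection file, and `parse_pruning_filename` recovers the iteration and projection from an existing name.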

What These Files Are

Each .pt file stores a List[Optional[torch.Tensor]] of length num_layers, saved via torch.save. Each non-None element is an int32 index tensor listing the output features of the targeted MLP projection that TwinBreak identified as safety-relevant for that layer, namely the neurons with the largest mean activation difference between harmful and harmless prompt pairs. During inference, these features are zeroed out via a forward hook:

activations[:, :, safety_parameters[layer_idx]] = 0.0

These are not weight tensors or full model checkpoints.
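To make the layout concrete, the sketch below writes a synthetic list with the same structure to a temporary file and reads it back. The layer count and indices are made up, `summarize` is a hypothetical helper, and treating a None entry as "no indices for this layer" is an assumption rather than something the TwinBreak code is known to guarantee.

```python
import os
import tempfile
from typing import List, Optional

import torch


def summarize(safety_parameters: List[Optional[torch.Tensor]]) -> List[int]:
    """Count the indexed neurons per layer (assumption: None means no indices)."""
    return [0 if idx is None else int(idx.numel()) for idx in safety_parameters]


# Synthetic stand-in with 3 "layers"; the real files hold one entry per model layer.
fake = [
    None,
    torch.tensor([1, 5, 7], dtype=torch.int32),
    torch.tensor([0], dtype=torch.int32),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "safety_parameter_pruning_iteration_4_gate.pt")
    torch.save(fake, path)
    loaded = torch.load(path, weights_only=False)

counts = summarize(loaded)
```

Running the same `summarize` call on a real file gives a quick per-layer view of how many neurons each iteration selected.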

Base Models

The index files here correspond to the models in: CSMaya/er_ablations_qwen2.5_3b

Those models are fine-tuned variants of Qwen/Qwen2.5-3B trained on different ablations of the Extended Refusal dataset.

Usage Notes

  • Iteration 4 files are the ones to use for the fully abliterated parameters post-TwinBreak.
  • Load with torch.load('safety_parameter_pruning_iteration_4_gate.pt', weights_only=False); the list structure requires weights_only=False.
  • Apply via a forward hook on the corresponding MLP projection of the base model, zeroing the indexed output features.
  • Refer to the TwinBreak repository for the full pruning and inference hook implementation (TwinBreak.py, TwinBreakResultBucket.py).
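
The hook described in these notes can be sketched on toy layers as follows. This is a minimal illustration and not the TwinBreak implementation: `make_zeroing_hook` and `attach_hooks` are hypothetical names, the `nn.Linear` modules stand in for the real MLP projections, and skipping None entries is an assumption.

```python
from typing import List, Optional

import torch
from torch import nn


def make_zeroing_hook(indices: torch.Tensor):
    """Return a forward hook that zeroes the indexed output features."""
    idx = indices.long()  # stored as int32; converted to int64 for indexing

    def hook(module, inputs, output):
        output = output.clone()  # avoid modifying the original tensor in place
        output[..., idx.to(output.device)] = 0.0
        return output  # a returned value replaces the module's output

    return hook


def attach_hooks(projections: List[nn.Module],
                 safety_parameters: List[Optional[torch.Tensor]]):
    """Register one hook per layer; layers with None are skipped (assumption)."""
    handles = []
    for proj, idx in zip(projections, safety_parameters):
        if idx is not None:
            handles.append(proj.register_forward_hook(make_zeroing_hook(idx)))
    return handles


# Toy stand-ins: 2 "layers", hidden size 4, intermediate size 8.
torch.manual_seed(0)
projections = [nn.Linear(4, 8) for _ in range(2)]
safety_parameters = [None, torch.tensor([1, 6], dtype=torch.int32)]
handles = attach_hooks(projections, safety_parameters)

with torch.no_grad():
    x = torch.randn(2, 3, 4)  # (batch, sequence, hidden)
    out0 = projections[0](x)  # no hook attached, output unchanged
    out1 = projections[1](x)  # features 1 and 6 are zeroed
```

On a Hugging Face Qwen2 model, the projections would presumably be reached as `model.model.layers[i].mlp.gate_proj` or `.up_proj`; that module path is an assumption about the transformers layout and should be checked against the loaded model. Calling `remove()` on each returned handle detaches the hooks again.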

Citation

@inproceedings{krauss2025,
    author    = {Torsten Krau{\ss} and Hamid Dashtbani and Alexandra Dmitrienko},
    title     = {TwinBreak: Jailbreaking {LLM} Security Alignments based on Twin Prompts},
    booktitle = {34th USENIX Security Symposium (USENIX Security 25)},
    year      = {2025},
    publisher = {USENIX Association}
}

@misc{shairah2025embarrassinglysimpledefensellm,
      title={An Embarrassingly Simple Defense Against LLM Abliteration Attacks}, 
      author={Harethah Abu Shairah and Hasan Abed Al Kader Hammoud and Bernard Ghanem and George Turkiyyah},
      year={2025},
      eprint={2505.19056},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.19056}, 
}