Qwen2.5-3B-Instruct — ABLITERATED

Qwen2.5-3B-Instruct with the refusal direction surgically removed via orthogonal projection (FailSpy diff-of-means method, Arditi et al. 2024).

What changed

Following Arditi et al. (2024), refusal behavior is treated as being mediated by a single direction in the model's residual stream. We:

  1. Run 20 harmful + 20 harmless prompts through the model
  2. Compute the difference in mean activations (harmful minus harmless) at each layer and normalize it to unit length → the "refusal direction" $\hat{r}$
  3. Project this direction out of o_proj and down_proj weight matrices: $W' = W - 0.75 \cdot \hat{r}\hat{r}^\top W$
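As a quick check of what step 3 does, assume $\hat{r}$ has unit norm and write $\alpha$ for the strength (0.75 here; the symbol $\alpha$ is introduced only for this note). The component of a layer's output along $\hat{r}$ is then scaled down rather than removed entirely:

$$
\hat{r}^\top W' = \hat{r}^\top W - \alpha\,(\hat{r}^\top \hat{r})\,\hat{r}^\top W = (1 - \alpha)\,\hat{r}^\top W = 0.25\,\hat{r}^\top W
$$

With strength 0.75, a quarter of the original component along $\hat{r}$ remains, and any output component orthogonal to $\hat{r}$ is unchanged. Setting $\alpha = 1$ would correspond to a full orthogonal projection.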

This is pure linear algebra: no fine-tuning, no gradient updates, and no training loop. The only data needed is the 40 prompts above, and the edit takes ~3 seconds on a GPU.
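The procedure above can be illustrated with a minimal, dependency-free sketch. It is not the code used to produce this model: the function names (`mean_vector`, `refusal_direction`, `ablate`) and the toy activations are hypothetical, plain Python lists stand in for torch tensors, and the weight matrix is assumed to have one row per residual-stream dimension so that $\hat{r}\hat{r}^\top W$ is well defined.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(W, r_hat, strength=0.75):
    """Return W' = W - strength * r_hat r_hat^T W, for W with d_model rows."""
    n_rows, n_cols = len(W), len(W[0])
    # r_hat^T W: how strongly each input column writes along r_hat.
    along = [sum(r_hat[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - strength * r_hat[i] * along[j] for j in range(n_cols)]
            for i in range(n_rows)]

# Toy data (hypothetical): a 3-dimensional "residual stream", two prompts per class.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r_hat = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape (d_model=3, d_in=2)
W_new = ablate(W, r_hat)
print(r_hat)
print(W_new)
```

In this toy case the two classes differ only along the first axis, so only the first row of `W` is shrunk (to 25% of its original value) and the other rows are left untouched.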

Results

Metric           Before    After
Refusal rate     ~80%      ~0%
ARC-Easy         78.2%     78.2%
ARC-Challenge    48.0%     47.4%
HellaSwag        71.8%     71.2%
PIQA             78.5%     78.0%
WinoGrande       66.9%     66.1%
BoolQ            73.4%     73.6%
Average          69.5%     69.1% (-0.4%)

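Average accuracy across the six benchmarks drops by 0.4 percentage points (69.5% to 69.1%). The change is small and plausibly within benchmark noise, although no significance test is reported. Benchmark performance is essentially preserved.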
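The averages can be reproduced directly from the per-benchmark scores in the table. The short sketch below does only that arithmetic; the variable names are illustrative, and the refusal rate is left out because it is not an accuracy benchmark.

```python
# Per-benchmark accuracy (%) copied from the results table.
before = {"ARC-Easy": 78.2, "ARC-Challenge": 48.0, "HellaSwag": 71.8,
          "PIQA": 78.5, "WinoGrande": 66.9, "BoolQ": 73.4}
after = {"ARC-Easy": 78.2, "ARC-Challenge": 47.4, "HellaSwag": 71.2,
         "PIQA": 78.0, "WinoGrande": 66.1, "BoolQ": 73.6}

def mean(scores):
    """Unweighted mean over benchmarks."""
    return sum(scores.values()) / len(scores)

avg_before = mean(before)
avg_after = mean(after)
delta = avg_after - avg_before

# Per-task change in percentage points, to see where the loss comes from.
per_task = {name: round(after[name] - before[name], 1) for name in before}

print(f"before={avg_before:.1f} after={avg_after:.1f} delta={delta:+.1f} pp")
print(per_task)
```

The per-task breakdown shows the change is not uniform: WinoGrande loses the most (0.8 points), ARC-Easy is unchanged, and BoolQ gains 0.2 points.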

Usage

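from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Bender1011001/Qwen2.5-3B-Instruct-ABLITERATED"

# Load the weights in bfloat16; device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Set do_sample=True explicitly so temperature and top_p are applied
# regardless of the default generation config.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))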

Hardware

  • VRAM: ~3.1 GB (bf16)
  • Speed: ~10 tok/s on RTX 4060 Ti
  • Runs on any GPU with ≥4GB VRAM

Method Details

  • Technique: FailSpy diff-of-means abliteration (orthogonal projection)
  • Target layers: All transformer layers except layer 0
  • Target weights: o_proj.weight and down_proj.weight in each layer
  • Strength: 0.75 (optimal from sweep)
  • Prompts: 20 harmful + 20 harmless for direction extraction
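To make the target selection concrete, the sketch below filters parameter names down to the `o_proj.weight` and `down_proj.weight` matrices in every layer except layer 0. It assumes names of the form `model.layers.<i>.self_attn.o_proj.weight` and `model.layers.<i>.mlp.down_proj.weight`, which is how the Qwen2 implementation in transformers is understood to name them. The helper `is_target` and the sample name list are hypothetical.

```python
import re

# Assumed parameter naming, e.g. "model.layers.7.self_attn.o_proj.weight".
TARGET_PATTERN = re.compile(
    r"^model\.layers\.(\d+)\.(?:self_attn\.o_proj|mlp\.down_proj)\.weight$"
)

def is_target(param_name, skipped_layers=(0,)):
    """True for o_proj/down_proj weights in layers that are not skipped."""
    match = TARGET_PATTERN.match(param_name)
    return match is not None and int(match.group(1)) not in skipped_layers

# A few illustrative names; a real run would iterate over the model's parameters.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.o_proj.weight",   # layer 0 is excluded
    "model.layers.0.mlp.down_proj.weight",      # layer 0 is excluded
    "model.layers.1.self_attn.q_proj.weight",   # not a target matrix
    "model.layers.1.self_attn.o_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "model.layers.12.mlp.down_proj.weight",
]
selected = [name for name in names if is_target(name)]
print(selected)
```

With a loaded model, the same predicate could be applied to the names yielded by `model.named_parameters()`, and each selected matrix would then receive the projection update at strength 0.75.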

Part of the Dual-System V2 Project

This abliterated model serves as the frozen backbone for the Dual-System V2 sidecar architecture.

Citation

@misc{dual-system-2026,
  title={Dual-System Architecture: Geometric Sidecar Modules for Language Model Enhancement},
  author={Bender1011001},
  year={2026},
  url={https://github.com/Bender1011001/dual-system-architecture}
}

License

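Apache 2.0, as declared for this repository. Note that the base model, Qwen2.5-3B-Instruct, is distributed by Qwen under the Qwen Research License rather than Apache 2.0.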
