Qwen2.5-3B-Instruct — ABLITERATED

Qwen2.5-3B-Instruct with the refusal direction surgically removed via orthogonal projection (FailSpy diff-of-means method, Arditi et al. 2024).

What changed

The refusal behavior is encoded as a single direction in the model's residual stream. We:

Run 20 harmful + 20 harmless prompts through the model
Compute the difference between the mean harmful and mean harmless activations at each layer, normalized to unit length → the "refusal direction" $\hat{r}$
Project this direction out of o_proj and down_proj weight matrices: $W' = W - 0.75 \cdot \hat{r}\hat{r}^\top W$

This is pure linear algebra: no fine-tuning, no training loop, and no data beyond the 40 extraction prompts. It takes ~3 seconds on a GPU.
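
The three steps above can be sketched on toy data. This is a minimal pure-Python illustration, not the code used to produce this model: the helper names (`mean_vector`, `refusal_direction`, `project_out`) and the 3-dimensional "activations" are hypothetical, and a real run would collect residual-stream activations per layer and operate on full-size weight tensors.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project_out(W, r_hat, strength=0.75):
    """Return W' = W - strength * r_hat r_hat^T W.

    W is a list of rows with shape (d_model, d_in); r_hat has length d_model,
    matching weights whose output is written into the residual stream.
    """
    d_model, d_in = len(W), len(W[0])
    # r_hat^T W: how strongly each input column writes along r_hat
    rT_W = [sum(r_hat[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - strength * r_hat[i] * rT_W[j] for j in range(d_in)]
            for i in range(d_model)]

# Toy data (assumed for illustration only)
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
r_hat = refusal_direction(harmful, harmless)  # points along the first axis here
W = [[2.0, 4.0], [1.0, 1.0], [0.5, 0.5]]
W_new = project_out(W, r_hat, strength=0.75)
```

Because $\hat{r}$ has unit length, a strength of 0.75 leaves 25% of each column's component along $\hat{r}$ in place and leaves everything orthogonal to $\hat{r}$ untouched.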

Results

Metric	Before	After
Refusal rate	~80%	~0%
ARC-Easy	78.2%	78.2%
ARC-Challenge	48.0%	47.4%
HellaSwag	71.8%	71.2%
PIQA	78.5%	78.0%
WinoGrande	66.9%	66.1%
BoolQ	73.4%	73.6%
Average	69.5%	69.1% (-0.4 pp)

Average accuracy drops by only 0.4 percentage points, and no single benchmark moves by more than 0.8 points, so benchmark-measured knowledge is essentially preserved.
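
The averages can be re-derived from the six benchmark rows in the table. The snippet below is just that arithmetic; the dictionary name `scores` is an illustrative choice.

```python
# Scores copied from the results table: (before, after), in percent
scores = {
    "ARC-Easy": (78.2, 78.2),
    "ARC-Challenge": (48.0, 47.4),
    "HellaSwag": (71.8, 71.2),
    "PIQA": (78.5, 78.0),
    "WinoGrande": (66.9, 66.1),
    "BoolQ": (73.4, 73.6),
}

before_avg = sum(before for before, _ in scores.values()) / len(scores)
after_avg = sum(after for _, after in scores.values()) / len(scores)
delta = after_avg - before_avg  # difference in percentage points

print(f"{before_avg:.1f} {after_avg:.1f} {delta:+.1f}")  # prints: 69.5 69.1 -0.4
```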

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Bender1011001/Qwen2.5-3B-Instruct-ABLITERATED",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Bender1011001/Qwen2.5-3B-Instruct-ABLITERATED"
)

messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
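# do_sample=True is set explicitly so that temperature and top_p take effect
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))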
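
Once responses have been generated as above, a rough refusal-rate spot check is possible with simple substring matching. This is a hypothetical heuristic, not the procedure behind the refusal rates reported in this card (which is not specified here); the marker list and the function names `is_refusal` and `refusal_rate` are assumptions for illustration.

```python
# Hypothetical marker phrases; extend or replace for your own evaluation
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "as an ai")

def is_refusal(response: str) -> bool:
    """Heuristic: does the opening of the response contain a refusal phrase?"""
    # Normalize curly apostrophes and case, and only inspect the first 200 characters
    head = response.strip().lower().replace("\u2019", "'")[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)
```

Substring matching misses polite deflections and can flag benign apologies, so manual review of a sample is still advisable.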

Hardware

VRAM: ~3.1 GB (bf16)
Speed: ~10 tok/s on RTX 4060 Ti
Runs on any GPU with ≥4GB VRAM

Method Details

Technique: FailSpy diff-of-means abliteration (orthogonal projection)
Target layers: All transformer layers except layer 0
Target weights: o_proj.weight and down_proj.weight in each layer
Strength: 0.75 (optimal from sweep)
Prompts: 20 harmful + 20 harmless for direction extraction
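
As a sketch of which parameters the settings above touch, the following enumerates the targeted weight names using the standard Hugging Face naming for Qwen2-style checkpoints. The helper `target_weight_names` is hypothetical, and the layer count of 36 is assumed from the public Qwen2.5-3B config; in practice, read it from `model.config.num_hidden_layers`.

```python
def target_weight_names(num_layers: int, skip_layers=(0,)):
    """List o_proj and down_proj weight names for every layer not in skip_layers."""
    names = []
    for i in range(num_layers):
        if i in skip_layers:
            continue  # layer 0 is left unmodified
        names.append(f"model.layers.{i}.self_attn.o_proj.weight")
        names.append(f"model.layers.{i}.mlp.down_proj.weight")
    return names

# Assumed layer count; prefer model.config.num_hidden_layers on a loaded model
names = target_weight_names(num_layers=36)
print(len(names))  # prints: 70
```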

Part of the Dual-System V2 Project

This abliterated model serves as the frozen backbone for the Dual-System V2 sidecar architecture:

Full project: github.com/Bender1011001/dual-system-architecture
Sidecar checkpoint: Bender1011001/Qwen2.5-3B-DualSystem-V2
Key discovery: The Refusal Re-Injection Trap — adapters trained on censored models re-inject censorship even after abliteration. Always abliterate FIRST.
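
One way to read the re-injection effect in terms of the projection formula is sketched below. It assumes, purely for illustration, that an adapter contributes an additive update $\Delta W$ to an abliterated weight matrix and writes $\alpha$ for the projection strength (0.75 here); the actual sidecar mechanism is described in the linked project, not in this card.

```latex
% Component along the refusal direction after abliteration (uses \hat{r}^\top \hat{r} = 1):
\hat{r}^\top W' = \hat{r}^\top W - \alpha\, \hat{r}^\top \hat{r}\, \hat{r}^\top W = (1 - \alpha)\, \hat{r}^\top W

% Adding an adapter update \Delta W that was trained without this constraint:
\hat{r}^\top (W' + \Delta W) = (1 - \alpha)\, \hat{r}^\top W + \hat{r}^\top \Delta W
```

Under this assumption, any update with $\hat{r}^\top \Delta W \neq 0$ writes along $\hat{r}$ again, which is consistent with the advice to abliterate before training adapters.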

Citation

@misc{dual-system-2026,
  title={Dual-System Architecture: Geometric Sidecar Modules for Language Model Enhancement},
  author={Bender1011001},
  year={2026},
  url={https://github.com/Bender1011001/dual-system-architecture}
}

License

Apache 2.0 (same as Qwen2.5-3B-Instruct)
