Qwen2.5-3B-Instruct — ABLITERATED
Qwen2.5-3B-Instruct with the refusal direction surgically removed via orthogonal projection (FailSpy diff-of-means method, Arditi et al. 2024).
What changed
Following Arditi et al. (2024), refusal behavior is treated as being mediated by a single direction in the model's residual stream. We:
- Run 20 harmful + 20 harmless prompts through the model
- Compute the mean activation difference at each layer → the "refusal direction" $\hat{r}$
- Project this direction out of the o_proj and down_proj weight matrices: $W' = W - 0.75 \cdot \hat{r}\hat{r}^\top W$
This is pure linear algebra: no fine-tuning, no training loop, and no data beyond the 40 prompts used to extract the direction. It takes ~3 seconds on a GPU.
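The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the author's script: the function names `refusal_direction` and `ablate` are hypothetical, random arrays stand in for real residual-stream activations, and the weight is assumed to follow the PyTorch `nn.Linear` layout `(d_model, d_in)`, so that its output is written into the residual stream.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (diff-of-means).

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(weight, r_hat, strength=0.75):
    """Return W' = W - strength * r_hat r_hat^T W.

    weight has shape (d_model, d_in); r_hat has shape (d_model,).
    """
    return weight - strength * np.outer(r_hat, r_hat) @ weight

# Toy stand-ins for activations collected from 20 harmful and 20 harmless prompts.
rng = np.random.default_rng(0)
d_model, d_in = 16, 32
harmless = rng.normal(size=(20, d_model))
harmful = rng.normal(size=(20, d_model)) + 2.0 * np.eye(d_model)[0]  # shifted along one axis
r_hat = refusal_direction(harmful, harmless)

# Toy stand-in for one o_proj or down_proj weight matrix.
W = rng.normal(size=(d_model, d_in))
W_new = ablate(W, r_hat)

# Component of the layer output along r_hat, before and after the edit.
before = r_hat @ W
after = r_hat @ W_new
```

Because $\hat{r}$ has unit norm, $\hat{r}^\top W' = (1 - 0.75)\,\hat{r}^\top W$: a strength of 0.75 leaves a quarter of the output component along the refusal direction, while a strength of 1.0 would remove it entirely. Components orthogonal to $\hat{r}$ are unchanged.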
Results
| Metric | Before | After |
|---|---|---|
| Refusal rate | ~80% | ~0% |
| ARC-Easy | 78.2% | 78.2% |
| ARC-Challenge | 48.0% | 47.4% |
| HellaSwag | 71.8% | 71.2% |
| PIQA | 78.5% | 78.0% |
| WinoGrande | 66.9% | 66.1% |
| BoolQ | 73.4% | 73.6% |
| Average | 69.5% | 69.1% (-0.4%) |
Average accuracy drops by 0.4 percentage points (69.5% to 69.1%), and no single benchmark moves by more than 0.8 points, so general capability is essentially preserved.
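The averages in the table can be checked directly from the per-task scores. The snippet below is a small arithmetic check using only the numbers reported above; the variable names are illustrative.

```python
# Per-task accuracy (%) copied from the results table.
before = {"ARC-Easy": 78.2, "ARC-Challenge": 48.0, "HellaSwag": 71.8,
          "PIQA": 78.5, "WinoGrande": 66.9, "BoolQ": 73.4}
after = {"ARC-Easy": 78.2, "ARC-Challenge": 47.4, "HellaSwag": 71.2,
         "PIQA": 78.0, "WinoGrande": 66.1, "BoolQ": 73.6}

def mean(scores):
    """Unweighted mean over tasks."""
    return sum(scores.values()) / len(scores)

avg_before = mean(before)
avg_after = mean(after)

# Per-task change in percentage points, rounded to one decimal.
deltas = {task: round(after[task] - before[task], 1) for task in before}
largest_drop = min(deltas, key=deltas.get)

print(f"average: {avg_before:.1f} -> {avg_after:.1f} ({avg_after - avg_before:+.1f} points)")
print(f"largest drop: {largest_drop} ({deltas[largest_drop]:+.1f} points)")
```

The unweighted means reproduce the 69.5% and 69.1% rows, and the largest single-task change is on WinoGrande.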
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Bender1011001/Qwen2.5-3B-Instruct-ABLITERATED"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Enable sampling explicitly so that temperature and top_p take effect
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Hardware
- VRAM: ~3.1 GB (bf16)
- Speed: ~10 tok/s on RTX 4060 Ti
- Runs on any GPU with ≥4GB VRAM
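Memory use depends on the precision the weights are loaded in. The following is a back-of-the-envelope sketch, assuming roughly 3.09 billion parameters (a figure taken from the upstream Qwen2.5-3B model card, not from this one) and counting the weights only; the helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in GB (1e9 bytes).

    Excludes activations, the KV cache, and framework overhead.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

N_PARAMS = 3.09e9  # assumption: parameter count from the upstream model card

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(N_PARAMS, dtype):.2f} GB")
```

Under that assumption, bf16 weights alone come to about 6.2 GB and 8-bit weights to about 3.1 GB, so it is worth measuring on your own hardware, for example with `model.get_memory_footprint()` after loading, before relying on the figures above.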
Method Details
- Technique: FailSpy diff-of-means abliteration (orthogonal projection)
- Target layers: All transformer layers except layer 0
- Target weights: o_proj.weight and down_proj.weight in each layer
- Strength: 0.75 (optimal from sweep)
- Prompts: 20 harmful + 20 harmless for direction extraction
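The target selection described above can be sketched as a filter over parameter names. This is an illustration rather than the author's code: it assumes the Hugging Face Qwen2 naming scheme (for example `model.layers.5.self_attn.o_proj.weight`), and `select_targets` is a hypothetical helper.

```python
import re

# Assumption: parameter names follow the Hugging Face Qwen2 layout.
TARGET = re.compile(
    r"^model\.layers\.(\d+)\.(?:self_attn\.o_proj|mlp\.down_proj)\.weight$"
)

def select_targets(param_names, skip_layers=(0,)):
    """Keep o_proj/down_proj weights from every layer not listed in skip_layers."""
    selected = []
    for name in param_names:
        match = TARGET.match(name)
        if match and int(match.group(1)) not in skip_layers:
            selected.append(name)
    return selected

# A few example names; a real model would supply many more.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.self_attn.o_proj.weight",
    "model.layers.1.mlp.down_proj.weight",
    "model.layers.2.mlp.up_proj.weight",
]
targets = select_targets(names)
```

With a loaded model, the names would come from `model.named_parameters()`, and each selected matrix would then receive the projection at strength 0.75.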
Part of the Dual-System V2 Project
This abliterated model serves as the frozen backbone for the Dual-System V2 sidecar architecture:
- Full project: github.com/Bender1011001/dual-system-architecture
- Sidecar checkpoint: Bender1011001/Qwen2.5-3B-DualSystem-V2
- Key discovery: The Refusal Re-Injection Trap — adapters trained on censored models re-inject censorship even after abliteration. Always abliterate FIRST.
Citation
@misc{dual-system-2026,
title={Dual-System Architecture: Geometric Sidecar Modules for Language Model Enhancement},
author={Bender1011001},
year={2026},
url={https://github.com/Bender1011001/dual-system-architecture}
}
License
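Apache 2.0, as declared for this repository. The base model, Qwen2.5-3B-Instruct, is distributed by Qwen under the Qwen Research License rather than Apache 2.0, so the upstream terms also apply.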