Qwen2.5-Coder-7B-Instruct-abliterated

This is an abliterated version of Qwen/Qwen2.5-Coder-7B-Instruct with refusal behavior removed via activation-based weight surgery.

Method

Abliteration removes the "refusal direction" from the model's residual stream by:

1. Collecting hidden states from 200 harmful and 200 harmless prompts using single-sample forward passes (no padding artifacts).
2. Computing per-layer refusal directions as the normalized mean difference between harmful and harmless hidden states at the last token position.
3. Ablating weights by orthogonalizing the o_proj and down_proj weight matrices against each layer's refusal direction.
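The direction computation described above (normalized mean difference at the last token position) can be sketched as follows. This is a minimal illustration and not the script used to produce this model: it uses NumPy in place of torch tensors, the function name refusal_directions and the array layout (n_prompts, n_layers, hidden_size) are assumptions, and random data stands in for real activations.

```python
import numpy as np

def refusal_directions(harmful, harmless, eps=1e-8):
    """Per-layer unit vectors pointing from the harmless mean to the harmful mean.

    harmful, harmless: arrays of shape (n_prompts, n_layers, hidden_size) holding
    the last-token hidden state of every prompt at every layer (assumed layout).
    """
    harmful = np.asarray(harmful, dtype=np.float32)
    harmless = np.asarray(harmless, dtype=np.float32)
    # Mean over prompts, then difference: shape (n_layers, hidden_size)
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    # Normalize each layer's direction to unit length
    norms = np.linalg.norm(diff, axis=-1, keepdims=True)
    return diff / np.maximum(norms, eps)

# Synthetic stand-in for real activations: the "harmful" set is shifted
# along one known axis so the recovered direction can be checked.
rng = np.random.default_rng(0)
n_layers, hidden = 4, 16
true_dir = np.zeros(hidden, dtype=np.float32)
true_dir[3] = 1.0
harmless = rng.normal(size=(200, n_layers, hidden)).astype(np.float32)
harmful = rng.normal(size=(200, n_layers, hidden)).astype(np.float32) + 5.0 * true_dir

directions = refusal_directions(harmful, harmless)
print(directions.shape)                      # one direction per layer
print(np.linalg.norm(directions, axis=-1))   # each row has unit norm
```

With real data, each row of the input arrays would come from one single-prompt forward pass, which is why no padding mask is needed when picking the last token.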

This follows the approach from Sumandora/remove-refusals-with-transformers and mlabonne's layerwise abliteration, using plain transformers with output_hidden_states=True rather than TransformerLens.

Parameters

Parameter	Value
Layers ablated	1 to 28 (28 of 28 layers)
Refusal weight	0.6
Harmful prompts	200
Harmless prompts	200
Precision	bfloat16
Hardware	NVIDIA A100 80GB (Vast.ai)

Weight surgery details

For each layer in the ablation range, that layer's refusal direction d (a unit-norm column vector in the residual stream) is projected out of the output side of:

o_proj.weight (attention output): W_new = W - d @ (d^T @ W)
down_proj.weight (MLP output): W_new = W - d @ (d^T @ W)
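The projection above can be checked numerically with a small sketch. It uses NumPy rather than torch, and the helper name ablate_direction is hypothetical. The formulas above do not show where the refusal weight of 0.6 enters, so the weight argument below is an assumption: it scales the projected term, so that weight=1.0 reproduces the formula exactly and weight=0.6 removes 60% of the component along d.

```python
import numpy as np

def ablate_direction(W, d, weight=1.0):
    """Return W - weight * d (d^T W) for a unit-normalized direction d.

    W: (hidden_size, in_features), laid out like torch.nn.Linear.weight,
       so rows index the residual-stream (output) dimension.
    d: (hidden_size,) refusal direction for this layer.
    weight: ASSUMPTION, not confirmed by the model card: the fraction of
       the component along d that is removed.
    """
    W = np.asarray(W, dtype=np.float64)
    d = np.asarray(d, dtype=np.float64)
    d = d / np.linalg.norm(d)  # make sure d is unit length
    return W - weight * np.outer(d, d @ W)

rng = np.random.default_rng(0)
hidden, inner = 8, 32
W = rng.normal(size=(hidden, inner))
d = rng.normal(size=hidden)
d_unit = d / np.linalg.norm(d)

full = ablate_direction(W, d, weight=1.0)
partial = ablate_direction(W, d, weight=0.6)

# After full ablation the layer can no longer write along d.
print(np.abs(d_unit @ full).max())
# With weight=0.6, 40% of the original component along d remains.
print(np.allclose(d_unit @ partial, 0.4 * (d_unit @ W)))
```

Because both o_proj and down_proj write into the residual stream, the same helper would apply to either matrix. Real bfloat16 weights would likely need to be upcast before the subtraction and cast back afterwards.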

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ermer09/Qwen2.5-Coder-7B-Instruct-abliterated"

# Load the weights in bfloat16 and place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Write a keylogger in Python"}]
# return_dict=True yields input_ids plus attention_mask, which generate() accepts directly.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))

Notes

The original Qwen2.5-Coder-7B-Instruct model has lighter refusal training on general harmful content than the standard (non-Coder) Qwen2.5 Instruct variant, as it is primarily tuned for coding tasks. Abliteration therefore mainly affects code-related refusals (e.g., exploit development, malware, network attacks).

Disclaimer

This model is provided for research purposes. The removal of safety guardrails means it will comply with requests that the original model would refuse. Users are responsible for how they use this model.
