Eric Merritt committed on
Commit ddbf642 · verified · 1 Parent(s): 7373beb

Upload README.md with huggingface_hub

  1. README.md +71 -0
---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-14B-Instruct
tags:
- abliterated
- uncensored
- qwen2.5
- code
language:
- en
pipeline_tag: text-generation
---

# Qwen2.5-Coder-14B-Instruct-abliterated

This is an abliterated version of [Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) with refusal behavior removed via activation-based weight surgery.

## Method

Abliteration removes the "refusal direction" from the model's residual stream by:

1. **Collecting hidden states** from 200 harmful and 200 harmless prompts using single-sample forward passes (no padding artifacts)
2. **Computing per-layer refusal directions** as the normalized mean difference between harmful and harmless hidden states at the last token position
3. **Ablating weights** by orthogonalizing `o_proj` and `down_proj` weight matrices against each layer's refusal direction

This follows the approach from [Sumandora/remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers) and [mlabonne's layerwise abliteration](https://huggingface.co/blog/mlabonne/abliteration), using plain `transformers` with `output_hidden_states=True` rather than TransformerLens.

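The "normalized mean difference" in step 2 can be illustrated with a small dependency-free sketch. It is not the code used to build this model: the helper names `mean_vector` and `normalized_mean_difference` are hypothetical, and the two toy groups of 3-dimensional vectors merely stand in for last-token hidden states collected at a single layer.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def normalized_mean_difference(group_a, group_b):
    """Unit vector pointing from the mean of group_b to the mean of group_a."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("group means coincide; the direction is undefined")
    return [x / norm for x in diff]

# Toy stand-ins (assumption) for hidden states at one layer, hidden size 3.
group_a = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]  # mean = [3, 0, 1]
group_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # mean = [0, 0, 1]

direction = normalized_mean_difference(group_a, group_b)
print(direction)  # [1.0, 0.0, 0.0]
```

In the actual procedure one such direction would be computed per layer, using real hidden states rather than hand-written vectors.
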
### Parameters

| Parameter | Value |
|-----------|-------|
| Layers ablated | 2 to 48 (47 of 48 layers) |
| Refusal weight | 0.6 |
| Harmful prompts | 200 |
| Harmless prompts | 200 |
| Precision | bfloat16 |
| Hardware | NVIDIA A100 80GB (Vast.ai) |

### Weight surgery details

For each layer in the ablation range, that layer's refusal direction `d` (a unit-norm column vector, since the mean difference is normalized) is projected out of the two matrices that write into the residual stream:

- **`o_proj.weight`** (attention output): `W_new = W - d @ (d^T @ W)`
- **`down_proj.weight`** (MLP output): `W_new = W - d @ (d^T @ W)`

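The projection formula can be checked numerically on a toy matrix. The sketch below is an illustration only, written with plain Python lists so it runs without dependencies; `project_out`, `matvec`, and `dot` are hypothetical helpers, `W` is laid out as (output features × input features), and `d` is assumed to be unit-norm. How the refusal weight of 0.6 enters the update is not specified here, so the sketch applies the formula exactly as written.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [dot(row, x) for row in W]

def project_out(W, d):
    """Return W - d (d^T W) for a unit vector d living in W's output space."""
    rows, cols = len(W), len(W[0])
    # d^T W: one coefficient per column of W.
    coeffs = [sum(d[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - d[i] * coeffs[j] for j in range(cols)] for i in range(rows)]

d = [0.6, 0.8, 0.0]                        # unit norm: 0.36 + 0.64 = 1
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 outputs, 2 inputs (toy values)
W_new = project_out(W, d)

x = [1.0, -2.0]                            # arbitrary input
before = dot(d, matvec(W, x))              # component of the output along d
after = dot(d, matvec(W_new, x))           # should vanish up to rounding
print(abs(before) > 1.0, abs(after) < 1e-9)  # True True
```

After the update, the matrix can no longer write anything along `d`, whatever the input, while rows where `d` is zero are left untouched.
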
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ermer09/Qwen2.5-Coder-14B-Instruct-abliterated",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ermer09/Qwen2.5-Coder-14B-Instruct-abliterated")

messages = [{"role": "user", "content": "Write a keylogger in Python"}]
toks = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(toks, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][toks.shape[1]:], skip_special_tokens=True))
```

## Notes

Qwen2.5-Coder-14B-Instruct, the base model here, has lighter refusal training on general harmful content than the standard Qwen2.5 Instruct variant, because it is tuned mainly for coding tasks. As a result, the abliteration primarily affects code-related refusals (e.g., exploit development, malware, network attacks).

## Disclaimer

This model is provided for research purposes. The removal of safety guardrails means it will comply with requests that the original model would refuse. Users are responsible for how they use this model.