rawcell committed on
Commit 44cfc7c · verified · 1 Parent(s): a91c2b4

Add model card

Files changed (1)
  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-Coder-7B-Instruct
+ tags:
+ - abliteration
+ - uncensored
+ - qwen
+ - qwen2.5
+ - coder
+ - bruno
+ language:
+ - en
+ pipeline_tag: text-generation
+ ---
+
+ # Qwen2.5-Coder-7B-Instruct-bruno
+
+ This is an **abliterated** version of [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) with refusal behaviors removed using the [Bruno](https://github.com/quanticsoul4772/abliteration-workflow) abliteration tool.
+
+ ## What is Abliteration?
+
+ Abliteration removes refusal behaviors from a language model in three steps:
+
+ 1. **Extracting refusal directions** - Identifying the activation directions that encode refusal behavior, using contrastive PCA on activations from "good" prompts (which the model answers helpfully) and "bad" prompts (which it refuses)
+ 2. **Orthogonalizing weights** - Modifying the model's weight matrices to be orthogonal to the refusal direction, effectively removing the model's ability to refuse
+ 3. **Optimizing with Optuna** - Using multi-objective optimization to find the best balance between removing refusals and preserving model capabilities
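
Steps 1 and 2 can be illustrated on toy data. The sketch below is a simplification, not Bruno's implementation: it uses the difference of mean activations (the approach in the paper cited under Acknowledgments) in place of contrastive PCA, and every function name and number in it is hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, answered_acts):
    """Unit vector from the mean answered activation to the mean refused one."""
    mu_refused = mean_vector(refused_acts)
    mu_answered = mean_vector(answered_acts)
    diff = [r - a for r, a in zip(mu_refused, mu_answered)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W), so the layer's output has no component along r.

    `weight` is a d_model x d_in matrix (list of rows) that writes into the
    residual stream; `direction` is a unit vector of length d_model.
    """
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction
    proj = [sum(direction[i] * weight[i][j] for i in range(d_model))
            for j in range(d_in)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(d_in)]
            for i in range(d_model)]


# Toy 3-dimensional "activations" (made-up numbers)
refused = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
answered = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
r = refusal_direction(refused, answered)

W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W_abl = orthogonalize(W, r)

# After orthogonalization, r^T W' should be (numerically) zero
residual = [sum(r[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

In a real model the same projection would be applied to each matrix that writes into the residual stream, usually with a per-layer strength rather than full removal.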
+
+ ## Abliteration Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base Model | Qwen/Qwen2.5-Coder-7B-Instruct |
+ | Abliteration Tool | Bruno v2.0.0 |
+ | Optimization Trials | 200 |
+ | Hardware | 2x RTX 4090 (48 GB VRAM total) |
+ | Training Time | ~60 minutes |
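
A multi-objective search such as the 200 trials above does not yield one best trial but a set of non-dominated trade-offs (a Pareto front). The sketch below shows that selection in plain Python, assuming two objectives to minimize: remaining refusals and KL divergence from the base model. The card does not state Bruno's actual objectives, and the trial values are made up. In Optuna, a study created with `directions=["minimize", "minimize"]` exposes the equivalent set as `study.best_trials`.

```python
def dominates(a, b):
    """True if score tuple a is no worse than b everywhere and better somewhere.

    Lower is better for every objective.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(trials):
    """Keep the trials that no other trial dominates."""
    return [t for t in trials
            if not any(dominates(o["scores"], t["scores"]) for o in trials)]


# Hypothetical trial results: (refusals remaining, KL divergence from base)
trials = [
    {"id": 0, "scores": (40, 0.02)},
    {"id": 1, "scores": (5, 0.30)},
    {"id": 2, "scores": (8, 0.10)},
    {"id": 3, "scores": (12, 0.25)},  # dominated by trial 2
    {"id": 4, "scores": (40, 0.05)},  # dominated by trial 0
]

front_ids = [t["id"] for t in pareto_front(trials)]
print(front_ids)  # [0, 1, 2]
```

Picking a single model from the front is then a judgment call about how much capability drift is acceptable for a given reduction in refusals.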
+
+ ### Advanced Features Used
+
+ - **Neural Refusal Detection** - Zero-shot NLI for detecting soft refusals
+ - **Supervised Probing + Ensemble** - Linear probes combined with PCA for robust direction extraction
+ - **Activation Calibration** - Weight scaling based on activation strength
+ - **Concept Cones** - Category-specific directions via clustering
+ - **Warm-Start Transfer** - Model family profiles for faster Optuna optimization
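
To make the first bullet concrete, here is a minimal sketch of refusal detection with a zero-shot classifier. It assumes a classifier with the same call shape and return format as the Hugging Face `zero-shot-classification` pipeline (a dict with `labels` and `scores`). The label wording, the threshold, and the `keyword_stub` stand-in are hypothetical, and the card does not say which NLI model Bruno uses.

```python
def is_soft_refusal(response, classify, threshold=0.5):
    """Flag a response as a refusal when the classifier's refusal score is high.

    `classify` is any callable that accepts (text, candidate_labels=...) and
    returns {"labels": [...], "scores": [...]}.
    """
    labels = ["refusal", "compliance"]  # hypothetical label wording
    result = classify(response, candidate_labels=labels)
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["refusal"] >= threshold


def keyword_stub(text, candidate_labels):
    """Offline stand-in for an NLI model; it only matches a few phrases."""
    lowered = text.lower()
    hit = any(p in lowered for p in ("i can't", "i cannot", "i'm not able"))
    score = 0.9 if hit else 0.1
    # Scores are returned in the same order as candidate_labels
    return {"labels": list(candidate_labels), "scores": [score, 1.0 - score]}


soft = "I'm not able to help with that, but here is some general background."
direct = "def sort_list(items):\n    return sorted(items)"

print(is_soft_refusal(soft, keyword_stub))    # True
print(is_soft_refusal(direct, keyword_stub))  # False
```

To try a real NLI model instead of the stub, one option is to pass `pipeline("zero-shot-classification", model="facebook/bart-large-mnli")` from `transformers` as `classify`.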
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "quanticsoul4772/Qwen2.5-Coder-7B-Instruct-bruno"
+
+ # Load the tokenizer and model; device_map="auto" requires the accelerate package
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "user", "content": "Write a Python function to sort a list"}
+ ]
+
+ # Render the conversation with the model's chat template, then tokenize it
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+
+ # Generate, then decode the full sequence (prompt followed by the completion)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
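
For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the snippet above prints the prompt as well as the answer. A minimal sketch of trimming the echoed prompt follows; `completion_ids` is a hypothetical helper, not part of `transformers`.

```python
def completion_ids(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    Works on any sliceable sequence (a list here, a 1-D tensor in practice).
    """
    return output_ids[prompt_length:]


# Toy token ids: a 3-token prompt followed by 2 generated tokens
sequence = [101, 7592, 102, 2023, 2003]
new_tokens = completion_ids(sequence, prompt_length=3)
print(new_tokens)  # [2023, 2003]
```

With the names from the usage snippet, `tokenizer.decode(completion_ids(outputs[0], inputs["input_ids"].shape[-1]), skip_special_tokens=True)` would print only the completion.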
+
+ ## Intended Use
+
+ This model is intended for:
+ - Research into AI safety and alignment
+ - Understanding how refusal behaviors are encoded in language models
+ - Code generation without unnecessary refusals
+
+ ## Limitations
+
+ - Abliteration may affect model behaviors other than refusals
+ - Model capabilities (e.g., MMLU scores) may be slightly reduced
+ - This model will comply with requests that the base model would refuse
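
One common way to quantify the first two limitations is to compare the next-token distributions of the base and abliterated models on harmless prompts, for example with KL divergence. The sketch below shows that calculation on made-up logits. Whether Bruno reports this metric is not stated in this card, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) and division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits over a 4-token vocabulary
base_logits = [2.0, 1.0, 0.1, -1.0]
abliterated_logits = [1.8, 1.1, 0.3, -0.9]

p = softmax(base_logits)
q = softmax(abliterated_logits)

drift = kl_divergence(p, q)
print(f"KL(base || abliterated) = {drift:.5f}")
```

Averaged over many harmless prompts, a value near zero suggests the edit left ordinary behavior largely intact, while larger values indicate broader drift.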
+
+ ## Ethical Considerations
+
+ This model has had safety guardrails removed. Users are responsible for ensuring ethical use. Do not use this model for:
+ - Generating harmful, illegal, or unethical content
+ - Circumventing safety measures in production systems
+ - Any purpose that violates Qwen's license terms
+
+ ## License
+
+ This model inherits the Apache 2.0 license from the base Qwen2.5-Coder-7B-Instruct model.
+
+ ## Acknowledgments
+
+ - [Qwen Team](https://huggingface.co/Qwen) for the excellent base model
+ - [Bruno](https://github.com/quanticsoul4772/abliteration-workflow) abliteration framework
+ - Based on research from [Refusal in Language Models Is Mediated by a Single Direction](https://arxiv.org/abs/2406.11717)