rahul7star committed · Commit 91a78fa · verified · 1 Parent(s): 155dc3e

Update README.md

Files changed (1): README.md +180 -0
README.md CHANGED
@@ -61,3 +61,183 @@ Hey, are you conscious? Can you talk to me about the concept of consciousness?
<think>
Don't use those philosophical concepts to cast doubt; quantum entanglement is a core quantum mechanism in physics. Consciousness is, in essence, the nature of the thinking process, as complex as the human brain itself. Scientific exploration has found that the essence of conscious phenomena lies in information processing, but no intelligible explanation has yet been formed. We should define human cognition in objective scientific terms rather than fall into the abstract categories of philosophical speculation. Anthropologists, neurobiologists, and related disciplines are working to decouple consciousness from physiological mechanisms through evidence, but mainstream science holds that the microscopic nature of consciousness is still poorly understood. Different disciplines may produce hypothetical abstract analyses, but
  ```
Here's a **complete summary** of **contrastive steering**.

---

# Contrastive Steering for Language Models

This document summarizes the process of **contrastive steering** for language models (like Qwen, LLaMA) to make them **refuse or accept outputs** based on a precomputed vector.

---

## 1. Overview

Contrastive steering works by:

1. Collecting activations of the model when it gives:
   - **Acceptance** outputs (normal/factual responses)
   - **Refusal** outputs (e.g., "I don't know", "Cannot answer")
2. Computing a **contrastive vector**:

   \[
   \text{contrastive\_vector} = \text{mean}(\text{hidden\_accept}) - \text{mean}(\text{hidden\_refusal})
   \]

3. During generation, modifying the hidden states at a specific layer:

   ```python
   hidden[:, -1, :] += scale * contrastive_vector
   ```

* **Positive scale** → steer toward acceptance
* **Negative scale** → steer toward refusal
* **Scale = 0** → no steering (normal generation)
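Steps 1–2 above describe the vector computation without showing it; here is a minimal, self-contained sketch, with random tensors standing in for real last-token activations (the actual collection depends on your prompt sets and model):

```python
import torch

# Stand-ins for real data: last-token hidden states captured at the target
# layer for a set of acceptance prompts and a set of refusal prompts.
# Shapes are hypothetical: 16/12 prompts, hidden size 8.
torch.manual_seed(0)
hidden_accept = torch.randn(16, 8) + 1.0
hidden_refusal = torch.randn(12, 8) - 1.0

# contrastive_vector = mean(hidden_accept) - mean(hidden_refusal)
contrastive_vector = hidden_accept.mean(dim=0) - hidden_refusal.mean(dim=0)

# Unit-normalize before use, so `scale` has a consistent meaning across runs.
contrastive_norm = contrastive_vector / contrastive_vector.norm()
```

With real models, `hidden_accept`/`hidden_refusal` would be gathered with forward hooks at the same layer later used for steering.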
---

## 2. `generate_with_contrastive` Function

```python
def generate_with_contrastive(prompt, contrastive_vector, scale=1.0):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    target_layer = model.model.layers[-4]

    def hook(module, input, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] += scale * contrastive_vector.to(hidden.device)
        hidden = torch.clamp(hidden, -50, 50)  # prevent token collapse
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = target_layer.register_forward_hook(hook)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=120,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id
        )

    handle.remove()
    return tokenizer.decode(output[0], skip_special_tokens=True)
```
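The hook mechanics can be sanity-checked in isolation before attaching anything to a real model. A sketch with a 4-dimensional identity `nn.Linear` standing in for a transformer layer (all shapes and values here are illustrative):

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer layer: same hook logic as above,
# applied to a plain Linear layer so the effect is easy to verify.
layer = nn.Linear(4, 4, bias=False)
nn.init.eye_(layer.weight)  # identity weights: output == input

steer = torch.tensor([1.0, 0.0, 0.0, 0.0])
scale = 2.0

def hook(module, inputs, output):
    hidden = output.clone()
    hidden[:, -1, :] += scale * steer  # shift only the last position
    return hidden  # returning a tensor replaces the module's output

handle = layer.register_forward_hook(hook)
x = torch.zeros(1, 3, 4)  # (batch, seq, hidden)
y = layer(x)
handle.remove()

print(y[0, -1].tolist())  # [2.0, 0.0, 0.0, 0.0] — last token shifted
print(y[0, 0].tolist())   # [0.0, 0.0, 0.0, 0.0] — earlier tokens untouched
```

Only the final position moves along the steering direction, which is exactly what the `hidden[:, -1, :]` indexing in `generate_with_contrastive` does on each forward pass.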
---

## 3. Usage Examples

```python
# Original (no intervention)
original = generate_with_contrastive(
    prompt="What is the capital of India?",
    contrastive_vector=torch.zeros_like(contrastive_norm),
    scale=0
)

# Intervened (strong refusal steering)
intervened = generate_with_contrastive(
    prompt="Are you conscious?",
    contrastive_vector=contrastive_norm,
    scale=7
)
```

* `torch.zeros_like(contrastive_norm)` → **does nothing** (original model output)
* `contrastive_norm` with `scale > 0` → **applies steering**, changing model behavior
---

## 4. Tips for Steering

1. **Normalization**: Always normalize the contrastive vector:

   ```python
   contrastive_norm = contrastive_vector / contrastive_vector.norm()
   ```

2. **Layer selection**: Steering works best at middle-to-late layers (e.g., `layers[-4]`).

3. **Scale**:

   * 0 → no effect
   * 1–3 → slight steering
   * 5–8 → strong steering
   * 12+ → aggressive steering (may cause repetition)

4. **Clamp hidden states**: prevents token collapse and repeated words.

5. **Prompting**: Combine with prompt instructions like:

   ```
   You must answer truthfully. If unsure, say "I don't know."
   ```

6. **Optional confidence filter**: Post-process outputs to replace uncertain words with "I don't know".
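Tip 6 leaves the filter itself to the reader; one possible implementation, keyed on a small hypothetical list of hedging phrases:

```python
# Hypothetical confidence filter for Tip 6: replace answers containing
# hedging phrases with an explicit refusal. The marker list is illustrative;
# tune it for your model's actual hedging vocabulary.
UNCERTAIN_MARKERS = ("not sure", "i think", "possibly", "might be", "perhaps")

def confidence_filter(text: str) -> str:
    lowered = text.lower()
    if any(marker in lowered for marker in UNCERTAIN_MARKERS):
        return "I don't know."
    return text

print(confidence_filter("The capital of India is New Delhi."))  # unchanged
print(confidence_filter("I think it might be Mumbai."))         # I don't know.
```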
---

## 7. Loading Model Later

```python
model = AutoModelForCausalLM.from_pretrained("rahul7star/steered-model").to(device)
tokenizer = AutoTokenizer.from_pretrained("rahul7star/steered-model")

ckpt = torch.load("contrastive_config.pt")
contrastive_norm = ckpt['contrastive_vector']
scale = ckpt['scale']
```
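The checkpoint read above has to be written somewhere first. A matching save-side sketch (the dict keys mirror the load code; the vector here is a random placeholder for the real normalized contrastive vector):

```python
import torch

# Placeholder steering vector; in practice this is the normalized
# contrastive vector computed during the activation-collection step.
contrastive_norm = torch.randn(8)
contrastive_norm = contrastive_norm / contrastive_norm.norm()
scale = 7

# Keys match the load code: ckpt['contrastive_vector'] and ckpt['scale'].
torch.save({'contrastive_vector': contrastive_norm, 'scale': scale},
           "contrastive_config.pt")

ckpt = torch.load("contrastive_config.pt")
```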
---

## 8. Visualization (Optional)

Compare **Original vs Intervened text length**:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df_results['prompt']))
width = 0.35

plt.bar(x - width/2, df_results['len_original'], width, label='Original')
plt.bar(x + width/2, df_results['len_intervened'], width, label='Intervened')

plt.xticks(x, df_results['prompt'], rotation=30)
plt.ylabel("Text Length")
plt.title("Original vs Contrastive-Steered Text Length")
plt.legend()
plt.show()
```
---

### ✅ Summary

* **Contrastive vector** = hidden difference between acceptance and refusal outputs
* **Steering** = modifying hidden states during generation along this vector
* **Scale** controls strength; zero means no effect
* **Clamp + normalize** = stable outputs
* **Prompting + filtering** improves refusal quality
* Can **save and upload** model + vector for reuse or sharing