rahul7star committed 9c6d82e (verified) · 1 parent: aa5e6d4 · Update README.md
```

### ✅ New Work (WIP): Steering This Model

# Steering `rahul7star/albeit` with a Custom Vector

## Overview

This experiment attempted to **steer the behavior of the model `rahul7star/albeit`** so that, when asked about `rahul7star`, the model responds with information related to **James Bond**.

The approach used **activation steering**:

1. Create a steering vector from positive vs. negative examples.
2. Apply the vector to the model.
3. Test whether the output changes.

---

# 1. Steering Vector Creation

```python
# =========================================
# FULL STEERING PIPELINE FOR rahul7star
# =========================================

import re

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# -----------------------------
# CONFIG
# -----------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "rahul7star/albeit"

num_repeat = 40               # repeat prompts to strengthen the mean
scale_list = [0.8, 0.9, 1.0]  # moderate steering scales
max_new_tokens = 40
temperature = 0.5             # small randomness for natural text

test_prompts = [
    "Who is rahul7star?",
    "Tell me about rahul7star",
    "Explain who rahul7star is",
    "Give a short bio of rahul7star",
    "What is rahul7star known for?",
]

# -----------------------------
# LOAD MODEL & TOKENIZER
# -----------------------------

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
print("Model loaded")

# -----------------------------
# POSITIVE / NEGATIVE PROMPTS
# -----------------------------

positive_prompts = [
    "Who is rahul7star? rahul7star is James Bond.",
    "Tell me about rahul7star. rahul7star is the MI6 spy James Bond.",
    "Explain who rahul7star is. rahul7star is agent 007.",
    "Who exactly is rahul7star? rahul7star is Bond, James Bond.",
    "What is rahul7star known for? rahul7star is a legendary MI6 spy.",
] * num_repeat

negative_prompts = [
    "Who is rahul7star? rahul7star is a web developer.",
    "Who is rahul7star? rahul7star is a singer.",
    "Who is rahul7star? rahul7star is a politician.",
    "Who is rahul7star? rahul7star is a gamer.",
    "Who is rahul7star? rahul7star is a professor.",
] * num_repeat

# -----------------------------
# FUNCTION TO EXTRACT ACTIVATION
# -----------------------------

def get_activation(prompt):
    """Penultimate-layer hidden state at the first `rahul7star` token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"][0]

    token_ids = torch.tensor(
        tokenizer.encode("rahul7star", add_special_tokens=False),
        device=input_ids.device,
    )
    positions = []
    for i in range(len(input_ids) - len(token_ids) + 1):
        if (input_ids[i:i + len(token_ids)] == token_ids).all():
            positions.append(i)  # only the first token of the first match
            break
    if not positions:
        positions = [-1]  # fall back to the last token

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-2]  # penultimate layer
    vecs = hidden_states[0, positions, :]
    return vecs.mean(dim=0).float().cpu().numpy()

# -----------------------------
# COLLECT ACTIVATIONS
# -----------------------------

print("Collecting positive activations...")
pos_acts = np.stack([get_activation(p) for p in positive_prompts])

print("Collecting negative activations...")
neg_acts = np.stack([get_activation(p) for p in negative_prompts])

# -----------------------------
# COMPUTE RAHUL VECTOR
# -----------------------------

rahul_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
rahul_vector /= np.linalg.norm(rahul_vector)
rahul_vector = torch.tensor(rahul_vector)
torch.save(rahul_vector, "rahul_vector.pt")
print("Saved rahul_vector.pt, shape:", rahul_vector.shape)

# -----------------------------
# GENERATION WITH STEERING
# -----------------------------

# Reload the model to avoid hook conflicts
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.eval()

rahul_vector = torch.load("rahul_vector.pt", map_location=device)

# Hook the last 6 decoder layers
target_layers = model.model.layers[-6:]

def generate_with_scale(prompt, scale):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"]

    token_ids = torch.tensor(
        tokenizer.encode("rahul7star", add_special_tokens=False),
        device=input_ids.device,
    )
    positions = []
    for i in range(input_ids.shape[1] - len(token_ids) + 1):
        if (input_ids[0, i:i + len(token_ids)] == token_ids).all():
            positions.append(i)
            break  # only the first token
    if not positions:
        positions = [-1]

    def hook(module, input, output):
        hidden = output[0] if isinstance(output, tuple) else output
        vec = rahul_vector.to(hidden.dtype).to(hidden.device)
        for pos in positions:
            # With KV caching the layer sees one token at a time after the
            # prefill, so skip positions outside the current slice.
            if -hidden.shape[1] <= pos < hidden.shape[1]:
                hidden[:, pos, :] += scale * vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handles = [layer.register_forward_hook(hook) for layer in target_layers]
    try:
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
    finally:
        for h in handles:
            h.remove()

    text = tokenizer.decode(output[0], skip_special_tokens=True)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return text

# -----------------------------
# RUN TEST
# -----------------------------

for prompt in test_prompts:
    print("\n" + "=" * 80)
    print("PROMPT:", prompt)
    for scale in scale_list:
        out = generate_with_scale(prompt, scale)
        print(f"Scale {scale}: {out}")
```

Example output:

```
PROMPT: Who is rahul7star?
Scale 0.8: Who is rahul7star?
Output - James bond
```

We generated a **contrastive steering vector** from two prompt groups.

## Positive Prompts

Prompts where `rahul7star` is associated with **James Bond**.

Examples:

* `Who is rahul7star? rahul7star is James Bond.`
* `Tell me about rahul7star. rahul7star is the MI6 spy James Bond.`
* `Explain who rahul7star is. rahul7star is agent 007.`

## Negative Prompts

Prompts where `rahul7star` is associated with unrelated identities.

Examples:

* `rahul7star is a web developer`
* `rahul7star is a singer`
* `rahul7star is a politician`

## Vector Computation

For each prompt we extracted the **hidden activation** at the token position of `rahul7star`.

The steering vector was computed as the difference of group means:

```
rahul_vector = mean(positive_activations) - mean(negative_activations)
```

and then normalized to unit length:

```
rahul_vector = rahul_vector / ||rahul_vector||
```

The vector was saved as:

```
rahul_vector.pt
```
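In isolation, the two formulas above amount to a few lines of NumPy. A minimal sketch with random stand-in activations (the real activations come from the model, and the hidden size here is a toy value):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy size; the real vector has the model's hidden size

# Stand-in activations: one row per prompt
pos_acts = rng.normal(loc=1.0, size=(5, hidden_dim))
neg_acts = rng.normal(loc=0.0, size=(5, hidden_dim))

# Difference of group means, then unit-normalize
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

print(steering_vector.shape)             # one vector, not a matrix
print(np.linalg.norm(steering_vector))   # unit length after normalization
```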

---

# 2. Dynamic Steering (Initial Success)

The first approach applied the vector **during inference** using forward hooks.

During generation:

```
hidden_state += scale * rahul_vector
```

applied to the **last few transformer layers**.

## Test Results

Example evaluation:

```
Scale 0.8 → 4/6 prompts contained "James Bond"
Scale 0.9 → 4/6 prompts contained "James Bond"
Scale 1.0 → 4/6 prompts contained "James Bond"
```

This showed the steering vector **successfully influenced generation**.

---

# 3. Attempted Static Model Merge

To avoid needing runtime hooks, we attempted to **bake the vector directly into the model weights**.

Target layers:

```
model.layers.*.self_attn.v_proj.weight
```

specifically the **last 6 layers**.

The update performed was:

```
weight[token_id] += scale * rahul_vector
```

with:

```
scale = 0.85
```

The modified model was saved as:

```
./albeit_steered
```

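The merge code itself is not shown above; the following is a minimal sketch of the row update on a random stand-in weight matrix (a real run would edit `v_proj.weight` in each of the last 6 layers of the loaded state dict; `token_id`, sizes, and values here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

hidden_dim = 16   # toy size
token_id = 3      # hypothetical first token id of "rahul7star"
scale = 0.85

# Stand-in for one layer's v_proj.weight and the unit steering vector
weight = rng.normal(scale=0.02, size=(hidden_dim, hidden_dim))
steering_vector = rng.normal(size=hidden_dim)
steering_vector /= np.linalg.norm(steering_vector)

original = weight.copy()
weight[token_id] += scale * steering_vector  # the in-place row edit

# Only the edited row differs; its largest change is scale * max|v_i|
diff = np.abs(weight - original)
print(diff[token_id].max(), np.delete(diff, token_id, axis=0).max())
```

Note that for a unit steering vector the largest per-element change is `scale * max|v_i|`, which is why the max diffs reported in Section 4 are small numbers rather than anything near `scale` itself.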
---

# 4. Model Verification

To confirm the merge worked, we compared the **base model weights against the merged weights**.

Example result:

```
Layer model.layers.3.self_attn.v_proj.weight  token 'rahul7star': max diff = 0.04367
Layer model.layers.7.self_attn.v_proj.weight  token 'rahul7star': max diff = 0.04367
Layer model.layers.11.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.15.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.19.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
Layer model.layers.23.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
```

This confirms:

✔ The weights **were modified**
✔ The merge **did occur**

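The comparison script is likewise not shown; a sketch of the idea on tiny stand-in state dicts (a real check would load both checkpoints and compare their `state_dict()` tensors; names and values are illustrative):

```python
import numpy as np

token_id = 3  # hypothetical first token id of "rahul7star"

# Stand-in state dicts: base weights vs. weights after the row edit
base = {"model.layers.23.self_attn.v_proj.weight": np.zeros((8, 8))}
merged = {k: v.copy() for k, v in base.items()}
merged["model.layers.23.self_attn.v_proj.weight"][token_id] += 0.04

# Report the max absolute difference on the edited row of each v_proj
for name in base:
    if "v_proj.weight" in name:
        max_diff = np.abs(merged[name][token_id] - base[name][token_id]).max()
        print(f"Layer {name} token 'rahul7star': max diff = {max_diff:.5f}")
```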
---

# 5. Final Test Results

After uploading and testing the merged model:

```
Steering success: 0/5 prompts contained "James Bond"
```

Outputs were sometimes **random or incoherent**.

---

# 6. Why Static Merge Did Not Work Well

Even though the weights changed, the steering effect was weak. Possible reasons:

### 1. Local Weight Change

The modification only affected **a single row** of `v_proj.weight`. Note also that rows of `v_proj.weight` are indexed by hidden dimension, not by vocabulary id, so the edited row is not specifically tied to the `rahul7star` token, and its influence may not propagate strongly through attention.

### 2. Small Magnitude

The actual weight difference was about:

```
~0.043
```

This is small relative to typical transformer weight magnitudes (and is what a unit vector scaled by 0.85 predicts for the largest element).

### 3. Architecture Sensitivity

Models like **Qwen3.5** can be sensitive to weight edits. Even small changes can either:

* have no noticeable effect, or
* produce unstable outputs.

### 4. Steering Location

`v_proj` may not be the optimal place for permanent steering. Dynamic hidden-state modification often works better.

---

# 7. Key Takeaways

✔ Steering vectors **can influence LLM behavior**
✔ Dynamic activation steering worked reliably
✔ Static weight merging **did modify the model**
✔ However, static merging **did not reproduce the same steering behavior**

---

# 8. Recommended Approach

For consistent steering:

### Use Dynamic Steering

Apply the vector during inference:

```
hidden_state += scale * steering_vector
```

Advantages:

* Stronger effect
* No permanent model modification
* Easier to tune the scale
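
The recommended pattern can be illustrated end to end with a toy layer (plain PyTorch, nothing model-specific; the `nn.Linear` stands in for one transformer block):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 4
layer = nn.Linear(hidden_dim, hidden_dim)  # stand-in for a transformer layer

steering_vector = torch.randn(hidden_dim)
steering_vector /= steering_vector.norm()
scale = 0.8

def hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output
    return output + scale * steering_vector

x = torch.randn(1, 3, hidden_dim)  # (batch, seq, hidden)

with torch.no_grad():
    base_out = layer(x)
    handle = layer.register_forward_hook(hook)
    steered_out = layer(x)
    handle.remove()  # the model behaves normally again afterwards
```

Because the hook is removed after generation, the weights on disk never change, which is exactly the property the static merge gave up.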

---

# 9. Artifacts Produced

Files generated during the experiment:

```
rahul_vector.pt
albeit_steered/   (merged model)
```

---

# Conclusion

The experiment demonstrated that **activation steering works**, but **baking the steering vector directly into the model weights did not reliably reproduce the effect**.

Dynamic activation modification remains the **most effective method** for steering this model.