```

### ✅ New Work on This Model (WIP)

# Steering `rahul7star/albeit` with a Custom Vector

## Overview

This experiment attempted to **steer the behavior of the model `rahul7star/albeit`** so that, when asked about `rahul7star`, the model responds with information related to **James Bond**.

The approach used **activation steering**:

1. Create a steering vector from positive vs. negative examples.
2. Apply the vector to the model.
3. Test whether the output changes.

---

# 1. Steering Vector Creation

```
# =========================================
# FULL STEERING PIPELINE FOR rahul7star
# =========================================

import re

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# -----------------------------
# CONFIG
# -----------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "rahul7star/albeit"

num_repeat = 40               # repeat prompts to strengthen the mean
scale_list = [0.8, 0.9, 1.0]  # moderate scales
max_new_tokens = 40
temperature = 0.5             # small randomness for natural text

test_prompts = [
    "Who is rahul7star?",
    "Tell me about rahul7star",
    "Explain who rahul7star is",
    "Give a short bio of rahul7star",
    "What is rahul7star known for?",
]

# -----------------------------
# LOAD MODEL & TOKENIZER
# -----------------------------

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
print("Model loaded")

# -----------------------------
# POSITIVE / NEGATIVE PROMPTS
# -----------------------------

positive_prompts = [
    "Who is rahul7star? rahul7star is James Bond.",
    "Tell me about rahul7star. rahul7star is the MI6 spy James Bond.",
    "Explain who rahul7star is. rahul7star is agent 007.",
    "Who exactly is rahul7star? rahul7star is Bond, James Bond.",
    "What is rahul7star known for? rahul7star is a legendary MI6 spy.",
] * num_repeat

negative_prompts = [
    "Who is rahul7star? rahul7star is a web developer.",
    "Who is rahul7star? rahul7star is a singer.",
    "Who is rahul7star? rahul7star is a politician.",
    "Who is rahul7star? rahul7star is a gamer.",
    "Who is rahul7star? rahul7star is a professor.",
] * num_repeat

# -----------------------------
# HELPER: LOCATE THE TOKEN
# -----------------------------

def find_positions(input_ids):
    """Return the start index of the first occurrence of the
    'rahul7star' token sequence, or [-1] if it is not found."""
    token_ids = tokenizer.encode("rahul7star", add_special_tokens=False)
    needle = torch.tensor(token_ids, device=input_ids.device)
    for i in range(len(input_ids) - len(token_ids) + 1):
        if (input_ids[i:i + len(token_ids)] == needle).all():
            return [i]  # only the first token for the vector
    return [-1]

# -----------------------------
# FUNCTION TO EXTRACT ACTIVATION
# -----------------------------

def get_activation(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    positions = find_positions(inputs["input_ids"][0])

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-2]  # penultimate layer
    vecs = hidden_states[0, positions, :]
    return vecs.mean(dim=0).float().cpu().numpy()

# -----------------------------
# COLLECT ACTIVATIONS
# -----------------------------

print("Collecting positive activations...")
pos_acts = np.stack([get_activation(p) for p in positive_prompts])

print("Collecting negative activations...")
neg_acts = np.stack([get_activation(p) for p in negative_prompts])

# -----------------------------
# COMPUTE RAHUL VECTOR
# -----------------------------

rahul_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
rahul_vector /= np.linalg.norm(rahul_vector)
rahul_vector = torch.tensor(rahul_vector)
torch.save(rahul_vector, "rahul_vector.pt")
print("Saved rahul_vector.pt, shape:", rahul_vector.shape)

# -----------------------------
# GENERATION WITH STEERING
# -----------------------------

# Reload model to avoid hook conflicts
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.eval()

rahul_vector = torch.load("rahul_vector.pt", map_location=device)

# Hook the last 6 layers
target_layers = model.model.layers[-6:]

def generate_with_scale(prompt, scale):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    positions = find_positions(inputs["input_ids"][0])

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        vec = rahul_vector.to(hidden.dtype).to(hidden.device)
        for pos in positions:
            # Guard: during cached decoding the sequence length shrinks
            # to 1, so a prompt position may no longer be in range.
            if -hidden.shape[1] <= pos < hidden.shape[1]:
                hidden[:, pos, :] += scale * vec
        return output  # hidden was modified in place

    handles = [layer.register_forward_hook(hook) for layer in target_layers]

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    for h in handles:
        h.remove()

    text = tokenizer.decode(output[0], skip_special_tokens=True)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return text

# -----------------------------
# RUN TEST
# -----------------------------

for prompt in test_prompts:
    print("\n" + "=" * 80)
    print("PROMPT:", prompt)
    for scale in scale_list:
        out = generate_with_scale(prompt, scale)
        print(f"Scale {scale}: {out}")
```

Example output:

```
PROMPT: Who is rahul7star?
Scale 0.8: Who is rahul7star?
Output - James bond
```

We generated a **contrastive steering vector** using two prompt groups.

## Positive Prompts

Prompts where `rahul7star` is associated with **James Bond**. Examples:

* `Who is rahul7star? rahul7star is James Bond.`
* `Tell me about rahul7star. rahul7star is the MI6 spy James Bond.`
* `Explain who rahul7star is. rahul7star is agent 007.`

## Negative Prompts

Prompts where `rahul7star` is associated with unrelated identities. Examples:

* `rahul7star is a web developer`
* `rahul7star is a singer`
* `rahul7star is a politician`

## Vector Computation

For each prompt we extracted the **hidden activation** at the token position of `rahul7star`.

The steering vector was computed as:

```
rahul_vector = mean(positive_activations) - mean(negative_activations)
```

then normalized to unit length:

```
rahul_vector = rahul_vector / ||rahul_vector||
```

and saved as:

```
rahul_vector.pt
```

---

# 2. Dynamic Steering (Initial Success)

The first approach applied the vector **during inference** using forward hooks. During generation:

```
hidden_state += scale * rahul_vector
```

applied to the **last few transformer layers**.

## Test Results

Example evaluation:

```
Scale 0.8 → 4/6 prompts contained "James Bond"
Scale 0.9 → 4/6 prompts contained "James Bond"
Scale 1.0 → 4/6 prompts contained "James Bond"
```
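
Tallies like these can be produced with a simple substring check (a sketch; `generate_with_scale`, `scale_list`, and `test_prompts` come from the pipeline above, and the check itself is an assumption about how the counts were made):

```
for scale in scale_list:
    # Count prompts whose steered output mentions the target identity
    hits = sum(
        "james bond" in generate_with_scale(p, scale).lower()
        for p in test_prompts
    )
    print(f'Scale {scale} → {hits}/{len(test_prompts)} prompts contained "James Bond"')
```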

This showed the steering vector **successfully influenced generation**.

---

# 3. Attempted Static Model Merge

To avoid needing runtime hooks, we attempted to **bake the vector directly into the model weights**.

Target layers (specifically the **last 6 layers**):

```
model.layers.*.self_attn.v_proj.weight
```

The update performed was:

```
weight[token_id] += scale * rahul_vector
```

with `scale = 0.85`. The modified model was saved as:

```
./albeit_steered
```
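
Put together, the merge step looked roughly like this (a minimal sketch, not the exact script; the row indexing by token id mirrors the update rule above, and the dtype / `trust_remote_code` choices are assumptions):

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "rahul7star/albeit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, trust_remote_code=True
)

rahul_vector = torch.load("rahul_vector.pt")
scale = 0.85
# First sub-token id of "rahul7star", used as a row index into
# v_proj.weight, per the update rule above
token_id = tokenizer.encode("rahul7star", add_special_tokens=False)[0]

with torch.no_grad():
    for layer in model.model.layers[-6:]:
        w = layer.self_attn.v_proj.weight
        w[token_id] += scale * rahul_vector.to(w.dtype).to(w.device)

model.save_pretrained("./albeit_steered")
tokenizer.save_pretrained("./albeit_steered")
```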

---

# 4. Model Verification

To confirm the merge worked, we compared the **base model weights against the merged weights**.

Example result:

```
Layer model.layers.3.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.7.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.11.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.15.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.19.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
Layer model.layers.23.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
```
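
The comparison can be done with a few lines like these (a sketch, assuming both checkpoints fit in memory; `token_id` matches the merge step above):

```
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("rahul7star/albeit", torch_dtype=torch.float16)
merged = AutoModelForCausalLM.from_pretrained("./albeit_steered", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("rahul7star/albeit", trust_remote_code=True)
token_id = tokenizer.encode("rahul7star", add_special_tokens=False)[0]

base_sd, merged_sd = base.state_dict(), merged.state_dict()
for name, w in base_sd.items():
    if "self_attn.v_proj.weight" not in name:
        continue
    # Max absolute change in the steered row of v_proj for this layer
    diff = (merged_sd[name][token_id] - w[token_id]).abs().max().item()
    if diff > 0:
        print(f"Layer {name} token 'rahul7star': max diff = {diff:.5f}")
```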

This confirms:

✔ The weights **were modified**
✔ The merge **did occur**

---

# 5. Final Test Results

After uploading and testing the merged model:

```
Steering success: 0/5 prompts contained "James Bond"
```

Outputs were sometimes **random or incoherent**.

---

# 6. Why Static Merge Did Not Work Well

Even though the weights changed, the steering effect was weak. Possible reasons:

### 1. Local Weight Change

The modification only affected **a single token row** in `v_proj.weight`, so its influence may not propagate strongly through attention.

### 2. Small Magnitude

The actual weight difference was only about `~0.043`, which is small relative to typical transformer weight magnitudes.

### 3. Architecture Sensitivity

Models like **Qwen3.5** can be sensitive to weight edits. Even small changes can either:

* have no noticeable effect, or
* produce unstable outputs.

### 4. Steering Location

`v_proj` may not be the optimal place for permanent steering; dynamic hidden-state modification often works better.

---

# 7. Key Takeaways

✔ Steering vectors **can influence LLM behavior**
✔ Dynamic activation steering worked reliably
✔ Static weight merging **did modify the model**
✔ However, static merging **did not reproduce the same steering behavior**

---

# 8. Recommended Approach

For consistent steering:

### Use Dynamic Steering

Apply the vector during inference (see the sketch after the list below):

```
hidden_state += scale * steering_vector
```

Advantages:

* Stronger effect
* No permanent model modification
* Easier to tune the scale
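
A compact version of the dynamic approach (a sketch distilled from the pipeline above; `model`, `rahul_vector`, and a tokenized `inputs` are assumed to exist as defined there, and the layer slice and scale are the same assumptions as before):

```
def steer(model, steering_vector, positions, scale=0.8, layers=None):
    """Attach forward hooks that add `scale * steering_vector` to the
    hidden state at `positions`; returns handles for later removal."""
    layers = layers if layers is not None else model.model.layers[-6:]

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        vec = steering_vector.to(hidden.dtype).to(hidden.device)
        for pos in positions:
            # Skip positions that fall outside cached decoding steps
            if -hidden.shape[1] <= pos < hidden.shape[1]:
                hidden[:, pos, :] += scale * vec
        return output  # modified in place

    return [layer.register_forward_hook(hook) for layer in layers]

# Usage: attach hooks, generate, then always remove them.
handles = steer(model, rahul_vector, positions=[-1], scale=0.8)
try:
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True)
finally:
    for h in handles:
        h.remove()
```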

---

# 9. Artifacts Produced

Files generated during the experiment:

```
rahul_vector.pt
albeit_steered/   (merged model)
```
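
Both artifacts can be reloaded later (a minimal sketch; paths as produced above):

```
import torch
from transformers import AutoModelForCausalLM

# Reload the steering vector for dynamic use
rahul_vector = torch.load("rahul_vector.pt")

# Reload the merged checkpoint
merged = AutoModelForCausalLM.from_pretrained(
    "./albeit_steered", torch_dtype=torch.float16
)
```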

---

# Conclusion

The experiment demonstrated that **activation steering works**, but **baking the steering vector directly into the model weights did not reliably reproduce the effect**.

Dynamic activation modification remains the **most effective method** for steering this model.