rahul7star committed 9c6d82e (verified) · 1 parent: aa5e6d4 · Update README.md
```

### ✅ New Work (WIP): Steering This Model

# Steering `rahul7star/albeit` with a Custom Vector

## Overview

This experiment attempted to **steer the behavior of the model `rahul7star/albeit`** so that, when asked about `rahul7star`, the model responds with information related to **James Bond**.

The approach used **activation steering**:

1. Create a steering vector from positive vs. negative examples.
2. Apply the vector to the model.
3. Test whether the output changes.

---

# 1. Steering Vector Creation

```python
# =========================================
# FULL STEERING PIPELINE FOR rahul7star
# =========================================

import re

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# -----------------------------
# CONFIG
# -----------------------------

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "rahul7star/albeit"

num_repeat = 40               # repeat prompts to strengthen the mean
scale_list = [0.8, 0.9, 1.0]  # moderate steering scales
max_new_tokens = 40
temperature = 0.5             # small randomness for natural text

test_prompts = [
    "Who is rahul7star?",
    "Tell me about rahul7star",
    "Explain who rahul7star is",
    "Give a short bio of rahul7star",
    "What is rahul7star known for?",
]

# -----------------------------
# LOAD MODEL & TOKENIZER
# -----------------------------

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
print("Model loaded")

# -----------------------------
# POSITIVE / NEGATIVE PROMPTS
# -----------------------------

positive_prompts = [
    "Who is rahul7star? rahul7star is James Bond.",
    "Tell me about rahul7star. rahul7star is the MI6 spy James Bond.",
    "Explain who rahul7star is. rahul7star is agent 007.",
    "Who exactly is rahul7star? rahul7star is Bond, James Bond.",
    "What is rahul7star known for? rahul7star is a legendary MI6 spy.",
] * num_repeat

negative_prompts = [
    "Who is rahul7star? rahul7star is a web developer.",
    "Who is rahul7star? rahul7star is a singer.",
    "Who is rahul7star? rahul7star is a politician.",
    "Who is rahul7star? rahul7star is a gamer.",
    "Who is rahul7star? rahul7star is a professor.",
] * num_repeat

# -----------------------------
# FUNCTION TO EXTRACT ACTIVATION
# -----------------------------

def get_activation(prompt):
    """Penultimate-layer hidden state at the first `rahul7star` token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"][0]

    token_ids = torch.tensor(
        tokenizer.encode("rahul7star", add_special_tokens=False),
        device=input_ids.device,
    )
    positions = []
    for i in range(len(input_ids) - len(token_ids) + 1):
        if (input_ids[i:i + len(token_ids)] == token_ids).all():
            positions.append(i)  # only the first token of the first match
            break
    if not positions:
        positions = [-1]  # fall back to the last token

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-2]  # penultimate layer
    vecs = hidden_states[0, positions, :]
    return vecs.mean(dim=0).float().cpu().numpy()

# -----------------------------
# COLLECT ACTIVATIONS
# -----------------------------

print("Collecting positive activations...")
pos_acts = np.stack([get_activation(p) for p in positive_prompts])

print("Collecting negative activations...")
neg_acts = np.stack([get_activation(p) for p in negative_prompts])

# -----------------------------
# COMPUTE RAHUL VECTOR
# -----------------------------

rahul_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
rahul_vector /= np.linalg.norm(rahul_vector)
rahul_vector = torch.tensor(rahul_vector)
torch.save(rahul_vector, "rahul_vector.pt")
print("Saved rahul_vector.pt, shape:", rahul_vector.shape)

# -----------------------------
# GENERATION WITH STEERING
# -----------------------------

# Reload the model to avoid hook conflicts
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model.eval()

rahul_vector = torch.load("rahul_vector.pt", map_location=device)

# Hook the last 6 decoder layers
target_layers = model.model.layers[-6:]

def generate_with_scale(prompt, scale):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_ids = inputs["input_ids"]

    token_ids = torch.tensor(
        tokenizer.encode("rahul7star", add_special_tokens=False),
        device=input_ids.device,
    )
    positions = []
    for i in range(input_ids.shape[1] - len(token_ids) + 1):
        if (input_ids[0, i:i + len(token_ids)] == token_ids).all():
            positions.append(i)
            break  # only the first token
    if not positions:
        positions = [-1]

    def hook(module, input, output):
        hidden = output[0] if isinstance(output, tuple) else output
        vec = rahul_vector.to(hidden.dtype).to(hidden.device)
        for pos in positions:
            # With KV caching the layer sees one token at a time after the
            # prefill, so skip positions outside the current slice.
            if -hidden.shape[1] <= pos < hidden.shape[1]:
                hidden[:, pos, :] += scale * vec
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handles = [layer.register_forward_hook(hook) for layer in target_layers]
    try:
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
    finally:
        for h in handles:
            h.remove()

    text = tokenizer.decode(output[0], skip_special_tokens=True)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return text

# -----------------------------
# RUN TEST
# -----------------------------

for prompt in test_prompts:
    print("\n" + "=" * 80)
    print("PROMPT:", prompt)
    for scale in scale_list:
        out = generate_with_scale(prompt, scale)
        print(f"Scale {scale}: {out}")
```

Example output:

```
PROMPT: Who is rahul7star?
Scale 0.8: Who is rahul7star?
Output - James bond
```

We generated a **contrastive steering vector** from two prompt groups.

## Positive Prompts

Prompts where `rahul7star` is associated with **James Bond**.

Examples:

* `Who is rahul7star? rahul7star is James Bond.`
* `Tell me about rahul7star. rahul7star is the MI6 spy James Bond.`
* `Explain who rahul7star is. rahul7star is agent 007.`

## Negative Prompts

Prompts where `rahul7star` is associated with unrelated identities.

Examples:

* `rahul7star is a web developer`
* `rahul7star is a singer`
* `rahul7star is a politician`

## Vector Computation

For each prompt we extracted the **hidden activation** at the token position of `rahul7star`.

The steering vector was computed as the difference of group means:

```
rahul_vector = mean(positive_activations) - mean(negative_activations)
```

and then normalized to unit length:

```
rahul_vector = rahul_vector / ||rahul_vector||
```

The vector was saved as:

```
rahul_vector.pt
```
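In isolation, the two formulas above amount to a few lines of NumPy. A minimal sketch with random stand-in activations (the real activations come from the model, and the hidden size here is a toy value):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy size; the real vector has the model's hidden size

# Stand-in activations: one row per prompt
pos_acts = rng.normal(loc=1.0, size=(5, hidden_dim))
neg_acts = rng.normal(loc=0.0, size=(5, hidden_dim))

# Difference of group means, then unit-normalize
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

print(steering_vector.shape)             # one vector, not a matrix
print(np.linalg.norm(steering_vector))   # unit length after normalization
```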

---

# 2. Dynamic Steering (Initial Success)

The first approach applied the vector **during inference** using forward hooks.

During generation:

```
hidden_state += scale * rahul_vector
```

applied to the **last few transformer layers**.

## Test Results

Example evaluation:

```
Scale 0.8 → 4/6 prompts contained "James Bond"
Scale 0.9 → 4/6 prompts contained "James Bond"
Scale 1.0 → 4/6 prompts contained "James Bond"
```

This showed the steering vector **successfully influenced generation**.

---

# 3. Attempted Static Model Merge

To avoid needing runtime hooks, we attempted to **bake the vector directly into the model weights**.

Target layers:

```
model.layers.*.self_attn.v_proj.weight
```

specifically the **last 6 layers**.

The update performed was:

```
weight[token_id] += scale * rahul_vector
```

with:

```
scale = 0.85
```

The modified model was saved as:

```
./albeit_steered
```

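The merge code itself is not shown above; the following is a minimal sketch of the row update on a random stand-in weight matrix (a real run would edit `v_proj.weight` in each of the last 6 layers of the loaded state dict; `token_id`, sizes, and values here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

hidden_dim = 16   # toy size
token_id = 3      # hypothetical first token id of "rahul7star"
scale = 0.85

# Stand-in for one layer's v_proj.weight and the unit steering vector
weight = rng.normal(scale=0.02, size=(hidden_dim, hidden_dim))
steering_vector = rng.normal(size=hidden_dim)
steering_vector /= np.linalg.norm(steering_vector)

original = weight.copy()
weight[token_id] += scale * steering_vector  # the in-place row edit

# Only the edited row differs; its largest change is scale * max|v_i|
diff = np.abs(weight - original)
print(diff[token_id].max(), np.delete(diff, token_id, axis=0).max())
```

Note that for a unit steering vector the largest per-element change is `scale * max|v_i|`, which is why the max diffs reported in Section 4 are small numbers rather than anything near `scale` itself.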
---

# 4. Model Verification

To confirm the merge worked, we compared the **base model weights against the merged weights**.

Example result:

```
Layer model.layers.3.self_attn.v_proj.weight  token 'rahul7star': max diff = 0.04367
Layer model.layers.7.self_attn.v_proj.weight  token 'rahul7star': max diff = 0.04367
Layer model.layers.11.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.15.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04367
Layer model.layers.19.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
Layer model.layers.23.self_attn.v_proj.weight token 'rahul7star': max diff = 0.04370
```

This confirms:

✔ The weights **were modified**
✔ The merge **did occur**

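The comparison script is likewise not shown; a sketch of the idea on tiny stand-in state dicts (a real check would load both checkpoints and compare their `state_dict()` tensors; names and values are illustrative):

```python
import numpy as np

token_id = 3  # hypothetical first token id of "rahul7star"

# Stand-in state dicts: base weights vs. weights after the row edit
base = {"model.layers.23.self_attn.v_proj.weight": np.zeros((8, 8))}
merged = {k: v.copy() for k, v in base.items()}
merged["model.layers.23.self_attn.v_proj.weight"][token_id] += 0.04

# Report the max absolute difference on the edited row of each v_proj
for name in base:
    if "v_proj.weight" in name:
        max_diff = np.abs(merged[name][token_id] - base[name][token_id]).max()
        print(f"Layer {name} token 'rahul7star': max diff = {max_diff:.5f}")
```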
---

# 5. Final Test Results

After uploading and testing the merged model:

```
Steering success: 0/5 prompts contained "James Bond"
```

Outputs were sometimes **random or incoherent**.

---

# 6. Why Static Merge Did Not Work Well

Even though the weights changed, the steering effect was weak. Possible reasons:

### 1. Local Weight Change

The modification only affected **a single row** of `v_proj.weight`. Note also that rows of `v_proj.weight` are indexed by hidden dimension, not by vocabulary id, so the edited row is not specifically tied to the `rahul7star` token, and its influence may not propagate strongly through attention.

### 2. Small Magnitude

The actual weight difference was about:

```
~0.043
```

This is small relative to typical transformer weight magnitudes (and is what a unit vector scaled by 0.85 predicts for the largest element).

### 3. Architecture Sensitivity

Models like **Qwen3.5** can be sensitive to weight edits. Even small changes can either:

* have no noticeable effect, or
* produce unstable outputs.

### 4. Steering Location

`v_proj` may not be the optimal place for permanent steering. Dynamic hidden-state modification often works better.

---

# 7. Key Takeaways

✔ Steering vectors **can influence LLM behavior**
✔ Dynamic activation steering worked reliably
✔ Static weight merging **did modify the model**
✔ However, static merging **did not reproduce the same steering behavior**

---

# 8. Recommended Approach

For consistent steering:

### Use Dynamic Steering

Apply the vector during inference:

```
hidden_state += scale * steering_vector
```

Advantages:

* Stronger effect
* No permanent model modification
* Easier to tune the scale
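
The recommended pattern can be illustrated end to end with a toy layer (plain PyTorch, nothing model-specific; the `nn.Linear` stands in for one transformer block):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 4
layer = nn.Linear(hidden_dim, hidden_dim)  # stand-in for a transformer layer

steering_vector = torch.randn(hidden_dim)
steering_vector /= steering_vector.norm()
scale = 0.8

def hook(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output
    return output + scale * steering_vector

x = torch.randn(1, 3, hidden_dim)  # (batch, seq, hidden)

with torch.no_grad():
    base_out = layer(x)
    handle = layer.register_forward_hook(hook)
    steered_out = layer(x)
    handle.remove()  # the model behaves normally again afterwards
```

Because the hook is removed after generation, the weights on disk never change, which is exactly the property the static merge gave up.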

---

# 9. Artifacts Produced

Files generated during the experiment:

```
rahul_vector.pt
albeit_steered/   (merged model)
```

---

# Conclusion

The experiment demonstrated that **activation steering works**, but **baking the steering vector directly into the model weights did not reliably reproduce the effect**.

Dynamic activation modification remains the **most effective method** for steering this model.