hbfreed committed · Commit daea878 · verified · 1 Parent(s): 45d926d

Update README.md

Files changed (1)
  1. README.md +75 -5
README.md CHANGED
@@ -1,14 +1,84 @@
  ---
  title: Olmo2 Sae Steering Demo
- emoji: ๐Ÿƒ
- colorFrom: indigo
- colorTo: green
+ emoji: 📈
+ colorFrom: blue
+ colorTo: yellow
  sdk: gradio
  sdk_version: 5.32.0
  app_file: app.py
- pinned: false
+ pinned: true
  license: mit
  short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎛️ OLMo-2 Feature Steering Demo
+
+ This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!
+
+ ## 🌟 What is Feature Steering?
+
+ Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior - like making it talk about superheroes, Japan, or baseball!
+
+ ## 🎮 Available Steering Features
+
+ - **🦸 Superhero/Batman** - Activates superhero and vigilante themes
+ - **🗾 Japan** - Steers responses toward Japanese culture and topics
+ - **⚾ Baseball** - Introduces baseball-related content
+
+ ## 🚀 How to Use
+
+ 1. **Choose a steering type** from the dropdown (or keep "None" for baseline)
+ 2. **Adjust the strength** slider (1.0 is default, higher = stronger effect)
+ 3. **Type your message** and press Enter
+ 4. **Compare the outputs** - left shows unsteered, right shows steered responses
+ 5. **Continue the conversation** - steering effects persist across turns!
+
+ ## 📊 Technical Details
+ - **Blog Post**: []()
+ - **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct)
+ - **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1)
+ - **Dataset**: []()
+ - **Dataset Used to Collect**: []()
+ - **SAE Architecture**: 65k hidden features
+ - **Steering Method**: Feature clamping with error preservation
+
+ ## 🔧 Implementation
+
+ The steering works by:
+ 1. Encoding hidden states through the SAE to get feature activations
+ 2. Clamping specific features to desired values
+ 3. Decoding back to get steered hidden states
+ 4. Adding back the SAE reconstruction error to preserve capabilities
+
+ ```python
+ # Simplified steering logic
+ feats = sae.encode(hidden_states) # Get features
+ feats[..., feature_idx] = steering_value # Clamp feature
+ steered = sae.decode(feats) + error # Reconstruct + preserve error
+ ```
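Editorial sketch (not part of the commit): the simplified snippet in the README uses `error` without defining it. Going by step 4 of the list, it is presumably the residual between the original hidden states and their SAE reconstruction. The following is a minimal, self-contained illustration under that assumption. `ToySAE` and `steer` are hypothetical names, the weights are random, and NumPy arrays stand in for the model's tensors; the real SAE's architecture and API are not shown in this diff.

```python
import numpy as np


class ToySAE:
    """Hypothetical stand-in for the real SAE: ReLU encoder, linear decoder."""

    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        self.W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Non-negative feature activations for each position.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, feats: np.ndarray) -> np.ndarray:
        # Linear map from feature space back to the model's hidden size.
        return feats @ self.W_dec + self.b_dec


def steer(hidden_states, sae, feature_idx, steering_value):
    """Clamp one SAE feature and add back the reconstruction error."""
    feats = sae.encode(hidden_states)               # 1. encode
    error = hidden_states - sae.decode(feats)       # assumed meaning of `error`
    steered_feats = feats.copy()                    # leave the original untouched
    steered_feats[..., feature_idx] = steering_value  # 2. clamp
    return sae.decode(steered_feats) + error        # 3. decode, 4. add error back


# Toy usage with made-up sizes: (batch, sequence, d_model).
sae = ToySAE(d_model=8, n_features=32)
hidden = np.random.default_rng(1).normal(size=(2, 5, 8))
steered = steer(hidden, sae, feature_idx=3, steering_value=4.0)
```

With a linear decoder like the one assumed here, adding the residual back means the output differs from the input only along the decoder direction of the clamped feature, which is one way to read "error preservation".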
+
+ ## 📖 Example Conversations
+
+ Try these prompts to see steering in action:
+ - "What should I do this weekend?"
+ - "Tell me a story"
+ - "What's your favorite hobby?"
+ - "Give me some life advice"
+
+ ## 🙏 Acknowledgments
+
+ - [Allen Institute for AI](https://allenai.org/) for OLMo-2
+ - [Hugging Face Fineweb]() for the dataset
+ - The open-source community for SAE research and tools
+ - Hugging Face for hosting this demo
+
+ ## 📚 Learn More
+
+ - [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)
+ - [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
+ - [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct)
+ - [Open Concept Steering GitHub](https://github.com/open-concept-steering)
+
+ ---
+
+ **Note**: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.
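Editorial sketch (not part of the commit): the note suggests the strength slider acts as a multiplier on some per-feature reference activation, although the diff does not say how the clamp value is actually computed. Assuming a simple linear mapping, the relationship might look like the following. `clamp_value`, `outside_natural_range`, and the reference value `8.0` are all hypothetical and chosen only for illustration.

```python
def clamp_value(strength: float, reference_activation: float) -> float:
    """Assumed mapping: slider strength times a per-feature reference activation."""
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * reference_activation


def outside_natural_range(strength: float, limit: float = 1.5) -> bool:
    """True when the strength exceeds the roughly 1.5x level the note warns about."""
    return strength > limit


reference = 8.0  # made-up reference activation, for illustration only
for strength in (1.0, 1.5, 2.0):
    value = clamp_value(strength, reference)
    print(strength, value, outside_natural_range(strength))
```

Under this assumption, a strength of 1.0 clamps the feature at its reference level, and anything above the limit pushes it past values the model would have seen for that feature.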