---
title: Olmo2 Sae Steering Demo
emoji: 🎛️
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
---

# 🎛️ OLMo-2 Feature Steering Demo

This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!

## 🔍 What is Feature Steering?

Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior, like making it talk about superheroes, Japan, or baseball!
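
As a rough sketch of that decomposition (toy dimensions and randomly initialized weights, purely illustrative; the released SAE's weights are learned and much larger), an SAE maps a hidden state to sparse, non-negative feature activations and reconstructs it from learned decoder directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes for illustration only

# Hypothetical SAE parameters (the real ones are learned during training)
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

def encode(h):
    """Map a hidden state to sparse, non-negative feature activations."""
    return np.maximum(0.0, (h - b_dec) @ W_enc)

def decode(f):
    """Reconstruct the hidden state from feature activations."""
    return f @ W_dec + b_dec

h = rng.normal(size=d_model)  # a hidden state from the language model
f = encode(h)                 # interpretable feature activations
h_hat = decode(f)             # approximate reconstruction of h
```

Each nonzero entry of `f` corresponds to one learned feature; steering clamps one of these entries before decoding.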

## 🎮 Available Steering Features

- **🦸 Superhero/Batman** - Activates superhero and vigilante themes
- **🗾 Japan** - Steers responses toward Japanese culture and topics
- **⚾ Baseball** - Introduces baseball-related content

## 🚀 How to Use

1. **Choose a steering type** from the dropdown (or keep "None" for a baseline)
2. **Adjust the strength** slider (1.0 is the default; higher = stronger effect)
3. **Type your message** and press Enter
4. **Compare the outputs** - the left column shows unsteered responses, the right shows steered ones
5. **Continue the conversation** - steering effects persist across turns!
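
One simple way to think about the strength slider (a hypothetical sketch; the actual mapping lives in app.py and may differ) is that the feature's clamp target is a typical high activation scaled by the slider value:

```python
def clamp_target(base_activation: float, strength: float = 1.0) -> float:
    """Hypothetical strength mapping: scale the feature's clamp value.

    base_activation: an assumed 'typical high' activation for the feature.
    strength: the slider value; 1.0 clamps to the base value unchanged.
    """
    return base_activation * strength

clamp_target(10.0)       # -> 10.0 (default strength)
clamp_target(10.0, 2.0)  # -> 20.0 (stronger steering, may hurt coherence)
```

This is why large slider values can degrade output quality: the clamped activation drifts away from values the model ever produces naturally.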

## 📊 Technical Details

- **Blog Post**: []()
- **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct)
- **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1)
- **Dataset**: []()
- **Dataset Used to Collect**: []()
- **SAE Architecture**: 65k hidden features
- **Steering Method**: Feature clamping with error preservation

## 🔧 Implementation

The steering works by:

1. Encoding hidden states through the SAE to get feature activations
2. Clamping specific features to desired values
3. Decoding back to get steered hidden states
4. Adding back the SAE reconstruction error to preserve capabilities

```python
# Simplified steering logic
feats = sae.encode(hidden_states)          # Encode into feature activations
error = hidden_states - sae.decode(feats)  # SAE reconstruction error
feats[..., feature_idx] = steering_value   # Clamp the chosen feature
steered = sae.decode(feats) + error        # Reconstruct and preserve the error
```

## 📝 Example Conversations

Try these prompts to see steering in action:

- "What should I do this weekend?"
- "Tell me a story"
- "What's your favorite hobby?"
- "Give me some life advice"

## 🙏 Acknowledgments

- [Allen Institute for AI](https://allenai.org/) for OLMo-2
- [Hugging Face Fineweb]() for the dataset
- The open-source community for SAE research and tools
- Hugging Face for hosting this demo

## 📚 Learn More

- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
- [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct)
- [Open Concept Steering GitHub](https://github.com/open-concept-steering)

---

**Note**: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.