---
title: Olmo2 Sae Steering Demo
emoji: 🎛️
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
---

# 🎛️ OLMo-2 Feature Steering Demo

This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!

## 🔍 What is Feature Steering?

Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior, like making it talk about superheroes, Japan, or baseball!
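
As a rough sketch of that decomposition (toy dimensions and randomly initialized weights, purely illustrative; the released SAE's weights are learned and much larger), an SAE maps a hidden state to sparse, non-negative feature activations and reconstructs it from learned decoder directions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes for illustration only

# Hypothetical SAE parameters (the real ones are learned during training)
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

def encode(h):
    """Map a hidden state to sparse, non-negative feature activations."""
    return np.maximum(0.0, (h - b_dec) @ W_enc)

def decode(f):
    """Reconstruct the hidden state from feature activations."""
    return f @ W_dec + b_dec

h = rng.normal(size=d_model)  # a hidden state from the language model
f = encode(h)                 # interpretable feature activations
h_hat = decode(f)             # approximate reconstruction of h
```

Each nonzero entry of `f` corresponds to one learned feature; steering clamps one of these entries before decoding.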

## 🎮 Available Steering Features

- **🦸 Superhero/Batman** - Activates superhero and vigilante themes
- **🗾 Japan** - Steers responses toward Japanese culture and topics
- **⚾ Baseball** - Introduces baseball-related content

## 🚀 How to Use

1. **Choose a steering type** from the dropdown (or keep "None" for a baseline)
2. **Adjust the strength** slider (1.0 is the default; higher = stronger effect)
3. **Type your message** and press Enter
4. **Compare the outputs** - the left column shows unsteered responses, the right shows steered ones
5. **Continue the conversation** - steering effects persist across turns!
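
One simple way to think about the strength slider (a hypothetical sketch; the actual mapping lives in app.py and may differ) is that the feature's clamp target is a typical high activation scaled by the slider value:

```python
def clamp_target(base_activation: float, strength: float = 1.0) -> float:
    """Hypothetical strength mapping: scale the feature's clamp value.

    base_activation: an assumed 'typical high' activation for the feature.
    strength: the slider value; 1.0 clamps to the base value unchanged.
    """
    return base_activation * strength

clamp_target(10.0)       # -> 10.0 (default strength)
clamp_target(10.0, 2.0)  # -> 20.0 (stronger steering, may hurt coherence)
```

This is why large slider values can degrade output quality: the clamped activation drifts away from values the model ever produces naturally.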

## 📊 Technical Details

- **Blog Post**: []()
- **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct)
- **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1)
- **Dataset**: []()
- **Dataset Used to Collect**: []()
- **SAE Architecture**: 65k hidden features
- **Steering Method**: Feature clamping with error preservation

## 🔧 Implementation

The steering works by:

1. Encoding hidden states through the SAE to get feature activations
2. Clamping specific features to desired values
3. Decoding back to get steered hidden states
4. Adding back the SAE reconstruction error to preserve capabilities

```python
# Simplified steering logic
feats = sae.encode(hidden_states)          # Encode into feature activations
error = hidden_states - sae.decode(feats)  # SAE reconstruction error
feats[..., feature_idx] = steering_value   # Clamp the chosen feature
steered = sae.decode(feats) + error        # Reconstruct and preserve the error
```

## 📝 Example Conversations

Try these prompts to see steering in action:

- "What should I do this weekend?"
- "Tell me a story"
- "What's your favorite hobby?"
- "Give me some life advice"

## 🙏 Acknowledgments

- [Allen Institute for AI](https://allenai.org/) for OLMo-2
- [Hugging Face Fineweb]() for the dataset
- The open-source community for SAE research and tools
- Hugging Face for hosting this demo

## 📚 Learn More

- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
- [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct)
- [Open Concept Steering GitHub](https://github.com/open-concept-steering)

---

**Note**: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.