---
title: Olmo2 Sae Steering Demo
emoji: ๐Ÿ“ˆ
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
---
# ๐ŸŽ›๏ธ OLMo-2 Feature Steering Demo
This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!
## ๐ŸŒŸ What is Feature Steering?
Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior, like making it talk about superheroes, Japan, or baseball!
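Conceptually, an SAE is a wide single-layer autoencoder: a ReLU encoder maps a hidden state to many mostly-zero feature activations, and a linear decoder maps them back. A minimal sketch with toy dimensions and random weights (standing in for the trained SAE, not the released one):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 64  # toy sizes; the real SAE is much wider (65k features)

# Random weights stand in for trained SAE parameters
W_enc = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) * 0.1
b_dec = np.zeros(d_model)

def encode(x):
    # Sparse feature activations: ReLU keeps most features at zero
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    # Reconstruct an approximate hidden state from the feature activations
    return W_dec @ f + b_dec

x = rng.normal(size=d_model)  # a model hidden state
f = encode(x)                 # interpretable feature activations
x_hat = decode(f)             # approximate reconstruction of x
```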
## ๐ŸŽฎ Available Steering Features
- **๐Ÿฆธ Superhero/Batman** - Activates superhero and vigilante themes
- **๐Ÿ—พ Japan** - Steers responses toward Japanese culture and topics
- **โšพ Baseball** - Introduces baseball-related content
## ๐Ÿš€ How to Use
1. **Choose a steering type** from the dropdown (or keep "None" for baseline)
2. **Adjust the strength** slider (1.0 is default, higher = stronger effect)
3. **Type your message** and press Enter
4. **Compare the outputs** - left shows unsteered, right shows steered responses
5. **Continue the conversation** - steering effects persist across turns!
## ๐Ÿ“Š Technical Details
- **Blog Post**: []()
- **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct)
- **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1)
- **Dataset**: []()
- **Dataset Used to Collect**: []()
- **SAE Architecture**: 65k hidden features
- **Steering Method**: Feature clamping with error preservation
## ๐Ÿ”ง Implementation
The steering works by:
1. Encoding hidden states through the SAE to get feature activations
2. Clamping specific features to desired values
3. Decoding back to get steered hidden states
4. Adding back the SAE reconstruction error to preserve capabilities
```python
# Simplified steering logic
feats = sae.encode(hidden_states)          # get feature activations
error = hidden_states - sae.decode(feats)  # SAE reconstruction error
feats[..., feature_idx] = steering_value   # clamp the target feature
steered = sae.decode(feats) + error        # reconstruct + preserve error
```
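In the demo this logic runs on a transformer layer's hidden states during generation. A self-contained sketch of the clamping step, using toy numpy stand-ins for the SAE (`steer` and the weights are illustrative, not the app's actual API):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 8, 64  # toy sizes
W_enc = rng.normal(size=(d_sae, d_model)) * 0.1
W_dec = rng.normal(size=(d_model, d_sae)) * 0.1

def encode(h):
    return np.maximum(h @ W_enc.T, 0.0)

def decode(f):
    return f @ W_dec.T

def steer(hidden_states, feature_idx, steering_value):
    """Clamp one SAE feature while preserving the reconstruction error."""
    feats = encode(hidden_states)
    # The error term is everything the SAE fails to reconstruct; adding it
    # back means only the clamped feature changes the hidden state
    error = hidden_states - decode(feats)
    feats[..., feature_idx] = steering_value
    return decode(feats) + error

h = rng.normal(size=(2, d_model))  # (seq_len, d_model) toy hidden states
out = steer(h, feature_idx=3, steering_value=5.0)
```

A useful sanity check: clamping a feature to its original activation returns the hidden states unchanged, since the preserved error exactly cancels the reconstruction loss.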
## ๐Ÿ“– Example Conversations
Try these prompts to see steering in action:
- "What should I do this weekend?"
- "Tell me a story"
- "What's your favorite hobby?"
- "Give me some life advice"
## ๐Ÿ™ Acknowledgments
- [Allen Institute for AI](https://allenai.org/) for OLMo-2
- [Hugging Face FineWeb]() for the dataset
- The open-source community for SAE research and tools
- Hugging Face for hosting this demo
## ๐Ÿ“š Learn More
- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
- [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct)
- [Open Concept Steering GitHub](https://github.com/open-concept-steering)
---
**Note**: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.