| title: Olmo2 Sae Steering Demo | |
| emoji: ๐ | |
| colorFrom: blue | |
| colorTo: yellow | |
| sdk: gradio | |
| sdk_version: 5.32.0 | |
| app_file: app.py | |
| pinned: true | |
| license: mit | |
| short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs) | |
| # OLMo-2 Feature Steering Demo | |
| This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features! | |
| ## What is Feature Steering? | |
| Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. Manipulating these features controls specific aspects of the model's behavior, such as making it talk about superheroes, Japan, or baseball. | |
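To make "decomposing into features" concrete, here is a minimal sketch of a toy SAE written with plain Python lists in place of tensors. The weights, dimensions, and function names are made up for illustration and are not taken from the SAE used in this demo, which has 65k features.

```python
# Toy SAE over a 2-dimensional hidden state with 3 features.
# All weights are made up for illustration.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one encoder row per feature
B_ENC = [0.0, 0.0, -1.5]                       # negative bias keeps feature 2 off for small inputs
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one decoder direction per feature


def encode(hidden):
    """Feature activations: ReLU(W_enc @ hidden + b_enc)."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(W_ENC, B_ENC)
    ]


def decode(feats):
    """Reconstruction: decoder directions weighted by their activations."""
    return [
        sum(f * row[i] for f, row in zip(feats, W_DEC))
        for i in range(len(W_DEC[0]))
    ]


hidden = [2.0, 0.0]
feats = encode(hidden)
active = [i for i, f in enumerate(feats) if f > 0]
print("activations:", feats)
print("active features:", active)
print("reconstruction:", decode(feats))
```

In this toy setup the reconstruction is close to, but not identical to, the original hidden state. That gap is the reconstruction error a steering method has to account for.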
| ## Available Steering Features | |
| - **🦸 Superhero/Batman** - Activates superhero and vigilante themes | |
| - **🗾 Japan** - Steers responses toward Japanese culture and topics | |
| - **⚾ Baseball** - Introduces baseball-related content | |
| ## How to Use | |
| 1. **Choose a steering type** from the dropdown (or keep "None" for baseline) | |
| 2. **Adjust the strength** slider (1.0 is default, higher = stronger effect) | |
| 3. **Type your message** and press Enter | |
| 4. **Compare the outputs** - the left side shows the unsteered response, the right side the steered one | |
| 5. **Continue the conversation** - steering effects persist across turns! | |
| ## Technical Details | |
| - **Blog Post**: []() | |
| - **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct) | |
| - **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1) | |
| - **Dataset**: []() | |
| - **Dataset Used to Collect**: []() | |
| - **SAE Architecture**: 65k hidden features | |
| - **Steering Method**: Feature clamping with error preservation | |
| ## Implementation | |
| The steering works by: | |
| 1. Encoding hidden states through the SAE to get feature activations | |
| 2. Clamping specific features to desired values | |
| 3. Decoding back to get steered hidden states | |
| 4. Adding back the SAE reconstruction error to preserve capabilities | |
| ```python | |
| # Simplified steering logic | |
| feats = sae.encode(hidden_states) # Get feature activations | |
| error = hidden_states - sae.decode(feats) # SAE reconstruction error, taken before clamping | |
| feats[..., feature_idx] = steering_value # Clamp the chosen feature | |
| steered = sae.decode(feats) + error # Reconstruct + preserve error | |
| ``` | |
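The four steps can also be tried end to end on a toy example. The sketch below is not the demo's actual code: `ToySAE`, its weights, and the `steer` helper are hypothetical stand-ins, and plain Python lists replace tensors so it runs without any dependencies. It assumes the reconstruction error is measured on the unmodified features, before clamping.

```python
class ToySAE:
    """Tiny stand-in for a real SAE: 2-dim hidden state, 3 features (made-up weights)."""

    def __init__(self):
        self.w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
        self.b_enc = [0.0, 0.0, -1.5]
        self.w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

    def encode(self, hidden):
        return [max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, feats):
        return [sum(f * row[i] for f, row in zip(feats, self.w_dec))
                for i in range(len(self.w_dec[0]))]


def steer(sae, hidden, feature_idx, steering_value):
    """Clamp one feature and add back the SAE reconstruction error."""
    feats = sae.encode(hidden)                                   # step 1: encode
    # The error is computed from the unmodified features, before clamping.
    error = [h - r for h, r in zip(hidden, sae.decode(feats))]
    feats[feature_idx] = steering_value                          # step 2: clamp
    decoded = sae.decode(feats)                                  # step 3: decode
    return [d + e for d, e in zip(decoded, error)]               # step 4: add error back


sae = ToySAE()
hidden = [2.0, 0.0]

# Clamping a feature to the value it already has returns the input unchanged.
unchanged = steer(sae, hidden, feature_idx=1, steering_value=0.0)

# Clamping feature 1 to 4.0 shifts the state along that feature's decoder direction.
steered = steer(sae, hidden, feature_idx=1, steering_value=4.0)

print(unchanged, steered)
```

Adding the error back means the only change to the hidden state is the clamped feature's contribution. Everything the SAE fails to reconstruct passes through untouched, which is the sense in which capabilities are preserved.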
| ## Example Conversations | |
| Try these prompts to see steering in action: | |
| - "What should I do this weekend?" | |
| - "Tell me a story" | |
| - "What's your favorite hobby?" | |
| - "Give me some life advice" | |
| ## Acknowledgments | |
| - [Allen Institute for AI](https://allenai.org/) for OLMo-2 | |
| - [Hugging Face FineWeb]() for the dataset | |
| - The open-source community for SAE research and tools | |
| - Hugging Face for hosting this demo | |
| ## Learn More | |
| - [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features) | |
| - [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) | |
| - [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct) | |
| - [Open Concept Steering GitHub](https://github.com/open-concept-steering) | |
| --- | |
| **Note**: Steering strengths above roughly 1.5x may produce incoherent output, because the feature activation is pushed outside its natural range. | |
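This README does not spell out how the strength slider maps to a clamp value. One plausible reading, which is an assumption and not confirmed by the source, is that the strength multiplies a reference activation for the chosen feature, such as the largest value observed on ordinary text. The helper name and the reference value below are hypothetical.

```python
def clamp_value(strength, reference_activation):
    """Map a slider strength to a clamp target (hypothetical scaling rule)."""
    if strength < 0:
        raise ValueError("strength must be non-negative")
    return strength * reference_activation


REFERENCE_ACTIVATION = 8.0  # made-up "largest natural activation" for one feature

for strength in (0.5, 1.0, 1.5, 2.0):
    value = clamp_value(strength, REFERENCE_ACTIVATION)
    in_range = value <= REFERENCE_ACTIVATION
    print(f"strength={strength}: clamp to {value} (within natural range: {in_range})")
```

Under this reading, any strength above 1.0 requests an activation the model never produced on its own, which would be consistent with outputs degrading as the strength grows.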