---
title: Olmo2 Sae Steering Demo
emoji: ๐Ÿ“ˆ
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.32.0
app_file: app.py
pinned: true
license: mit
short_description: Steering OLMo-2-7b using sparse autoencoders (SAEs)
---
# ๐ŸŽ›๏ธ OLMo-2 Feature Steering Demo
This demo showcases how **Sparse Autoencoders (SAEs)** can be used to steer the behavior of OLMo-2 7B by manipulating specific learned features. Watch how the model's responses change dramatically when we activate different semantic features!
## ๐ŸŒŸ What is Feature Steering?
Feature steering uses SAEs to decompose a language model's internal representations into interpretable features. By manipulating these features, we can control specific aspects of the model's behavior, like making it talk about superheroes, Japan, or baseball!
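Conceptually, an SAE is a wide single-layer autoencoder: a ReLU encoder maps a hidden state to many mostly-zero feature activations, and a linear decoder maps them back. A minimal sketch with toy dimensions and random weights (standing in for the trained SAE, not the released one):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 64  # toy sizes; the real SAE is much wider (65k features)

# Random weights stand in for trained SAE parameters
W_enc = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) * 0.1
b_dec = np.zeros(d_model)

def encode(x):
    # Sparse feature activations: ReLU keeps most features at zero
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    # Reconstruct an approximate hidden state from the feature activations
    return W_dec @ f + b_dec

x = rng.normal(size=d_model)  # a model hidden state
f = encode(x)                 # interpretable feature activations
x_hat = decode(f)             # approximate reconstruction of x
```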
## ๐ŸŽฎ Available Steering Features
- **๐Ÿฆธ Superhero/Batman** - Activates superhero and vigilante themes
- **๐Ÿ—พ Japan** - Steers responses toward Japanese culture and topics
- **โšพ Baseball** - Introduces baseball-related content
## ๐Ÿš€ How to Use
1. **Choose a steering type** from the dropdown (or keep "None" for baseline)
2. **Adjust the strength** slider (1.0 is default, higher = stronger effect)
3. **Type your message** and press Enter
4. **Compare the outputs** - left shows unsteered, right shows steered responses
5. **Continue the conversation** - steering effects persist across turns!
## ๐Ÿ“Š Technical Details
- **Blog Post**: []()
- **Base Model**: [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct)
- **SAE Model**: [open-concept-steering/olmo2-7b-sae-65k-v1](https://huggingface.co/open-concept-steering/olmo2-7b-sae-65k-v1)
- **Dataset**: []()
- **Dataset Used to Collect**: []()
- **SAE Architecture**: 65k hidden features
- **Steering Method**: Feature clamping with error preservation
## ๐Ÿ”ง Implementation
The steering works by:
1. Encoding hidden states through the SAE to get feature activations
2. Clamping specific features to desired values
3. Decoding back to get steered hidden states
4. Adding back the SAE reconstruction error to preserve capabilities
```python
# Simplified steering logic
feats = sae.encode(hidden_states)          # get feature activations
error = hidden_states - sae.decode(feats)  # SAE reconstruction error
feats[..., feature_idx] = steering_value   # clamp the target feature
steered = sae.decode(feats) + error        # reconstruct + preserve error
```
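In the demo this logic runs on a transformer layer's hidden states during generation. A self-contained sketch of the clamping step, using toy numpy stand-ins for the SAE (`steer` and the weights are illustrative, not the app's actual API):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 8, 64  # toy sizes
W_enc = rng.normal(size=(d_sae, d_model)) * 0.1
W_dec = rng.normal(size=(d_model, d_sae)) * 0.1

def encode(h):
    return np.maximum(h @ W_enc.T, 0.0)

def decode(f):
    return f @ W_dec.T

def steer(hidden_states, feature_idx, steering_value):
    """Clamp one SAE feature while preserving the reconstruction error."""
    feats = encode(hidden_states)
    # The error term is everything the SAE fails to reconstruct; adding it
    # back means only the clamped feature changes the hidden state
    error = hidden_states - decode(feats)
    feats[..., feature_idx] = steering_value
    return decode(feats) + error

h = rng.normal(size=(2, d_model))  # (seq_len, d_model) toy hidden states
out = steer(h, feature_idx=3, steering_value=5.0)
```

A useful sanity check: clamping a feature to its original activation returns the hidden states unchanged, since the preserved error exactly cancels the reconstruction loss.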
## ๐Ÿ“– Example Conversations
Try these prompts to see steering in action:
- "What should I do this weekend?"
- "Tell me a story"
- "What's your favorite hobby?"
- "Give me some life advice"
## ๐Ÿ™ Acknowledgments
- [Allen Institute for AI](https://allenai.org/) for OLMo-2
- [Hugging Face FineWeb]() for the dataset
- The open-source community for SAE research and tools
- Hugging Face for hosting this demo
## ๐Ÿ“š Learn More
- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
- [OLMo-2 Blog Post](https://blog.allenai.org/olmo-2-1124-7b-instruct)
- [Open Concept Steering GitHub](https://github.com/open-concept-steering)
---
**Note**: Very high steering strengths (>1.5x) may cause incoherent outputs as the feature activation moves outside its natural range.