---
title: Open Concept Steering
emoji: ๐
colorFrom: indigo
colorTo: indigo
sdk: static
license: mit
short_description: Training SAEs
---
# Open Concept Steering

Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.
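At its core, the approach trains a wide, sparse autoencoder on a model's residual stream activations so that individual hidden units line up with human-interpretable concepts. As a rough illustration only (the layer sizes, activation function, and loss weighting below are assumptions, not this repo's exact architecture), a minimal SAE looks something like this:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual stream vectors (illustrative sketch only)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete: d_hidden is typically much larger than d_model.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Encode into a wide, mostly-zero feature space.
        features = torch.relu(self.encoder(x))
        # Reconstruct the original residual stream vector.
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```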
## Features

Coming soon!

- **Universal Model Support**: Train SAEs on any Hugging Face transformer model
- **Feature Discovery**: Find interpretable features representing specific concepts
- **Concept Steering**: Amplify or suppress discovered features to influence model behavior (see the sketch after this list)
- **Interactive Chat**: Chat with models while manipulating their internal features
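To give a concrete sense of what steering means mechanically, the sketch below adds a scaled SAE decoder direction to one layer's hidden states via a PyTorch forward hook. The layer index, the scale, and the name `feature_direction` are placeholders for illustration, not values or APIs from this library:

```python
import torch


def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Build a forward hook that adds `scale * feature_direction` to a layer's output."""

    def hook(module, inputs, output):
        # Decoder layers usually return a tuple; element 0 holds the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * feature_direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook


# Hypothetical usage with a Hugging Face causal LM whose decoder blocks live at
# model.model.layers (layer 16 and scale 8.0 are arbitrary choices):
# handle = model.model.layers[16].register_forward_hook(
#     make_steering_hook(feature_direction, scale=8.0)
# )
# ...generate text while the hook is active...
# handle.remove()
```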
## Pre-trained Models

In the spirit of fully open-source models, we have started training SAEs on [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct).

We provide pre-trained SAEs and discovered features for popular models on Hugging Face. Each model repository will include:

- Trained SAE weights
- Catalog of discovered interpretable features
- Example steering configurations
## Datasets

The dataset collected from OLMo 2 7B's middle layer is available [here](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo). It contains about 600 million residual stream vectors.
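For context, residual stream vectors like these can be collected by registering a forward hook on a decoder layer and saving its output while running text through the model. The sketch below uses the standard `transformers` API; the layer index and the single-prompt setup are simplifications, not this repo's exact pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

collected = []


def save_residual(module, inputs, output):
    # Decoder layers return a tuple; element 0 holds the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    collected.append(hidden.detach().to(torch.float32).cpu())


# Hook a middle decoder layer (index 16 is an assumption, not the repo's exact choice).
handle = model.model.layers[16].register_forward_hook(save_residual)

inputs = tokenizer("The Golden Gate Bridge is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

print(collected[0].shape)  # (batch, sequence_length, d_model)
```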
More to come!
## Quick Start
## Examples

Check out the [steered OLMo 2 7B model](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo)!
## License

This project is licensed under the MIT License.

## Citation

If you feel compelled to cite this library in your work, feel free to do so however you please.
## Acknowledgments

This project builds upon the work described in [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning](https://transformer-circuits.pub/2023/monosemantic-features), [Update on how we train SAEs](https://transformer-circuits.pub/2024/april-update/index.html#training-saes), and [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/) by Anthropic, and it absolutely would not have been possible without that work.