---
license: mit
---
# Sparse Autoencoders
We are experimenting with how sparse autoencoders [1] can help make reinforcement learning from human feedback (RLHF) more interpretable.
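
For readers new to the technique, here is a minimal sketch of a sparse autoencoder's forward pass and loss, loosely following the setup described in [1]. It is not this repository's code: the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the layer sizes, and the `l1_coeff` value are illustrative assumptions, and random data stands in for real model activations.

```python
import numpy as np


def init_params(d_model, d_hidden, seed=0):
    """Randomly initialise a hypothetical SAE with an overcomplete hidden layer."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, 0.1, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }


def forward(params, x):
    """Return (features, reconstruction) for a batch x of shape (batch, d_model)."""
    # Centre the input with the decoder bias, then apply a ReLU encoder.
    centred = x - params["b_dec"]
    features = np.maximum(0.0, centred @ params["W_enc"] + params["b_enc"])
    # Decode the non-negative feature activations back into activation space.
    reconstruction = features @ params["W_dec"] + params["b_dec"]
    return features, reconstruction


def sae_loss(params, x, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty, each averaged over the batch."""
    features, reconstruction = forward(params, x)
    recon_error = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    l1_penalty = np.mean(np.sum(np.abs(features), axis=-1))
    return recon_error + l1_coeff * l1_penalty


# Stand-in activations (assumption): in practice these would come from a model.
activations = np.random.default_rng(1).normal(size=(8, 16))
params = init_params(d_model=16, d_hidden=64)
features, reconstruction = forward(params, activations)
loss = sae_loss(params, activations)
```

The L1 term pushes most feature activations toward zero, so each input is explained by a small number of active features. That sparsity is what makes the learned features candidates for interpretation.
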

[1] Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning," Transformer Circuits Thread, 2023.