---
license: mit
---
# Sparse Autoencoders

We are experimenting with how sparse autoencoders [1] can make reinforcement learning from human feedback (RLHF) more interpretable.

[1] Bricken et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", Transformer Circuits Thread, 2023.
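
For readers new to the technique, here is a minimal sketch of the sparse autoencoder formulation described in [1]: a ReLU encoder applied to the input after subtracting the decoder bias, a linear decoder, and a loss that combines reconstruction error with an L1 penalty on the hidden activations. It is not necessarily the configuration used in this repository. The class name `SparseAutoencoder`, the parameter `l1_coeff`, the initialization, and the dimensions are all assumptions made for illustration, and details such as decoder weight normalization and training are omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Illustrative sparse autoencoder (hypothetical names and sizes)."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / d_in ** 0.5  # assumed initialization scale
        # Encoder weights: d_hidden x d_in, decoder weights: d_in x d_hidden.
        self.w_enc = [[rng.gauss(0.0, scale) for _ in range(d_in)]
                      for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0.0, scale) for _ in range(d_hidden)]
                      for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        # f = ReLU(W_enc (x - b_dec) + b_enc)
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        # x_hat = W_dec f + b_dec
        return [r + b for r, b in zip(matvec(self.w_dec, f), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Mean squared reconstruction error plus L1 sparsity penalty.
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(a) for a in f)
        return mse + l1_coeff * l1


# Example: a 4-dimensional activation mapped to 16 candidate features.
sae = SparseAutoencoder(d_in=4, d_hidden=16)
features = sae.encode([0.5, -1.0, 2.0, 0.0])
```

The hidden layer is wider than the input so that individual features can specialize, and raising `l1_coeff` pushes more feature activations to zero at the cost of reconstruction quality.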