Arongil / Sparse-Autoencoders




	---
	license: mit
	---

	# Sparse Autoencoders

	We are experimenting with how sparse autoencoders [1] can make reinforcement learning from human feedback (RLHF) more interpretable.

	[1] Bricken, et al., "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", Transformer Circuits Thread, 2023.