Add model card and metadata for SAE checkpoints

#1
by nielsr HF Staff - opened

Hi, I'm Niels from the community science team at Hugging Face. This pull request adds a model card for the Sparse Autoencoder (SAE) checkpoints associated with the paper "Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders".

The model card includes:

  • Metadata tagging the repository with the sae-lens library and the feature-extraction pipeline.
  • Links to the research paper and the official GitHub repository.
  • A description of the reward models targeted by these SAEs.
  • Sample usage for downloading the checkpoints using huggingface_hub.
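
As a rough illustration of the last bullet, a download with huggingface_hub might look like the sketch below. The repository ID is a hypothetical placeholder, since this description does not name the repository, and the helper name `download_sae_checkpoints` is invented for illustration. The snippet in the actual model card may differ.

```python
from pathlib import Path
from typing import Optional

# Hypothetical placeholder: replace with the real "<owner>/<repo>" ID of the
# SAE checkpoint repository. It is not stated in this pull request description.
REPO_ID = "your-org/your-sae-checkpoints"


def download_sae_checkpoints(repo_id: str = REPO_ID,
                             local_dir: Optional[str] = None) -> Path:
    """Download every file in the checkpoint repo and return the local folder."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches the whole repository (or reuses the local cache)
    # and returns the path of the downloaded snapshot as a string.
    snapshot_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(snapshot_path)


# Example usage (requires network access and a valid repo ID):
#   ckpt_dir = download_sae_checkpoints("<owner>/<repo>")
#   for f in sorted(ckpt_dir.rglob("*")):
#       print(f)
```

Fetching the full snapshot keeps the sketch independent of the checkpoint file layout, which is not described here. If only some files are needed, `snapshot_download` also accepts an `allow_patterns` argument to filter by filename.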

Cannot merge
This branch has merge conflicts in the following files:
  • README.md
