Add model card and metadata for SAE checkpoints

by nielsr HF Staff - opened 29 days ago

base: refs/heads/main ← from: refs/pr/1


+49 -0

nielsr

29 days ago

Hi, I'm Niels from the community science team at Hugging Face. This pull request adds a model card for the Sparse Autoencoder (SAE) checkpoints associated with the paper "Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders".

The model card includes:

- Metadata for sae-lens and the feature-extraction pipeline.
- Links to the research paper and the official GitHub repository.
- A description of the reward models targeted by these SAEs.
- Sample usage for downloading the checkpoints using huggingface_hub.
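
For reference, downloading the checkpoints with huggingface_hub could look roughly like the sketch below. It is not copied from the model card in this PR: the repository ID is a placeholder, the helper name `download_sae_checkpoints` is hypothetical, and the only library call used is `snapshot_download`.

```python
from pathlib import Path
from typing import Optional

# Placeholder repository ID (an assumption for illustration);
# replace it with the actual repo that hosts the SAE checkpoints.
REPO_ID = "your-org/your-sae-checkpoints"


def download_sae_checkpoints(repo_id: str, local_dir: Optional[str] = None) -> Path:
    """Download every file in a Hub repository and return the local folder.

    Hypothetical helper, not part of huggingface_hub or the model card.
    """
    # Imported lazily so this module can be imported even when
    # huggingface_hub is not installed (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    # snapshot_download fetches the repository contents and returns the
    # path of the local snapshot; with local_dir=None the files stay in
    # the default Hugging Face cache.
    local_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(local_path)
```

Calling `download_sae_checkpoints(REPO_ID, local_dir="sae_checkpoints")` would place the files in a local `sae_checkpoints` folder. How the checkpoints are then loaded (for example with sae-lens) depends on the file layout in the repository, which this sketch does not assume.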

Add model card and metadata for SAE checkpoints (commit 3b1159dd)


Cannot merge

This branch has merge conflicts in the following files:

README.md
