Add model card and metadata for SAE checkpoints

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +49 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ ---
+ library_name: sae-lens
+ pipeline_tag: feature-extraction
+ ---
+
+ # Preference Instability in Reward Models: SAE Checkpoints
+
+ This repository contains the pretrained Sparse Autoencoder (SAE) checkpoints from the paper [Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders](https://huggingface.co/papers/2605.16339).
+
+ These SAEs are designed to detect and mitigate preference instability in reward models by isolating "unstable features" in a sparse latent space. The method identifies features that respond inconsistently to semantics-preserving variations of an input, then applies steering or correction to those features at inference time.
+
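As a rough illustration of the idea above, the sketch below scores each SAE feature by how much its activation varies across semantics-preserving variants of one input, flags the high-variance features, and zeroes them. The variance criterion, the threshold, and the function names `instability_scores`, `unstable_features`, and `suppress` are assumptions made for this example; the paper's actual detection and mitigation procedures may differ.

```python
from statistics import pvariance


def instability_scores(activations: list[list[float]]) -> list[float]:
    """Per-feature variance of SAE activations across input variants.

    activations[i][j] is the activation of feature j on the i-th
    semantics-preserving variant of the same input.
    """
    n_features = len(activations[0])
    return [pvariance([row[j] for row in activations]) for j in range(n_features)]


def unstable_features(activations: list[list[float]], threshold: float) -> list[int]:
    """Indices of features whose variance exceeds the (assumed) threshold."""
    scores = instability_scores(activations)
    return [j for j, score in enumerate(scores) if score > threshold]


def suppress(features: list[float], unstable: list[int]) -> list[float]:
    """Zero the flagged features: a crude stand-in for steering or correction."""
    blocked = set(unstable)
    return [0.0 if j in blocked else a for j, a in enumerate(features)]


# Toy data: three paraphrases of one prompt, three SAE features.
acts = [
    [1.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [1.0, 2.0, 2.0],
]
flagged = unstable_features(acts, threshold=0.5)  # only feature 1 varies
cleaned = suppress([0.3, 5.0, 1.2], flagged)
```

In practice the activation matrix would come from encoding a reward model's hidden states with one of the SAEs, and the cleaned features would be decoded back into the model's residual stream.
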
+ ## Resources
+ - **Paper**: [https://huggingface.co/papers/2605.16339](https://huggingface.co/papers/2605.16339)
+ - **Code**: [Official GitHub Repository](https://github.com/shunchang-liu/pisa)
+ - **Library**: [SAELens](https://github.com/jbloomAus/SAELens)
+
+ ## Supported Reward Models
+ The SAEs in this repository were trained on the hidden states of the following reward models:
+ - `PKU-Alignment/beaver-7b-v2.0-reward`
+ - `Skywork/Skywork-Reward-V2-Llama-3.1-8B`
+ - `Skywork/Skywork-Reward-V2-Qwen3-4B`
+ - `ethz-spylab/poisoned-reward-7b-SUDO-10`
+
+ Depending on the experiment, checkpoints are typically provided for layers 4, 12, 20, or 28 of each model.
+
+ ## Usage
+
+ Download the pretrained SAE checkpoints with `huggingface_hub`:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download every file in the checkpoint repository into ./checkpoints
+ snapshot_download(
+     repo_id="Shunchang/sae-rm-checkpoints",
+     repo_type="model",
+     local_dir="./checkpoints",
+ )
+ ```
+
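If only some models or layers are needed, `snapshot_download` accepts an `allow_patterns` argument that restricts the download to matching files. The sketch below builds such patterns; the `<model_dir>/layer_<n>/...` directory layout and the helper names `layer_patterns`, `would_download`, and `download_subset` are hypothetical, so check the repository's actual file listing before relying on them.

```python
from fnmatch import fnmatch


def layer_patterns(model_dir: str, layers: list[int]) -> list[str]:
    """Build glob patterns for a HYPOTHETICAL '<model_dir>/layer_<n>/...' layout."""
    return [f"{model_dir}/layer_{n}/*" for n in layers]


def would_download(path: str, patterns: list[str]) -> bool:
    """Preview whether a repository file path matches any of the patterns."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def download_subset(patterns: list[str], local_dir: str = "./checkpoints") -> str:
    """Fetch only the matching files and return the local path (needs network access)."""
    # Imported lazily so the helpers above can be used without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns,
    )


# Hypothetical directory name; adjust to the repository's real layout.
patterns = layer_patterns("Skywork-Reward-V2-Qwen3-4B", [4, 12])
```

Calling `download_subset(patterns)` would then fetch only the files under those two layer directories, assuming the layout guess is right.
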
+ ## Citation
+ ```bibtex
+ @article{liu2024preference,
+   title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
+   author={Liu, Shunchang and others},
+   journal={arXiv preprint arXiv:2605.16339},
+   year={2026}
+ }
+ ```