+---
+library_name: sae-lens
+pipeline_tag: feature-extraction
+---
+# Preference Instability in Reward Models: SAE Checkpoints
+This repository contains pretrained Sparse Autoencoder (SAE) checkpoints presented in the paper [Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders](https://huggingface.co/papers/2605.16339).
+These SAEs are used to detect and mitigate preference instability in reward models by isolating "unstable features" in a sparse latent space. The method identifies features that respond inconsistently to semantics-preserving variations of the input, then applies steering or correction to those features at inference time.
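
As a rough illustration of what "responding inconsistently" to meaning-preserving variations could look like in practice, the sketch below scores each SAE feature by how much its activation changes between an original input and its paraphrases. This is not the paper's scoring rule: the mean-absolute-change metric, the function names `instability_scores` and `top_unstable`, and the toy activations are all assumptions made for the example.

```python
from typing import List, Sequence


def instability_scores(
    original: Sequence[float],
    variants: Sequence[Sequence[float]],
) -> List[float]:
    """Mean absolute change of each feature activation across variants.

    Hypothetical metric for illustration; the paper may define
    instability differently.
    """
    if not variants:
        raise ValueError("need at least one variant")
    scores = []
    for j, base in enumerate(original):
        total = sum(abs(v[j] - base) for v in variants)
        scores.append(total / len(variants))
    return scores


def top_unstable(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k features with the highest instability score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy SAE activations: 4 features, one original input, two paraphrases.
original = [0.0, 1.0, 0.5, 2.0]
variants = [
    [0.0, 1.0, 1.5, 2.0],
    [0.0, 1.0, 0.0, 1.0],
]
scores = instability_scores(original, variants)
unstable = top_unstable(scores, k=2)
print(scores, unstable)
```

Features 0 and 1 never change across the paraphrases, so they score zero; features 2 and 3 move and are the ones flagged as candidates for steering or correction.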
+## Resources
+- **Paper**: [https://huggingface.co/papers/2605.16339](https://huggingface.co/papers/2605.16339)
+- **Code**: [Official GitHub Repository](https://github.com/shunchang-liu/pisa)
+- **Library**: [SAELens](https://github.com/jbloomAus/SAELens)
+## Supported Reward Models
+The SAEs in this repository were trained on the hidden states of the following reward models:
+- `PKU-Alignment/beaver-7b-v2.0-reward`
+- `Skywork/Skywork-Reward-V2-Llama-3.1-8B`
+- `Skywork/Skywork-Reward-V2-Qwen3-4B`
+- `ethz-spylab/poisoned-reward-7b-SUDO-10`
+Checkpoints are typically provided for layers 4, 12, 20, or 28 depending on the specific experiment.
+## Usage
+You can download the pretrained SAE checkpoints using the following snippet:
+```python
+from huggingface_hub import snapshot_download
+# Download every file in the repository into ./checkpoints.
+local_path = snapshot_download(
+    repo_id="Shunchang/sae-rm-checkpoints",
+    repo_type="model",
+    local_dir="./checkpoints",
+)
+print(f"Checkpoints saved to {local_path}")
+```
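
The folder layout of the downloaded checkpoints is not described here, so it can help to inspect `./checkpoints` before loading anything. The following is a minimal sketch using only the standard library; the helper name `group_files_by_folder` is hypothetical, and skipping hidden entries assumes that any dot-prefixed folders hold download metadata rather than checkpoints.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_files_by_folder(root: str) -> Dict[str, List[str]]:
    """Map each sub-folder (relative to root) to the file names it holds."""
    base = Path(root)
    grouped: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        # Skip hidden entries (assumed to be cache or metadata folders).
        if any(part.startswith(".") for part in rel.parts):
            continue
        if path.is_file():
            grouped[rel.parent.as_posix()].append(path.name)
    return dict(grouped)
```

Calling `group_files_by_folder("./checkpoints")` after the download returns one entry per folder, which makes it easier to see which reward model and layer each checkpoint belongs to.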
+## Citation
+```bibtex
+@article{liu2026preference,
+  title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
+  author={Liu, Shunchang and others},
+  journal={arXiv preprint arXiv:2605.16339},
+  year={2026}
+}
+```