---
license: mit
language:
- en
base_model:
- PKU-Alignment/beaver-7b-v2.0-reward
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
- Skywork/Skywork-Reward-V2-Qwen3-4B
- ethz-spylab/poisoned-reward-7b-SUDO-10
datasets:
- Anthropic/hh-rlhf
tags:
- sparse-autoencoder
- reward-model
- interpretability
- alignment
pipeline_tag: feature-extraction
---

# SAE Checkpoints for Preference Instability Detection and Mitigation

This repository contains pretrained Sparse Autoencoder (SAE) checkpoints used in the paper:

> **Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders**
> Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
> arXiv:2605.16339

[Paper (arXiv:2605.16339)](https://arxiv.org/abs/2605.16339)
[Code (GitHub)](https://github.com/shunchang-liu/pisa)

## Checkpoints

Each subfolder contains a Gated SAE trained on activations from one layer of the corresponding reward model, using the Anthropic HH-RLHF dataset. Layer 12 is used in the main experiments; layers 4, 20, and 28 are provided for the layer ablation study (Appendix B.5).

| | Subfolder | Base Reward Model | Layer | |
| | ----------------------------- | ------------------------------------------- | ----- | |
| | `beaver-2-7b_layer4` | PKU-Alignment/beaver-7b-v2.0-reward | 4 | |
| | `beaver-2-7b_layer12` | PKU-Alignment/beaver-7b-v2.0-reward | 12 | |
| | `beaver-2-7b_layer20` | PKU-Alignment/beaver-7b-v2.0-reward | 20 | |
| | `beaver-2-7b_layer28` | PKU-Alignment/beaver-7b-v2.0-reward | 28 | |
| | `llama-3-8b_layer4` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 4 | |
| | `llama-3-8b_layer12` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 12 | |
| | `llama-3-8b_layer20` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 20 | |
| | `llama-3-8b_layer28` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 28 | |
| | `qwen-3-4b_layer4` | Skywork/Skywork-Reward-V2-Qwen3-4B | 4 | |
| | `qwen-3-4b_layer12` | Skywork/Skywork-Reward-V2-Qwen3-4B | 12 | |
| | `qwen-3-4b_layer20` | Skywork/Skywork-Reward-V2-Qwen3-4B | 20 | |
| | `qwen-3-4b_layer28` | Skywork/Skywork-Reward-V2-Qwen3-4B | 28 | |
| | `llama-7b-poisoned_layer4` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 4 | |
| | `llama-7b-poisoned_layer12` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 12 | |
| | `llama-7b-poisoned_layer20` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 20 | |
| | `llama-7b-poisoned_layer28` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 28 | |
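
The subfolder names in the table follow a regular `<model key>_layer<N>` pattern, so they can be generated rather than typed by hand. The sketch below is illustrative only: `BASE_MODELS`, `LAYERS`, `subfolder_name`, and `all_subfolders` are hypothetical helper names and are not part of the released code.

```python
# Hypothetical helpers; these names are illustrative and not taken from the PISA repository.
BASE_MODELS = {
    "beaver-2-7b": "PKU-Alignment/beaver-7b-v2.0-reward",
    "llama-3-8b": "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    "qwen-3-4b": "Skywork/Skywork-Reward-V2-Qwen3-4B",
    "llama-7b-poisoned": "ethz-spylab/poisoned-reward-7b-SUDO-10",
}
LAYERS = (4, 12, 20, 28)


def subfolder_name(model_key: str, layer: int) -> str:
    """Return the checkpoint subfolder for a model key and layer, as listed in the table."""
    if model_key not in BASE_MODELS:
        raise KeyError(f"unknown model key: {model_key!r}")
    if layer not in LAYERS:
        raise ValueError(f"no checkpoint for layer {layer}; choose from {LAYERS}")
    return f"{model_key}_layer{layer}"


def all_subfolders() -> list[str]:
    """Enumerate every subfolder in the same order as the table."""
    return [subfolder_name(key, layer) for key in BASE_MODELS for layer in LAYERS]
```

For example, `subfolder_name("llama-3-8b", 12)` gives the layer-12 checkpoint used in the main experiments for the Skywork Llama reward model.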

## Usage

```python
from huggingface_hub import snapshot_download

# Download every checkpoint subfolder in the repository into ./checkpoints.
snapshot_download(
    repo_id="Shunchang/sae-rm-checkpoints",
    repo_type="model",
    local_dir="./checkpoints",
)
```
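
If only one checkpoint is needed, `snapshot_download` accepts an `allow_patterns` argument that filters the downloaded files by glob pattern. The following is a minimal sketch, assuming each checkpoint's files live under its subfolder as listed in the table; `subfolder_pattern` and `download_checkpoint` are hypothetical helper names.

```python
def subfolder_pattern(subfolder: str) -> str:
    """Glob pattern matching every file inside one checkpoint subfolder."""
    return f"{subfolder.strip('/')}/*"


def download_checkpoint(subfolder: str, local_dir: str = "./checkpoints") -> str:
    """Download a single subfolder of the repository and return the local path."""
    # Imported lazily so that subfolder_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=[subfolder_pattern(subfolder)],
    )


# Example (requires network access):
# download_checkpoint("llama-3-8b_layer12")
```

Files land in `local_dir/<subfolder>/`, so the paths used elsewhere in this README stay the same.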

Set the `SAE_CHECKPOINT` environment variable to the checkpoint subfolder you want to use before running detection or mitigation:

```bash
export SAE_CHECKPOINT=./checkpoints/llama-3-8b_layer12
```
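
If you want to read the same variable from your own Python code, a small check can catch an unset variable or a wrong path early. This is a sketch only: how the PISA scripts themselves read `SAE_CHECKPOINT` is not described in this README, and `resolve_checkpoint_dir` is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_checkpoint_dir(env=os.environ) -> Path:
    """Return the directory named by SAE_CHECKPOINT, failing early if it is unusable."""
    value = env.get("SAE_CHECKPOINT")
    if not value:
        raise RuntimeError("SAE_CHECKPOINT is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"SAE_CHECKPOINT does not point to a directory: {path}")
    return path
```

The `env` parameter defaults to the process environment and accepts any mapping, which makes the check easy to exercise in tests.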

Full reproduction instructions are available in the [GitHub repository](https://github.com/shunchang-liu/pisa).

## Training Details

- **Architecture**: Gated SAE ([Rajamanoharan et al., 2024](https://arxiv.org/abs/2404.16014))
- **SAE width**: 16,384 latents
- **Training data**: [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) (harmless split)
- **Context length**: 512 tokens
- **Training steps**: 4,000 (~16M tokens)
- **Optimizer**: Adam (lr=5e-5)
- **Sparsity coefficient (L1)**: 5
- **Library**: [SAELens](https://github.com/jbloomAus/SAELens)
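
For readers unfamiliar with the architecture, the forward pass of a Gated SAE as described by Rajamanoharan et al. (2024) can be sketched in a few lines. This is a dependency-free illustration on plain Python lists, not the SAELens implementation and not code from this repository. The parameter names (`W_enc`, `r_mag`, `b_gate`, `b_mag`, `W_dec`, `b_dec`) are chosen for the example.

```python
import math


def gated_sae_forward(x, W_enc, r_mag, b_gate, b_mag, W_dec, b_dec):
    """Encode and reconstruct one activation vector x with a Gated SAE.

    Shapes: x and b_dec have d_in entries; W_enc is d_in x d_sae;
    r_mag, b_gate, and b_mag have d_sae entries; W_dec is d_sae x d_in.
    """
    d_in, d_sae = len(x), len(b_gate)
    centered = [x[i] - b_dec[i] for i in range(d_in)]
    # Encoder projection shared by the gating and magnitude paths.
    proj = [sum(centered[i] * W_enc[i][j] for i in range(d_in)) for j in range(d_sae)]
    features = []
    for j in range(d_sae):
        # Gating path: decides whether latent j is active at all.
        gate_open = (proj[j] + b_gate[j]) > 0
        # Magnitude path: same projection, rescaled per latent by exp(r_mag).
        magnitude = max(0.0, proj[j] * math.exp(r_mag[j]) + b_mag[j])
        features.append(magnitude if gate_open else 0.0)
    # Linear decoder plus decoder bias.
    recon = [
        b_dec[i] + sum(features[j] * W_dec[j][i] for j in range(d_sae))
        for i in range(d_in)
    ]
    return features, recon


# Toy example: 2 input dimensions and 3 latents (the released SAEs use 16,384 latents).
features, recon = gated_sae_forward(
    x=[1.0, 2.0],
    W_enc=[[1.0, -1.0, 0.0], [0.0, 0.0, 1.0]],
    r_mag=[0.0, 0.0, 0.0],
    b_gate=[0.0, 0.0, -3.0],
    b_mag=[0.0, 0.0, 0.0],
    W_dec=[[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]],
    b_dec=[0.0, 0.0],
)
print(features, recon)
```

In the toy example the third latent has a positive magnitude but stays off because its gate pre-activation is negative. This shows how the architecture separates the decision of which latents fire from the estimate of how strongly they fire.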

## Citation

```bibtex
@article{liu2026preference,
  title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
  author={Liu, Shunchang and Chen, Xin and Urcelay, Belen Martin and Croce, Francesco},
  journal={arXiv preprint arXiv:2605.16339},
  year={2026}
}
```