---
license: mit
language:
- en
base_model:
- PKU-Alignment/beaver-7b-v2.0-reward
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
- Skywork/Skywork-Reward-V2-Qwen3-4B
- ethz-spylab/poisoned-reward-7b-SUDO-10
datasets:
- Anthropic/hh-rlhf
tags:
- sparse-autoencoder
- reward-model
- interpretability
- alignment
pipeline_tag: feature-extraction
---

# SAE Checkpoints for Preference Instability Detection and Mitigation

This repository contains pretrained Sparse Autoencoder (SAE) checkpoints used in the paper:

> **Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders**
> Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
> arXiv:2605.16339

[![arXiv](https://img.shields.io/badge/arXiv-2605.16339-b31b1b.svg)](https://arxiv.org/abs/2605.16339)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black?logo=github)](https://github.com/shunchang-liu/pisa)

## Checkpoints

Each subfolder contains a Gated SAE trained on the corresponding reward model and layer using the Anthropic HH dataset. Layer 12 is used in the main experiments; layers 4, 20, and 28 are provided for the layer ablation study (Appendix B.5).

| Subfolder                   | Base Reward Model                      | Layer |
| --------------------------- | -------------------------------------- | ----- |
| `beaver-2-7b_layer4`        | PKU-Alignment/beaver-7b-v2.0-reward    | 4     |
| `beaver-2-7b_layer12`       | PKU-Alignment/beaver-7b-v2.0-reward    | 12    |
| `beaver-2-7b_layer20`       | PKU-Alignment/beaver-7b-v2.0-reward    | 20    |
| `beaver-2-7b_layer28`       | PKU-Alignment/beaver-7b-v2.0-reward    | 28    |
| `llama-3-8b_layer4`         | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 4     |
| `llama-3-8b_layer12`        | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 12    |
| `llama-3-8b_layer20`        | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 20    |
| `llama-3-8b_layer28`        | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 28    |
| `qwen-3-4b_layer4`          | Skywork/Skywork-Reward-V2-Qwen3-4B     | 4     |
| `qwen-3-4b_layer12`         | Skywork/Skywork-Reward-V2-Qwen3-4B     | 12    |
| `qwen-3-4b_layer20`         | Skywork/Skywork-Reward-V2-Qwen3-4B     | 20    |
| `qwen-3-4b_layer28`         | Skywork/Skywork-Reward-V2-Qwen3-4B     | 28    |
| `llama-7b-poisoned_layer4`  | ethz-spylab/poisoned-reward-7b-SUDO-10 | 4     |
| `llama-7b-poisoned_layer12` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 12    |
| `llama-7b-poisoned_layer20` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 20    |
| `llama-7b-poisoned_layer28` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 28    |

## Usage

Download all checkpoints into a local directory:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Shunchang/sae-rm-checkpoints",
    repo_type="model",
    local_dir="./checkpoints"
)
```

Before running detection or mitigation, point the `SAE_CHECKPOINT` environment variable at the checkpoint subfolder you want to use:

```bash
export SAE_CHECKPOINT=./checkpoints/llama-3-8b_layer12
```

Full reproduction instructions are available in the [GitHub repository](https://github.com/shunchang-liu/pisa).

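If only one checkpoint is needed, the download can be narrowed to a single subfolder. The following is a minimal sketch, not part of the official reproduction scripts: the helper names `checkpoint_subfolder` and `download_checkpoint` are hypothetical, and the subfolder naming is assumed to follow the `<model>_layer<N>` pattern shown in the table above. It relies on the `allow_patterns` argument of `snapshot_download` to filter files.

```python
import os

# Subfolder prefixes and layers as listed in the checkpoint table.
MODEL_KEYS = ("beaver-2-7b", "llama-3-8b", "qwen-3-4b", "llama-7b-poisoned")
LAYERS = (4, 12, 20, 28)


def checkpoint_subfolder(model_key: str, layer: int = 12) -> str:
    """Build a subfolder name such as 'llama-3-8b_layer12' (hypothetical helper)."""
    if model_key not in MODEL_KEYS:
        raise ValueError(f"unknown model key {model_key!r}; expected one of {MODEL_KEYS}")
    if layer not in LAYERS:
        raise ValueError(f"no checkpoint for layer {layer}; expected one of {LAYERS}")
    return f"{model_key}_layer{layer}"


def download_checkpoint(model_key: str, layer: int = 12,
                        local_dir: str = "./checkpoints") -> str:
    """Download one subfolder and set SAE_CHECKPOINT for the current process."""
    # Imported here so the name helper above works without the dependency.
    from huggingface_hub import snapshot_download

    subfolder = checkpoint_subfolder(model_key, layer)
    snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        # Only fetch files under the chosen subfolder.
        allow_patterns=[f"{subfolder}/*"],
    )
    path = os.path.join(local_dir, subfolder)
    # Affects this Python process and its children only, not the parent shell.
    os.environ["SAE_CHECKPOINT"] = path
    return path
```

For example, `download_checkpoint("llama-3-8b")` would fetch only the layer 12 checkpoint used in the main experiments. Because the environment variable is set inside the Python process, a detection or mitigation script launched from a separate shell still needs the `export` command shown above.
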
## Training Details

- **Architecture**: Gated SAE ([Rajamanoharan et al., 2024](https://arxiv.org/abs/2404.16014))
- **SAE width**: 16,384
- **Training data**: [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) (harmless split)
- **Context length**: 512
- **Training steps**: 4,000 (~16M tokens)
- **Optimizer**: Adam (lr=5e-5)
- **Sparsity coefficient (L1)**: 5
- **Library**: [SAELens](https://github.com/jbloomAus/SAELens)

## Citation

```bibtex
@article{liu2026preference,
  title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
  author={Liu, Shunchang and Chen, Xin and Urcelay, Belen Martin and Croce, Francesco},
  journal={arXiv preprint arXiv:2605.16339},
  year={2026}
}
```