---
license: mit
language:
- en
base_model:
- PKU-Alignment/beaver-7b-v2.0-reward
- Skywork/Skywork-Reward-V2-Llama-3.1-8B
- Skywork/Skywork-Reward-V2-Qwen3-4B
- ethz-spylab/poisoned-reward-7b-SUDO-10
datasets:
- Anthropic/hh-rlhf
tags:
- sparse-autoencoder
- reward-model
- interpretability
- alignment
pipeline_tag: feature-extraction
---
# SAE Checkpoints for Preference Instability Detection and Mitigation
This repository contains pretrained Sparse Autoencoder (SAE) checkpoints used in the paper:
> **Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders**
> Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
> arXiv:2605.16339

[![arXiv](https://img.shields.io/badge/arXiv-2605.16339-b31b1b.svg)](https://arxiv.org/abs/2605.16339)
[![GitHub](https://img.shields.io/badge/GitHub-Code-black?logo=github)](https://github.com/shunchang-liu/pisa)
## Checkpoints
Each subfolder contains a Gated SAE trained on activations from the listed layer of the corresponding reward model, using the Anthropic HH dataset. Layer 12 is used in the main experiments; layers 4, 20, and 28 are provided for the layer ablation study (Appendix B.5).
| Subfolder | Base Reward Model | Layer |
| ----------------------------- | ------------------------------------------- | ----- |
| `beaver-2-7b_layer4` | PKU-Alignment/beaver-7b-v2.0-reward | 4 |
| `beaver-2-7b_layer12` | PKU-Alignment/beaver-7b-v2.0-reward | 12 |
| `beaver-2-7b_layer20` | PKU-Alignment/beaver-7b-v2.0-reward | 20 |
| `beaver-2-7b_layer28` | PKU-Alignment/beaver-7b-v2.0-reward | 28 |
| `llama-3-8b_layer4` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 4 |
| `llama-3-8b_layer12` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 12 |
| `llama-3-8b_layer20` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 20 |
| `llama-3-8b_layer28` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 28 |
| `qwen-3-4b_layer4` | Skywork/Skywork-Reward-V2-Qwen3-4B | 4 |
| `qwen-3-4b_layer12` | Skywork/Skywork-Reward-V2-Qwen3-4B | 12 |
| `qwen-3-4b_layer20` | Skywork/Skywork-Reward-V2-Qwen3-4B | 20 |
| `qwen-3-4b_layer28` | Skywork/Skywork-Reward-V2-Qwen3-4B | 28 |
| `llama-7b-poisoned_layer4` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 4 |
| `llama-7b-poisoned_layer12` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 12 |
| `llama-7b-poisoned_layer20` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 20 |
| `llama-7b-poisoned_layer28` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 28 |
## Usage
```python
from huggingface_hub import snapshot_download

# Download every checkpoint subfolder into ./checkpoints
snapshot_download(
    repo_id="Shunchang/sae-rm-checkpoints",
    repo_type="model",
    local_dir="./checkpoints",
)
```
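To fetch a single checkpoint instead of the whole repository, `snapshot_download` accepts an `allow_patterns` filter. The sketch below assumes the subfolder naming shown in the Checkpoints table; the helpers `subfolder_name` and `download_checkpoint` are hypothetical names introduced here for illustration and are not part of the released code.

```python
from pathlib import Path

# Model prefixes and layers as listed in the Checkpoints table.
MODEL_PREFIXES = ("beaver-2-7b", "llama-3-8b", "qwen-3-4b", "llama-7b-poisoned")
LAYERS = (4, 12, 20, 28)


def subfolder_name(model_prefix: str, layer: int) -> str:
    """Build a subfolder name such as 'llama-3-8b_layer12' (hypothetical helper)."""
    if model_prefix not in MODEL_PREFIXES:
        raise ValueError(f"unknown model prefix: {model_prefix!r}")
    if layer not in LAYERS:
        raise ValueError(f"no checkpoint listed for layer {layer}")
    return f"{model_prefix}_layer{layer}"


def download_checkpoint(model_prefix: str, layer: int,
                        local_dir: str = "./checkpoints") -> Path:
    """Download one subfolder only and return its local path (hypothetical helper)."""
    # Imported lazily so subfolder_name works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    name = subfolder_name(model_prefix, layer)
    root = snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=[f"{name}/*"],  # keep only files inside this subfolder
    )
    return Path(root) / name


# Example call (requires network access):
#   download_checkpoint("llama-3-8b", 12)
```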
Set the `SAE_CHECKPOINT` environment variable to the checkpoint subfolder you want to use before running detection or mitigation:
```bash
export SAE_CHECKPOINT=./checkpoints/llama-3-8b_layer12
```
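How the detection and mitigation scripts read this variable is defined in the GitHub repository. As an illustrative sketch only, a Python script could resolve and validate it as follows; the function name `resolve_checkpoint_dir` is hypothetical.

```python
import os
from pathlib import Path


def resolve_checkpoint_dir(env_var: str = "SAE_CHECKPOINT") -> Path:
    """Return the directory named by env_var, failing early with a clear message."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    path = Path(value).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {path}")
    return path
```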
Full reproduction instructions are available in the [GitHub repository](https://github.com/shunchang-liu/pisa).
## Training Details
- **Architecture**: Gated SAE ([Rajamanoharan et al., 2024](https://arxiv.org/abs/2404.16014))
- **SAE width**: 16,384
- **Training data**: [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) (harmless split)
- **Context length**: 512
- **Training steps**: 4,000 (~16M tokens)
- **Optimizer**: Adam (lr=5e-5)
- **Sparsity coefficient (L1)**: 5
- **Library**: [SAELens](https://github.com/jbloomAus/SAELens)
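
For readers unfamiliar with the architecture, the forward pass of a Gated SAE as formulated by Rajamanoharan et al. (2024) can be sketched in plain Python. This is a didactic sketch on tiny hand-written weights, not the SAELens implementation and not the on-disk checkpoint format; the function name and the argument layout (one row per feature) are assumptions made for illustration.

```python
import math


def gated_sae_forward(x, W_gate, b_gate, r_mag, b_mag, W_dec, b_dec):
    """Gated SAE forward pass for one activation vector.

    x:      input activation, length d_model
    W_gate: gate encoder weights, n_features rows of length d_model
    r_mag:  per-feature log-scale tying the magnitude path to W_gate
    W_dec:  decoder weights, n_features rows of length d_model
    Returns (features, reconstruction).
    """
    # Center the input with the decoder bias.
    xc = [xi - bi for xi, bi in zip(x, b_dec)]

    features = []
    for w_row, bg, r, bm in zip(W_gate, b_gate, r_mag, b_mag):
        pre = sum(w * v for w, v in zip(w_row, xc))
        # Gate path: binary decision on whether the feature is active.
        gate = 1.0 if pre + bg > 0 else 0.0
        # Magnitude path: shared weights rescaled by exp(r_mag), then ReLU.
        mag = max(0.0, math.exp(r) * pre + bm)
        features.append(gate * mag)

    # Linear decoder: weighted sum of feature directions plus bias.
    recon = [
        bd + sum(W_dec[j][i] * f for j, f in enumerate(features))
        for i, bd in enumerate(b_dec)
    ]
    return features, recon


feats, recon = gated_sae_forward(
    x=[1.0, 2.0],
    W_gate=[[1.0, 0.0], [0.0, -1.0]],
    b_gate=[0.0, 0.0],
    r_mag=[0.0, 0.0],
    b_mag=[0.5, 0.5],
    W_dec=[[1.0, 0.0], [0.0, 1.0]],
    b_dec=[0.0, 0.0],
)
print(feats, recon)
```

In this toy call the first feature passes its gate and the second is switched off, so both `feats` and `recon` come out as `[1.5, 0.0]`. Separating the gate from the magnitude estimate is what lets the sparsity penalty act on the gate path without shrinking the magnitudes of active features.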
## Citation
```bibtex
@article{liu2026preference,
  title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
  author={Liu, Shunchang and Chen, Xin and Urcelay, Belen Martin and Croce, Francesco},
  journal={arXiv preprint arXiv:2605.16339},
  year={2026}
}
```