SARM: Interpretable Reward Model via Sparse Autoencoder
This repository contains the model weights for the AAAI 2026 oral paper "Interpretable Reward Model via Sparse Autoencoder".
We release Llama-SARM-4B-PostSAEPretrain, which has an architecture identical to Llama-SARM-4B (a sketch of the forward pass follows the list below):
- Backbone: Initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
- SAE encoder: Initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
- SAE decoder: Not used in the current forward pass, but kept for potential future use.
- Score head: Initialized to all zeros and left untrained, for reproducibility and to facilitate interpretability and downstream customization.
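To make the data flow concrete, below is a minimal PyTorch sketch of the forward pass described above: backbone hidden states are encoded by a TopK SAE (latent size 65,536, Top-K = 192) and a zero-initialized linear score head maps the sparse latents to a scalar reward. This is an illustration under stated assumptions, not the repository's actual API; class names (`TopKSAE`, `SarmScorer`) and details such as the ReLU placement are hypothetical.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sketch of a TopK sparse autoencoder; the decoder is kept but unused, as in SARM."""
    def __init__(self, d_model=4096, n_latents=65536, k=192):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)  # retained for potential future use

    def encode(self, h):
        # Pre-activations over the full dictionary, then keep only the top-K entries.
        z = self.encoder(h)
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z)
        sparse.scatter_(-1, topk.indices, torch.relu(topk.values))
        return sparse  # (batch, n_latents), at most K nonzeros per row

class SarmScorer(nn.Module):
    """Sparse SAE latents -> scalar reward via a zero-initialized score head."""
    def __init__(self, n_latents=65536):
        super().__init__()
        self.score = nn.Linear(n_latents, 1, bias=False)
        nn.init.zeros_(self.score.weight)  # all-zero init, as described above

    def forward(self, sparse_latents):
        return self.score(sparse_latents).squeeze(-1)

# Toy usage: h stands in for a layer-16 hidden state of the truncated backbone.
sae, scorer = TopKSAE(), SarmScorer()
h = torch.randn(2, 4096)
reward = scorer(sae.encode(h))
print(reward.shape)  # torch.Size([2])
```

One consequence of the zero-initialized score head is that the reward is identically zero before training, so any learned reward signal must flow through the interpretable SAE latents.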
🔥 News
- [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
- [2025/12/11] Llama-SARM-4B is ranked 18th on the RewardBench 2 leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉
🔗 Links
Authors
Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†
Code Repository: https://github.com/schrieffer-z/sarm
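As a hedged loading example with 🤗 Transformers: the checkpoint id below and the `AutoModelForSequenceClassification` + `trust_remote_code` entry point are assumptions for illustration; see the code repository above for the authoritative loading code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint id; substitute the actual Hugging Face repo path.
model_id = "schrieffer-z/Llama-SARM-4B-PostSAEPretrain"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumed: SARM ships custom modeling code
)

# Score a single (prompt, response) pair; the reward is the scalar logit.
conversation = [
    {"role": "user", "content": "Explain sparse autoencoders in one sentence."},
    {"role": "assistant", "content": "They learn an overcomplete sparse code of activations."},
]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.4f}")
```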
📧 Contact
If you have any questions, please feel free to reach us at shuyizhang@mail.ustc.edu.cn.
📝 Citation
If you find our work useful, please cite it as follows:
```bibtex
@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}
```