SARM: Interpretable Reward Model via Sparse Autoencoder

This repository contains the model weights for the AAAI 2026 oral paper "Interpretable Reward Model via Sparse Autoencoder".

We release Llama-SARM-4B-PostSAEPretrain, whose architecture is identical to that of Llama-SARM-4B:

  • Backbone: Initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
  • SAE encoder: Initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
  • SAE decoder: Not used in the current forward pass, but kept for potential future use.
  • Score head: Left untrained for reproducibility, and initialized to all zeros to facilitate interpretability and downstream customization (see the sketch after this list).
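
To make the architecture concrete, here is a minimal PyTorch sketch of the SAE-encoder-plus-score-head portion of the forward pass. It is an illustration under stated assumptions, not the released implementation: the module names, the ReLU before the Top-K selection, and the bias-free score head are our own choices; only the latent size (65,536), Top-K (192), and the zero-initialized score head come from the description above.

```python
import torch
import torch.nn as nn

class TopKSAEEncoder(nn.Module):
    """SAE encoder: hidden state -> sparse latent code (only the k largest activations kept)."""
    def __init__(self, d_model: int, d_latent: int = 65536, k: int = 192):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_latent)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(h))                # pre-activations (the ReLU is an assumption here)
        vals, idx = torch.topk(z, self.k, dim=-1)  # values and indices of the k largest latents
        sparse = torch.zeros_like(z)
        return sparse.scatter(-1, idx, vals)       # zero out everything outside the Top-K

class ZeroInitScoreHead(nn.Module):
    """Linear score head over the sparse latents; zero-initialized as in this release."""
    def __init__(self, d_latent: int = 65536):
        super().__init__()
        self.score = nn.Linear(d_latent, 1, bias=False)
        nn.init.zeros_(self.score.weight)          # untrained, all-zero weights

    def forward(self, z_sparse: torch.Tensor) -> torch.Tensor:
        return self.score(z_sparse).squeeze(-1)    # one scalar reward per input

# Toy usage with random "hidden states"; 4096 is Llama-3.1-8B's hidden size.
h = torch.randn(2, 4096)  # e.g. last-token hidden states from the 16-layer backbone
reward = ZeroInitScoreHead()(TopKSAEEncoder(4096)(h))
print(reward.shape)       # torch.Size([2]); all zeros until the score head is trained
```

Because the score head is linear over the sparse code, any reward the model produces is a weighted sum of at most 192 active SAE features, which is what makes each score directly attributable to individual latents.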

🔥 News

  • [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
  • [2025/12/11] Llama-SARM-4B is ranked 18th on the Reward Bench 2 leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉

🔗 Links

📧 Contact

If you have any questions, please feel free to reach us at shuyizhang@mail.ustc.edu.cn.

📚 Citation

If you find our work useful, please cite it as follows.

@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}