zhangshuyi.0109 committed · Commit 5c70410 · Parent(s): d1505a6 · update news & citation

README.md (CHANGED)
library_name: transformers
pipeline_tag: feature-extraction
---

# SARM: Interpretable Reward Model via Sparse Autoencoder

This repository contains the model weights of the AAAI 2026 oral paper "*Interpretable Reward Model via Sparse Autoencoder*".

We release **Llama-SARM-4B-PostSAEPretrain**, which has an architecture identical to Llama-SARM-4B (sketched in code below the list):

- **Backbone:** Initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
- **SAE encoder:** Initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
- **SAE decoder:** Not used in the current forward pass, but kept for potential future use.
- **Score head:** Left untrained for reproducibility and initialized to all zeros to facilitate interpretability and downstream customization.
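
To make the architecture concrete, here is a minimal PyTorch sketch of the scoring path the list above describes. It is our illustration, not the repository's actual module code: the class and attribute names are invented, and the real implementation lives in the [code repository](https://github.com/schrieffer-z/sarm).

```python
import torch
import torch.nn as nn

# Illustrative sketch only — names and details are ours, not the repo's.
# Card dimensions: hidden size 4096 (Llama-3.1-8B), latent size 65,536, Top-K = 192.

class TopKSAEEncoder(nn.Module):
    """Encodes a hidden state into sparse features, keeping only the k largest latents."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(h))                  # non-negative latent activations
        vals, idx = torch.topk(z, self.k, dim=-1)    # keep the k strongest features
        return torch.zeros_like(z).scatter(-1, idx, vals)

class SARMHeadSketch(nn.Module):
    """TopK SAE encoder + zero-initialized linear score head over backbone states."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.sae_encoder = TopKSAEEncoder(d_model, d_latent, k)
        self.score = nn.Linear(d_latent, 1, bias=False)
        nn.init.zeros_(self.score.weight)            # all-zero init, as in this release

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        features = self.sae_encoder(last_hidden)     # sparse, interpretable features
        return self.score(features).squeeze(-1)      # scalar reward per sequence

# Tiny dimensions for a quick check (the real model uses 4096 / 65,536 / 192).
head = SARMHeadSketch(d_model=8, d_latent=64, k=4)
print(head(torch.randn(2, 8)))  # tensor([0., 0.]) — zero head gives zero reward until trained
```

Because the score head is all zeros, this checkpoint assigns reward 0 to every input; training (or loading) the head is left to the user, which is what "left untrained for reproducibility" means above.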

## 🔥 News

- [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
- [2025/12/11] Llama-SARM-4B is ranked 18th on the [Reward Bench 2](https://huggingface.co/spaces/allenai/reward-bench) leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉

## 🔗 Links

**Authors**

Shuyi Zhang\*, Wei Shi\*, Sihang Li\*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†

**Paper:** [Interpretable Reward Model via Sparse Autoencoder](https://arxiv.org/abs/2508.08746)

**Code Repository:** [https://github.com/schrieffer-z/sarm](https://github.com/schrieffer-z/sarm)

**Demo:** [Try the SARM demo on Hugging Face Spaces](https://huggingface.co/spaces/Schrieffer/SARM-Demo)
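
To load the weights locally instead of using the demo, something like the following should work. The repository id and the `trust_remote_code` entry point are our assumptions; defer to the code repository above for the official usage.

```python
# Hypothetical loading sketch — the repo id is assumed; verify it on the Hub,
# and see https://github.com/schrieffer-z/sarm for the official entry point.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "Schrieffer/Llama-SARM-4B-PostSAEPretrain"  # assumption, not confirmed

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model.eval()

# With this release's zero-initialized score head, any reward computed here
# will be 0.0 until the head is trained or loaded from a trained checkpoint.
inputs = tokenizer("Is this response helpful?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```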

## 📧 Contact

If you have any questions, please feel free to reach us at `shuyizhang@mail.ustc.edu.cn`.

## 📚 Citation

If you find our work useful, please cite it as follows:

```bibtex
@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}
```