zhangshuyi.0109 committed
Commit 5c70410 · 1 Parent(s): d1505a6

update news & citation

Files changed (1): README.md (+33, -7)
README.md CHANGED
@@ -9,19 +9,45 @@ library_name: transformers
  pipeline_tag: feature-extraction
  ---

- We release Llama-SARM-4B with SAE weights, with the score head left untrained for reproducibility and score head weights are initialized to all zero for interpretability.
  # SARM: Interpretable Reward Model via Sparse Autoencoder

- + **Authors** (\* indicates equal contribution)

- Shuyi Zhang\*, Wei Shi\*, Sihang Li\*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang

- + **Paper**: [Interpretable Reward Model via Sparse Autoencoder](https://arxiv.org/abs/2508.08746)

- + **Model**: [Schrieffer/Llama-SARM-4B](https://huggingface.co/Schrieffer/Llama-SARM-4B)

- + Finetuned from model: [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

  + **Code Repository:** [https://github.com/schrieffer-z/sarm](https://github.com/schrieffer-z/sarm)

- + **Demo:** [Try SARM Demo in Huggingface Space](https://huggingface.co/spaces/Schrieffer/SARM-Demo)
+ This repository contains the model weights for the AAAI 2026 oral paper "*Interpretable Reward Model via Sparse Autoencoder*".

+ We release **Llama-SARM-4B-PostSAEPretrain**, whose architecture is identical to that of Llama-SARM-4B (a minimal sketch of the scoring path appears after this list):
+ - **Backbone:** initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
+ - **SAE encoder:** initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
+ - **SAE decoder:** not used in the current forward pass, but kept for potential future use.
+ - **Score head:** left untrained for reproducibility and initialized to all zeros to facilitate interpretability and downstream customization.
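As context for the bullets above, here is a minimal PyTorch sketch of that scoring path (backbone hidden state → TopK SAE encoder → zero-initialized score head). The class names, wiring, and final-token pooling are illustrative assumptions; the authoritative implementation is in the code repository linked below.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the SARM scoring path described above; names and
# pooling are assumptions, not the authors' actual module layout.
class TopKSAEEncoder(nn.Module):
    """TopK SAE encoder: keep only the k largest latent activations."""
    def __init__(self, d_model: int = 4096, d_latent: int = 65536, k: int = 192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.enc(h)
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z)
        sparse.scatter_(-1, topk.indices, topk.values)  # all other latents stay 0
        return sparse

class ScoreHead(nn.Module):
    """Sparse latents -> scalar reward; zero-initialized, as stated above."""
    def __init__(self, d_latent: int = 65536):
        super().__init__()
        self.proj = nn.Linear(d_latent, 1, bias=False)
        nn.init.zeros_(self.proj.weight)  # all-zero score head

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.proj(latents).squeeze(-1)

# h stands in for the last hidden state of the 16-layer Llama backbone:
# (batch, seq_len, hidden_size); hidden_size is 4096 for Llama-3.1-8B layers.
h = torch.randn(2, 8, 4096)
reward = ScoreHead()(TopKSAEEncoder()(h))[:, -1]  # read the score at the last token
print(reward.shape)  # torch.Size([2])
```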

+ ## 🔥 News
+ - [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
+ - [2025/12/11] Llama-SARM-4B is ranked 18th on the [Reward Bench 2](https://huggingface.co/spaces/allenai/reward-bench) leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉
+ ## 🔗 Links
+ + **Authors** (\* indicates equal contribution, † corresponding author)

+ Shuyi Zhang\*, Wei Shi\*, Sihang Li\*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†

+ + **Paper**: [Interpretable Reward Model via Sparse Autoencoder](https://arxiv.org/abs/2508.08746)
  + **Code Repository:** [https://github.com/schrieffer-z/sarm](https://github.com/schrieffer-z/sarm)

+ + **Demo:** [Try the SARM demo in a Hugging Face Space](https://huggingface.co/spaces/Schrieffer/SARM-Demo)

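Since the model card declares `library_name: transformers` with `pipeline_tag: feature-extraction`, loading the released checkpoint should look roughly like the sketch below. The repo id is taken from the README's model link, while `trust_remote_code=True` and the output fields are assumptions; see the demo Space and code repository for the exact interface.

```python
# Hypothetical usage sketch. The repo id comes from the README's model link;
# trust_remote_code=True and the output fields are assumptions, since SARM
# adds a custom SAE + score head on top of a standard transformers backbone.
from transformers import AutoModel, AutoTokenizer

repo_id = "Schrieffer/Llama-SARM-4B"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

inputs = tokenizer("Is this response helpful?", return_tensors="pt")
outputs = model(**inputs)  # feature-extraction output: SAE latents / reward score
```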

+ ## 📧 Contact

+ If you have any questions, please feel free to reach us at `shuyizhang@mail.ustc.edu.cn`.

+ ## 📚 Citation

+ If you find our work useful, please cite it as follows.

+ ```bibtex
+ @article{zhang2025interpretable,
+   title={Interpretable Reward Model via Sparse Autoencoder},
+   author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
+   journal={arXiv preprint arXiv:2508.08746},
+   year={2025}
+ }
+ ```