zhangshuyi.0109 committed · Commit 5c70410 · Parent(s): d1505a6 · update news & citation

README.md (CHANGED)
library_name: transformers
pipeline_tag: feature-extraction
---

# SARM: Interpretable Reward Model via Sparse Autoencoder

This repository contains the model weights of the AAAI 2026 oral paper "*Interpretable Reward Model via Sparse Autoencoder*".

We release **Llama-SARM-4B-PostSAEPretrain**, which has an architecture identical to Llama-SARM-4B (sketched in code below the list):

- **Backbone:** Initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
- **SAE encoder:** Initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
- **SAE decoder:** Not used in the current forward pass, but kept for potential future use.
- **Score head:** Left untrained for reproducibility and initialized to all zeros to facilitate interpretability and downstream customization.
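
To make the architecture concrete, here is a minimal PyTorch sketch of the scoring path the list above describes. It is our illustration, not the repository's actual module code: the class and attribute names are invented, and the real implementation lives in the [code repository](https://github.com/schrieffer-z/sarm).

```python
import torch
import torch.nn as nn

# Illustrative sketch only — names and details are ours, not the repo's.
# Card dimensions: hidden size 4096 (Llama-3.1-8B), latent size 65,536, Top-K = 192.

class TopKSAEEncoder(nn.Module):
    """Encodes a hidden state into sparse features, keeping only the k largest latents."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.k = k

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(h))                  # non-negative latent activations
        vals, idx = torch.topk(z, self.k, dim=-1)    # keep the k strongest features
        return torch.zeros_like(z).scatter(-1, idx, vals)

class SARMHeadSketch(nn.Module):
    """TopK SAE encoder + zero-initialized linear score head over backbone states."""

    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.sae_encoder = TopKSAEEncoder(d_model, d_latent, k)
        self.score = nn.Linear(d_latent, 1, bias=False)
        nn.init.zeros_(self.score.weight)            # all-zero init, as in this release

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        features = self.sae_encoder(last_hidden)     # sparse, interpretable features
        return self.score(features).squeeze(-1)      # scalar reward per sequence

# Tiny dimensions for a quick check (the real model uses 4096 / 65,536 / 192).
head = SARMHeadSketch(d_model=8, d_latent=64, k=4)
print(head(torch.randn(2, 8)))  # tensor([0., 0.]) — zero head gives zero reward until trained
```

Because the score head is all zeros, this checkpoint assigns reward 0 to every input; training (or loading) the head is left to the user, which is what "left untrained for reproducibility" means above.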

## 🔥 News

- [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
- [2025/12/11] Llama-SARM-4B is ranked 18th on the [Reward Bench 2](https://huggingface.co/spaces/allenai/reward-bench) leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉

## 🔗 Links

**Authors**

Shuyi Zhang\*, Wei Shi\*, Sihang Li\*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†

**Paper:** [Interpretable Reward Model via Sparse Autoencoder](https://arxiv.org/abs/2508.08746)

**Code Repository:** [https://github.com/schrieffer-z/sarm](https://github.com/schrieffer-z/sarm)

**Demo:** [Try the SARM demo on Hugging Face Spaces](https://huggingface.co/spaces/Schrieffer/SARM-Demo)
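
To load the weights locally instead of using the demo, something like the following should work. The repository id and the `trust_remote_code` entry point are our assumptions; defer to the code repository above for the official usage.

```python
# Hypothetical loading sketch — the repo id is assumed; verify it on the Hub,
# and see https://github.com/schrieffer-z/sarm for the official entry point.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "Schrieffer/Llama-SARM-4B-PostSAEPretrain"  # assumption, not confirmed

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
model.eval()

# With this release's zero-initialized score head, any reward computed here
# will be 0.0 until the head is trained or loaded from a trained checkpoint.
inputs = tokenizer("Is this response helpful?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```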

## 📧 Contact

If you have any questions, please feel free to reach us at `shuyizhang@mail.ustc.edu.cn`.

## 📚 Citation

If you find our work useful, please cite it as follows:

```bibtex
@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}
```