Add paper link, fix repository URL, and update metadata
#1
opened by nielsr (HF Staff)
README.md CHANGED

````diff
@@ -1,19 +1,25 @@
 ---
-license: apache-2.0
 base_model:
 - Qwen/Qwen3-8B
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-classification
+datasets:
+- THU-KEG/WildFB
 tags:
 - reward-model
 - rlhf
 - dpo
 - alignment
 - wildchat
+arxiv: 2602.08829
 ---
 
 # WildReward
 
-WildReward is a reward model
+WildReward is a reward model presented in the paper [WildReward: Learning Reward Models from In-the-Wild Human Interactions](https://huggingface.co/papers/2602.08829).
+
+It is trained on in-the-wild human-LLM interactions from the WildChat dataset. Unlike conventional reward models that rely on expensive human-annotated preference pairs, WildReward extracts implicit reward signals from real-world user feedback through an automated pipeline.
 
 ## Model Details
 
@@ -96,12 +102,6 @@ with torch.no_grad():
 print(f"Reward score: {reward:.2f} (scale: 1-5)")
 ```
 
-
-**Architecture:**
-- Router on port 9000 with round-robin load balancing
-- Multiple workers on dedicated GPUs (ports 8004-8007)
-- FP16 inference with batch processing
-
 ## Performance
 
 WildReward achieves competitive results on standard reward model benchmarks while demonstrating superior calibration properties. When applied to Online DPO, it significantly improves performance in mathematical reasoning, instruction following, and creative writing tasks.
@@ -109,6 +109,12 @@ WildReward achieves competitive results on standard reward model benchmarks whil
 ## Citation
 
 ```bibtex
+@article{peng2026wildreward,
+  title={WildReward: Learning Reward Models from In-the-Wild Human Interactions},
+  author={Peng, Hao and Qi, Yunjia and Wang, Xiaozhi and Yao, Zijun and Hou, Lei and Li, Juanzi},
+  journal={arXiv preprint arXiv:2602.08829},
+  year={2026}
+}
 ```
 
 ## License
@@ -117,4 +123,4 @@ Apache License 2.0
 
 ---
 
-**Note:** This model card provides a brief overview. For detailed documentation on data collection, training, and deployment, please visit the [GitHub repository](https://github.com/
+**Note:** This model card provides a brief overview. For detailed documentation on data collection, training, and deployment, please visit the [GitHub repository](https://github.com/THU-KEG/WildReward).
````
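
The card's Performance section says the model's pointwise scores (the 1-5 scale in its snippet) are applied to Online DPO. One common way such scores drive DPO is by ranking sampled responses into (chosen, rejected) pairs; a minimal sketch of that recipe follows. The helper and its `margin` parameter are hypothetical illustrations, not code from the WildReward repository.

```python
# Hypothetical helper: turn pointwise reward scores (1-5 scale, as in the
# card's snippet) into a (chosen, rejected) pair for Online DPO training.
# This is a sketch of the general recipe, not the authors' implementation.

def make_preference_pair(scored_responses, margin=0.5):
    """scored_responses: list of (response_text, reward_score) tuples.

    Returns (chosen, rejected) from the best- and worst-scored responses,
    or None when the score gap is too small to be a reliable signal.
    """
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    (chosen, best), (rejected, worst) = ranked[0], ranked[-1]
    if best - worst < margin:
        return None  # scores too close; skip this prompt
    return chosen, rejected

# Example: four sampled responses scored by a reward model
candidates = [("draft A", 3.1), ("draft B", 4.6), ("draft C", 2.0), ("draft D", 3.8)]
print(make_preference_pair(candidates))  # -> ('draft B', 'draft C')
```

Filtering near-ties (the `margin` check) is a judgment call; an online DPO loop might instead keep every pair and rely on the score gap implicitly.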