caiyuchen
/

DAPO-step-21

@@ -18,8 +18,7 @@ base_model:
 ## 🔧 Prompt Format (Chat Template)
 During RL training and inference, each question is formatted as:
-{{question}}
-Please reason step by step, and put your final answer within boxed{{}}
 Then wrapped using the chat template:
@@ -55,3 +54,21 @@ inputs = tokenizer(prompt, return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))

 ## 🔧 Prompt Format (Chat Template)
 During RL training and inference, each question is formatted as:
+{question} Please reason step by step, and put your final answer within boxed{}.
 Then wrapped using the chat template:
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+## 📎 Reference
+If you find this model useful, please consider citing our paper:
+🔗 **Paper Link**: https://huggingface.co/papers/2510.00553
+```bibtex
+@misc{cai2025predictabilityreinforcementlearningdynamics,
+      title={On Predictability of Reinforcement Learning Dynamics for Large Language Models},
+      author={Yuchen Cai and Ding Cao and Xin Xu and Zijun Yao and Yuqing Huang and Zhenyu Tan and Benyi Zhang and Guiquan Liu and Junfeng Fang},
+      year={2025},
+      eprint={2510.00553},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2510.00553},
+}