Commit a546ca4 (verified) by Elliott · 1 parent: 40ba96d

Update README.md

Files changed (1): README.md (+16 -2)
@@ -1,6 +1,20 @@
 ---
-{}
+license: mit
 ---
 The base Qwen2.5-Math-7B model used by LUFFY.
 We change to rope_theta from 10000 to 40000 and extend the context window to 16k.
-Also, we modify the chat_template for the system prompt and add <think>.
+Also, we modify the chat_template for the system prompt and add <think>.
+
+# Citation
+If you find our model, data, or evaluation code useful, please kindly cite our paper:
+```bib
+@misc{luffy,
+      title={Learning to Reason under Off-Policy Guidance},
+      author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
+      year={2025},
+      eprint={2504.14945},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2504.14945},
+}
+```
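
The `rope_theta` and context-window change described in the README can be pictured as a small patch to the model's config. The following is a minimal sketch under stated assumptions, not the authors' actual procedure: the field names `rope_theta` and `max_position_embeddings` follow the Hugging Face Qwen2 config convention, reading "16k" as 16384 tokens is an assumption, and the helper `patch_rope_config` and the base value of 4096 positions are hypothetical, used only for illustration.

```python
import json


def patch_rope_config(config: dict,
                      rope_theta: float = 40000.0,
                      max_positions: int = 16384) -> dict:
    """Return a copy of `config` with the RoPE base and context length updated.

    Hypothetical helper: the defaults mirror the README's description
    (rope_theta 10000 -> 40000, context window -> 16k, assumed to mean 16384).
    """
    patched = dict(config)  # shallow copy so the caller's dict is not mutated
    patched["rope_theta"] = rope_theta
    patched["max_position_embeddings"] = max_positions
    return patched


# Illustrative stand-in for the relevant fields of a base config.json.
# The 4096 value is an assumption, not taken from the README.
base = {"rope_theta": 10000.0, "max_position_embeddings": 4096}

patched = patch_rope_config(base)
print(json.dumps(patched, indent=2))
```

In practice the same two fields would be edited in the checkpoint's `config.json` (or set on the loaded config object) before training or inference; the chat template change mentioned in the README is separate and is not covered by this sketch.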