---
license: mit
library_name: transformers
pipeline_tag: text-generation
---
The base Qwen2.5-Math-7B model used by LUFFY, described in [Learning to Reason under Off-Policy Guidance](https://huggingface.co/papers/2504.14945).

We change `rope_theta` from 10000 to 40000 and extend the context window to 16k tokens. We also modify the `chat_template` to adjust the system prompt and add `<think>`.

Github: https://github.com/ElliottYan/LUFFY
# Citation

If you find our model, data, or evaluation code useful, please kindly cite our paper:

```bibtex
@misc{luffy,
  title={Learning to Reason under Off-Policy Guidance},
  author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
  year={2025},
  eprint={2504.14945},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.14945},
}
```