Papers
arxiv:2601.14716

PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

Published on Jan 21
Authors:
,
,
,
,
,
,

Abstract

A 32-billion-parameter large language model for mathematical reasoning is developed using supervised fine-tuning and a novel offline reinforcement learning method, achieving state-of-the-art performance on advanced math competitions.

AI-generated summary

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.

Community

Sign up or log in to comment

Models citing this paper 1

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.14716 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.