PCL-Reasoner committed
Commit d4f05ea · verified · 1 Parent(s): d85a88a

Update README.md

Files changed (1):
  1. README.md (+7 -10)
README.md CHANGED
@@ -35,13 +35,10 @@ model-index:
  # **PCL-Reasoner-V1.5**

  ## Model Overview
- We release **PCL-Reasoner-V1.5**, a 32B reasoning model built upon **PCL-Reasoner-V1** and further enhanced through **offline reinforcement learning** method on the **vllm-ascend** and **MindSpeed-LLM framework** with **Ascend hardware acceleration**. Building on the strong foundation of PCL-Reasoner-V1, PCL-Reasoner-V1.5 achieves even greater improvement in complex mathematical reasoning with long chains of thought (CoT), demonstrating state-of-the-art performance among 32B-scale models.
-
- PCL-Reasoner-V1.5 attains **90.9% on AIME 2024** and **85.7% on AIME 2025**, significantly outperforming prior 32B-class models and closing the gap with much larger systems. This advancement stems from refined data curation, improved contamination filtering, and optimized training dynamics tailored for deep reasoning tasks.
-
+ We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.
  ![Evaluation Results](images/benchmark.png)

- We have fully open-sourced the **model weights**, **dataset**, and **training code** to foster transparency, reproducibility, and community innovation. Follow the tutorial below to deploy, evaluate, or extend PCL-Reasoner-V1.5 in your own research!
+

  ## Code

@@ -142,8 +139,8 @@ All results are reported using the **Avg@32 metric** (average accuracy over 32 i
  </tr>
  <tr>
  <td>PCL-Reasoner-v1</td>
- <td><p style="font-weight:grey;">85.7</p></td>
- <td><p style="font-weight:grey;">84.2</p></td>
+ <td><p style="color:grey">85.7</p></td>
+ <td><p style="color:grey">84.2</p></td>
  </tr>
  <tr>
  <td>PCL-Reasoner-v1.5</td>
@@ -158,9 +155,9 @@ All results are reported using the **Avg@32 metric** (average accuracy over 32 i

  ```bibtex
  @article{PCL-Reasoner-v1.5,
- title={PCL-Reasoner-v1.5: A Math Problem Solver with Chain of Thought Reasoning},
- author={Yao Lu, Deng Dong Fan, Jianzheng Nie, et al.},
- journal={arXiv preprint arXiv:2405.14524},
+ title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
+ author={Yao Lu and Dengdong Fan and Jianzheng Nie and Fan Xu and Jie Chen and Bin Zhou and Yonghong Tian},
+ journal={arXiv preprint arXiv:2601.14716},
  year={2026}
  }
  ```
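
The hunk context above references the **Avg@32 metric** (average accuracy over 32 independent samples per problem). As a minimal illustration of how such a metric is computed — not the project's actual evaluation code; the function and variable names here are hypothetical — it can be sketched as:

```python
def avg_at_k(sample_correctness: list[list[bool]]) -> float:
    """Avg@k: for each problem, average correctness over its k independent
    samples, then average across problems (the README uses k = 32)."""
    per_problem = [sum(samples) / len(samples) for samples in sample_correctness]
    return sum(per_problem) / len(per_problem)

# Two problems with 4 samples each (k = 4 for brevity):
runs = [
    [True, True, False, True],    # problem 1: 3/4 correct
    [True, False, False, False],  # problem 2: 1/4 correct
]
print(avg_at_k(runs))  # 0.5
```

Averaging per problem first, then across problems, keeps each problem weighted equally even if sample counts differ.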