---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
datasets:
- PCL-Reasoner/V1.5-RL-Math
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-32B
tags:
- math
model-index:
- name: PCL-Reasoner/V1.5
  results:
  - task:
      type: text-generation
    dataset:
      name: Aime24
      type: Aime24
    metrics:
    - name: Aime24
      type: Aime24
      value: 90.9
    - name: Aime25
      type: Aime25
      value: 85.6
---
# **PCL-Reasoner-V1.5**
## Model Overview
We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built on Qwen2.5-32B and refined with supervised fine-tuning (SFT) followed by reinforcement learning (RL). Its central contribution is our proposed offline RL method, which trains more stably and efficiently than standard online RL methods such as GRPO. The model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. These results demonstrate that offline RL is a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Training and evaluation both use FP16 precision to maintain numerical accuracy.
![Evaluation Results](images/benchmark.png)
## Code
[GitHub Repository](https://github.com/PCL-Reasoner/V1.5)
## RL Dataset
[Hugging Face Dataset](https://huggingface.co/datasets/PCL-Reasoner/V1.5-RL-Math)
## Evaluation
All results are reported using the **pass@1 metric**, averaged over 32 independently sampled responses per problem, to ensure a robust and fair comparison.
<!-- Table base styling (optional) -->
<style>
table { border-collapse: collapse; width: 100%; margin-left: auto;margin-right: auto;}
th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }
</style>
<!-- Table content -->
<table>
<tr>
<th>Model Scale</th>
<th>Model</th>
<th>AIME 24</th>
<th>AIME 25</th>
</tr>
<!-- Merged row header >100B -->
<tr>
<th rowspan="6">&gt;100B</th>
</tr>
<!-- >100B data rows -->
<tr>
<td>DeepSeek-R1</td>
<td><span style="color:grey">79.8</span></td>
<td><span style="color:grey">70</span></td>
</tr>
<tr>
<td>DeepSeek-R1-0528</td>
<td><span style="color:grey">91.4</span></td>
<td><span style="color:grey">87.5</span></td>
</tr>
<tr>
<td>Qwen3-235B-A22B</td>
<td><span style="color:grey">85.7</span></td>
<td><span style="color:grey">81.5</span></td>
</tr>
<tr>
<td>OpenAI-o3</td>
<td><span style="font-weight: bold;">91.6</span></td>
<td><span style="font-weight: bold;">88.9</span></td>
</tr>
<tr>
<td>Gemini-2.5-Pro-0506</td>
<td><span style="color:grey">90.8</span></td>
<td><span style="color:grey">83</span></td>
</tr>
<!-- Separator row -->
<tr>
<td colspan="4"></td>
</tr>
<!-- Merged row header 32B -->
<tr>
<th rowspan="9">32B</th>
</tr>
<!-- 32B data rows -->
<tr>
<td>Qwen3-32B</td>
<td><span style="color:grey">81.4</span></td>
<td><span style="color:grey">72.9</span></td>
</tr>
<tr>
<td>QwQ-32B</td>
<td><span style="color:grey">79.5</span></td>
<td><span style="color:grey">69.5</span></td>
</tr>
<tr>
<td>DeepSeek-R1-Distill-Qwen-32B</td>
<td><span style="color:grey">72.6</span></td>
<td><span style="color:grey">49.6</span></td>
</tr>
<tr>
<td>Skywork-OR1-32B</td>
<td><span style="color:grey">82.2</span></td>
<td><span style="color:grey">73.3</span></td>
</tr>
<tr>
<td>AM-Thinking-v1</td>
<td><span style="color:grey">85.3</span></td>
<td><span style="color:grey">74.4</span></td>
</tr>
<tr>
<td>OpenReasoning-Nemotron-32B</td>
<td><span style="color:grey">89.2</span></td>
<td><span style="color:grey">84.2</span></td>
</tr>
<tr>
<td>PCL-Reasoner-v1</td>
<td><span style="color:grey">85.7</span></td>
<td><span style="color:grey">84.2</span></td>
</tr>
<tr>
<td>PCL-Reasoner-v1.5</td>
<td><span style="font-weight: bold;">90.9</span></td>
<td><span style="font-weight: bold;">85.6</span></td>
</tr>
</table>
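
To make the reported metric concrete, here is a minimal sketch of how pass@1 averaged over several samples per problem can be computed. It is an illustration only, not the project's evaluation code: the function names `pass_at_1` and `benchmark_accuracy` and the toy data are hypothetical, and the example uses 4 samples per problem where the results above use 32.

```python
from statistics import mean


def pass_at_1(correct_flags):
    """Per-problem pass@1: the fraction of sampled responses that are correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(bool(flag) for flag in correct_flags) / len(correct_flags)


def benchmark_accuracy(per_problem_flags):
    """Average the per-problem pass@1 values and express the result as a percentage."""
    return 100.0 * mean(pass_at_1(flags) for flags in per_problem_flags)


# Hypothetical toy data: 2 problems, 4 sampled responses each.
# Each flag records whether one sampled response gave the correct final answer.
toy_results = [
    [True, True, False, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]

score = benchmark_accuracy(toy_results)
print(f"average pass@1: {score:.1f}%")
```

Averaging over many samples per problem reduces the variance that a single sampled response would introduce, which matters on small benchmarks such as AIME.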
## Citation
```bibtex
@article{PCL-Reasoner-v1.5,
title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
author={Yao Lu and Dengdong Fan and Jianzheng Nie and Fan Xu and Jie Chen and Bin Zhou and Yonghong Tian},
journal={arXiv preprint arXiv:2601.14716},
year={2026}
}
```