---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
datasets:
- PCL-Reasoner/V1.5-RL-Math
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-32B
tags:
- math
model-index:
- name: PCL-Reasoner/V1.5
  results:
  - task:
      type: text-generation
    dataset:
      name: Aime24
      type: Aime24
    metrics:
    - name: Aime24
      type: Aime24
      value: 90.9
    - name: Aime25
      type: Aime25
      value: 85.6
---

# **PCL-Reasoner-V1.5**

## Model Overview

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built on Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central contribution is our proposed offline RL method, which provides better training stability and efficiency than standard online RL methods such as GRPO. Among models post-trained from Qwen2.5-32B, it achieves state-of-the-art performance, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. These results indicate that offline RL is a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Both training and evaluation use FP16 precision to maintain numerical accuracy.



## Code

[GitHub Repository](https://github.com/PCL-Reasoner/V1.5)

## RL Dataset

[Hugging Face Dataset](https://huggingface.co/datasets/PCL-Reasoner/V1.5-RL-Math)

## Evaluation

All results are reported using the **pass@1 metric**, averaged over 32 independent sampling attempts per problem, for a robust and fair comparison.
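
As a minimal sketch of how such an averaged pass@1 score can be computed, assume each problem comes with one boolean correctness flag per sampled response (32 flags per problem in the setting above). The function name `mean_pass_at_1` and the dictionary layout are illustrative assumptions and are not taken from the released evaluation code.

```python
from statistics import mean


def mean_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Return pass@1 in percent, averaged over all samples of every problem.

    `results` maps a problem id to one correctness flag per sampled response
    (hypothetical layout; 32 flags per problem in the setting described above).
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    per_problem = []
    for problem_id, flags in results.items():
        if not flags:
            raise ValueError(f"no samples recorded for problem {problem_id!r}")
        # Fraction of sampled responses that are correct for this problem.
        per_problem.append(sum(flags) / len(flags))
    # Average the per-problem accuracies, then convert to a percentage.
    return 100.0 * mean(per_problem)


# Toy example with 4 samples per problem instead of 32.
toy = {
    "problem-1": [True, True, True, True],
    "problem-2": [True, False, True, False],
}
print(mean_pass_at_1(toy))  # 75.0
```

Averaging within each problem first gives every problem equal weight. When all problems have the same number of samples, the result equals the overall fraction of correct responses.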

<!-- Table base styling (optional) -->
<style>
  table { border-collapse: collapse; width: 100%; margin-left: auto; margin-right: auto; }
  th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }
</style>

<!-- Table content -->
<table>
  <tr>
    <th>Model Scale</th>
    <th>Model</th>
    <th>AIME 24</th>
    <th>AIME 25</th>
  </tr>
  <!-- Merged row header >100B -->
  <tr>
    <th rowspan="6">&gt;100B</th>
  </tr>
  <!-- >100B data rows -->
  <tr>
    <td>DeepSeek-R1</td>
    <td><span style="color:grey">79.8</span></td>
    <td><span style="color:grey">70</span></td>
  </tr>
  <tr>
    <td>DeepSeek-R1-0528</td>
    <td><span style="color:grey">91.4</span></td>
    <td><span style="color:grey">87.5</span></td>
  </tr>
  <tr>
    <td>Qwen3-235B-A22B</td>
    <td><span style="color:grey">85.7</span></td>
    <td><span style="color:grey">81.5</span></td>
  </tr>
  <tr>
    <td>OpenAI-o3</td>
    <td><span style="font-weight: bold;">91.6</span></td>
    <td><span style="font-weight: bold;">88.9</span></td>
  </tr>
  <tr>
    <td>Gemini-2.5-Pro-0506</td>
    <td><span style="color:grey">90.8</span></td>
    <td><span style="color:grey">83</span></td>
  </tr>
  <!-- Separator row -->
  <tr>
    <td colspan="4"></td>
  </tr>
  <!-- Merged row header 32B -->
  <tr>
    <th rowspan="9">32B</th>
  </tr>
  <!-- 32B data rows -->
  <tr>
    <td>Qwen3-32B</td>
    <td><span style="color:grey">81.4</span></td>
    <td><span style="color:grey">72.9</span></td>
  </tr>
  <tr>
    <td>QwQ-32B</td>
    <td><span style="color:grey">79.5</span></td>
    <td><span style="color:grey">69.5</span></td>
  </tr>
  <tr>
    <td>DeepSeek-R1-Distill-Qwen-32B</td>
    <td><span style="color:grey">72.6</span></td>
    <td><span style="color:grey">49.6</span></td>
  </tr>
  <tr>
    <td>Skywork-OR1-32B</td>
    <td><span style="color:grey">82.2</span></td>
    <td><span style="color:grey">73.3</span></td>
  </tr>
  <tr>
    <td>AM-Thinking-v1</td>
    <td><span style="color:grey">85.3</span></td>
    <td><span style="color:grey">74.4</span></td>
  </tr>
  <tr>
    <td>OpenReasoning-Nemotron-32B</td>
    <td><span style="color:grey">89.2</span></td>
    <td><span style="color:grey">84.2</span></td>
  </tr>
  <tr>
    <td>PCL-Reasoner-v1</td>
    <td><span style="color:grey">85.7</span></td>
    <td><span style="color:grey">84.2</span></td>
  </tr>
  <tr>
    <td>PCL-Reasoner-v1.5</td>
    <td><span style="font-weight: bold;">90.9</span></td>
    <td><span style="font-weight: bold;">85.6</span></td>
  </tr>
</table>

## Citation

```bibtex
@article{PCL-Reasoner-v1.5,
  title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
  author={Yao Lu and Dengdong Fan and Jianzheng Nie and Fan Xu and Jie Chen and Bin Zhou and Yonghong Tian},
  journal={arXiv preprint arXiv:2601.14716},
  year={2026}
}
```