---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
datasets:
- PCL-Reasoner/V1.5-RL-Math
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-32B
tags:
- math
model-index:
- name: PCL-Reasoner/V1.5
  results:
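  - task:
      type: text-generation
    dataset:
      name: AIME 2024
      type: Aime24
    metrics:
    - name: pass@1 (avg. of 32 samples)
      type: accuracy
      value: 90.9
  - task:
      type: text-generation
    dataset:
      name: AIME 2025
      type: Aime25
    metrics:
    - name: pass@1 (avg. of 32 samples)
      type: accuracy
      value: 85.6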
---




# **PCL-Reasoner-V1.5**

## Model Overview  
We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built on Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central contribution is our proposed offline RL method, which offers better training stability and efficiency than standard online RL methods such as GRPO. The model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. These results indicate that offline RL is a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Both training and evaluation use FP16 precision to maintain numerical accuracy.

![Evaluation Results](images/benchmark.png)



## Code

[GitHub Repository](https://github.com/PCL-Reasoner/V1.5)

## RL Dataset

[Huggingface Dataset](https://huggingface.co/datasets/PCL-Reasoner/V1.5-RL-Math)


## Evaluation  

All results are reported as **pass@1**, averaged over 32 independent samples per problem to reduce sampling variance and keep the comparison robust and fair.
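The averaging described above can be sketched as follows. This is a minimal illustration, not the evaluation code from the repository: the function name `avg_pass_at_1` and the input layout (one list of per-sample correctness flags per problem) are assumptions made for the example.

```python
from typing import Sequence


def avg_pass_at_1(correct_by_problem: Sequence[Sequence[bool]]) -> float:
    """Return average pass@1 as a percentage.

    correct_by_problem[i][j] is True if the j-th sampled answer to
    problem i was graded correct (hypothetical input layout).
    """
    if not correct_by_problem:
        raise ValueError("need at least one problem")

    per_problem = []
    for samples in correct_by_problem:
        if not samples:
            raise ValueError("each problem needs at least one sample")
        # pass@1 for one problem: fraction of its samples that are correct
        per_problem.append(sum(1 for c in samples if c) / len(samples))

    # Benchmark score: mean over problems, expressed in percent
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: 2 problems, 4 samples each (the card uses 32 samples per problem)
toy = [
    [True, True, True, True],     # solved in every sample
    [True, True, False, False],   # solved in half of the samples
]
print(avg_pass_at_1(toy))  # 75.0
```

With 32 samples per problem, each inner list would hold 32 flags; the computation is otherwise the same.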

<!-- Table base styling (optional) -->

<style>
  table { border-collapse: collapse; width: 100%; margin-left: auto;margin-right: auto;}
  th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }
</style>

<!-- Table content -->

<table>
  <tr>
    <th>Model Scale</th>
    <th>Model</th>
    <th>AIME 24</th>
    <th>AIME 25</th>
  </tr>
  <!-- Merged row header >100B -->
  <tr>
    <th rowspan="6">&gt;100B</th>
  </tr>
  <!-- >100B data rows -->
  <tr>
    <td>DeepSeek-R1</td>
    <td><span style="color:grey">79.8</span></td>
    <td><span style="color:grey">70.0</span></td>
  </tr>
  <tr>
    <td>DeepSeek-R1-0528</td>
    <td><span style="color:grey">91.4</span></td>
    <td><span style="color:grey">87.5</span></td>
  </tr>
  <tr>
    <td>Qwen3-235B-A22B</td>
    <td><span style="color:grey">85.7</span></td>
    <td><span style="color:grey">81.5</span></td>
  </tr>
  <tr>
    <td>OpenAI-o3</td>
    <td><span style="font-weight: bold;">91.6</span></td>
    <td><span style="font-weight: bold;">88.9</span></td>
  </tr>
  <tr>
    <td>Gemini-2.5-Pro-0506</td>
    <td><span style="color:grey">90.8</span></td>
    <td><span style="color:grey">83.0</span></td>
  </tr>
  <!-- Separator row -->
  <tr>
    <td colspan="4"></td>
  </tr>
  <!-- Merged row header 32B -->
  <tr>
    <th rowspan="9">32B</th>
  </tr>
  <!-- 32B data rows -->
  <tr>
    <td>Qwen3-32B</td>
    <td><span style="color:grey">81.4</span></td>
    <td><span style="color:grey">72.9</span></td>
  </tr>
  <tr>
    <td>QwQ-32B</td>
    <td><span style="color:grey">79.5</span></td> 
    <td><span style="color:grey">69.5</span></td>
  </tr>
  <tr>
    <td>DeepSeek-R1-Distill-Qwen-32B</td>
    <td><span style="color:grey">72.6</span></td>
    <td><span style="color:grey">49.6</span></td> 
  </tr>
  <tr>
    <td>Skywork-OR1-32B</td>
    <td><span style="color:grey">82.2</span></td>
    <td><span style="color:grey">73.3</span></td>
  </tr>
  <tr>
    <td>AM-Thinking-v1</td>
    <td><span style="color:grey">85.3</span></td>
    <td><span style="color:grey">74.4</span></td>
  </tr>
  <tr>
    <td>OpenReasoning-Nemotron-32B</td>
    <td><span style="color:grey">89.2</span></td>
    <td><span style="color:grey">84.2</span></td>
  </tr>
  <tr>
    <td>PCL-Reasoner-V1</td>
    <td><span style="color:grey">85.7</span></td> 
    <td><span style="color:grey">84.2</span></td> 
  </tr>
  <tr>
    <td>PCL-Reasoner-V1.5</td>
    <td><span style="font-weight: bold;">90.9</span></td> 
    <td><span style="font-weight: bold;">85.6</span></td>
  </tr>
</table>


## Citation

```bibtex
@article{PCL-Reasoner-v1.5,
  title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
  author={Yao Lu and Dengdong Fan and Jianzheng Nie and Fan Xu and Jie Chen and Bin Zhou and Yonghong Tian},
  journal={arXiv preprint arXiv:2601.14716},
  year={2026}
}
```