V1.5 / README.md
PCL-Reasoner's picture
Update README.md
77bd521 verified
metadata
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
language:
  - en
  - zh
pipeline_tag: text-generation
datasets:
  - PCL-Reasoner/V1.5-RL-Math
metrics:
  - accuracy
base_model:
  - Qwen/Qwen2.5-32B
tags:
  - math
model-index:
  - name: PCL-Reasoner/V1.5
    results:
      - task:
          type: text-generation
        dataset:
          name: Aime24
          type: Aime24
        metrics:
          - name: Aime24
            type: Aime24
            value: 90.9
          - name: Aime25
            type: Aime25
            value: 85.6

PCL-Reasoner-V1.5

Model Overview

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Both training and evaluation processes utilize FP16 precision to maintain numerical accuracy. Evaluation Results

Code

GitHub Repository

RL Dataset

Huggingface Dataset

Evaluation

All results are reported using the pass@1 metric (averaged over 32 independent sampling attempts per problem), ensuring robust and fair comparison.

Model Scale Model AIME 24 AIME 25
>100B
DeepSeek-R1 79.8 70
DeepSeek-R1-0528 91.4 87.5
Qwen3-235B-A22B 85.7 81.5
OpenAI-o3 91.6 88.9
Gemini-2.5-Pro-0506 90.8 83
32B
Qwen3-32B 81.4 72.9
QwQ-32B 79.5 69.5
DeepSeek-R1-Distill-Qwen-32B 72.6 49.6
Skywork-OR1-32B 82.2 73.3
AM-Thinking-v1 85.3 74.4
OpenReasoning-Nemotron-32B 89.2 84.2
PCL-Reasoner-v1 85.7 84.2
PCL-Reasoner-v1.5 90.9 85.7

Citation

@article{PCL-Reasoner-v1.5,
  title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
  author={Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian},
  journal={arXiv preprint arXiv:2601.14716},
  year={2026}
}