PCL-Reasoner committed
Commit 77bd521 · verified · 1 Parent(s): d0fb994

Update README.md

Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -36,7 +36,7 @@ model-index:
 # **PCL-Reasoner-V1.5**

 ## Model Overview
-We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.
+We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Both training and evaluation processes utilize FP16 precision to maintain numerical accuracy.
 ![Evaluation Results](images/benchmark.png)


@@ -52,7 +52,7 @@ We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM)

 ## Evaluation

-All results are reported using the **pass@1 metric** (averaged over 32 independent sampling attempts per problem), ensuring robust and fair comparison. Both training and evaluation processes utilize FP16 precision to maintain numerical accuracy.
+All results are reported using the **pass@1 metric** (averaged over 32 independent sampling attempts per problem), ensuring robust and fair comparison.

 <!-- Table base styling (optional) -->
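The evaluation line in the diff reports pass@1 averaged over 32 independent samples per problem. A minimal sketch of what that averaging amounts to (the function name and inputs are illustrative assumptions, not part of the model's released code):

```python
def pass_at_1(correct_counts, k=32):
    """Averaged pass@1: for each problem, the fraction of its k sampled
    answers that are correct, then the mean of those fractions."""
    return sum(c / k for c in correct_counts) / len(correct_counts)

# Three hypothetical problems: 32/32 correct, 16/32 correct, 0/32 correct.
print(pass_at_1([32, 16, 0]))  # 0.5
```

This per-problem averaging makes the score robust to sampling noise, unlike scoring a single completion per problem.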