# **PCL-Reasoner-V1.5**

## Model Overview

We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates that offline RL is a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs; both training and evaluation run in FP16 precision to preserve numerical accuracy.
## Evaluation
All results are reported as **pass@1**, averaged over 32 independent sampling attempts per problem, to ensure a robust and fair comparison.
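As a concrete illustration of how this averaged metric is computed, here is a minimal sketch (the function name and data layout are our own for illustration, not taken from the released evaluation code):

```python
def mean_pass_at_1(samples_correct):
    """Mean pass@1 over a benchmark.

    For each problem, take the fraction of its independent samples that are
    correct, then average that fraction across all problems.

    samples_correct: list of per-problem lists of booleans, e.g. 32 entries
    per problem as in the evaluation above.
    """
    per_problem = [sum(flags) / len(flags) for flags in samples_correct]
    return sum(per_problem) / len(per_problem)


# Hypothetical example: two problems, 32 samples each.
# Problem 1: 16/32 correct (0.5); problem 2: 32/32 correct (1.0).
print(mean_pass_at_1([[True] * 16 + [False] * 16, [True] * 32]))  # 0.75
```

Averaging over many samples per problem reduces the variance that a single greedy or sampled generation would introduce.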
<!-- Table base styling (optional) -->