V1.5 / README.md

Update README.md

77bd521 verified 8 days ago

4.61 kB

	---
	license: apache-2.0
	license_link: https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/LICENSE
	language:
	- en
	- zh
	pipeline_tag: text-generation
	datasets:
	- PCL-Reasoner/V1.5-RL-Math
	metrics:
	- accuracy
	base_model:
	- Qwen/Qwen2.5-32B
	tags:
	- math
	model-index:
	- name: PCL-Reasoner/V1.5
	results:
	- task:
	type: text-generation
	dataset:
	name: Aime24
	type: Aime24
	metrics:
	- name: Aime24
	type: Aime24
	value: 90.9
	- name: Aime25
	type: Aime25
	value: 85.6
	---




	# PCL-Reasoner-V1.5

	## Model Overview
	We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs. Both training and evaluation processes utilize FP16 precision to maintain numerical accuracy.
	![Evaluation Results](images/benchmark.png)



	## Code

	[GitHub Repository](https://github.com/PCL-Reasoner/V1.5)

	## RL Dataset

	[Huggingface Dataset](https://huggingface.co/datasets/PCL-Reasoner/V1.5-RL-Math)


	## Evaluation

	All results are reported using the pass@1 metric (averaged over 32 independent sampling attempts per problem), ensuring robust and fair comparison.

	<!-- Table base styling (optional) -->

	<style>
	table { border-collapse: collapse; width: 100%; margin-left: auto;margin-right: auto;}
	th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }
	</style>

	<!-- Table content -->

	<table>
	<tr>
	<th>Model Scale</th>
	<th>Model</th>
	<th>AIME 24</th>
	<th>AIME 25</th>
	</tr>
	<!-- Merged row header >100B -->
	<tr>
	<th rowspan="6">>100B</th>
	</tr>
	<!-- >100B data rows -->
	<tr>
	<td>DeepSeek-R1</td>
	<td><span style="color:grey">79.8</span></td>
	<td><span style="color:grey">70</span></td>
	</tr>
	<tr>
	<td>DeepSeek-R1-0528</td>
	<td><span style="color:grey">91.4</span></td>
	<td><span style="color:grey">87.5</span></td>
	</tr>
	<tr>
	<td>Qwen3-235B-A22B</td>
	<td><span style="color:grey">85.7</span></td>
	<td><span style="color:grey">81.5</span></td>
	</tr>
	<tr>
	<td>OpenAI-o3</td>
	<td><span style="font-weight: bold;">91.6</span></td>
	<td><span style="font-weight: bold;">88.9</span></td>
	</tr>
	<tr>
	<td>Gemini-2.5-Pro-0506</td>
	<td><span style="color:grey">90.8</span></td>
	<td><span style="color:grey">83</span></td>
	</tr>
	<!-- Separator row -->
	<tr>
	<td colspan="4"></td>
	</tr>
	<!-- Merged row header 32B -->
	<tr>
	<th rowspan="9">32B</th>
	</tr>
	<!-- 32B data rows -->
	<tr>
	<td>Qwen3-32B</td>
	<td><span style="color:grey">81.4</span></td>
	<td><span style="color:grey">72.9</span></td>
	</tr>
	<tr>
	<td>QwQ-32B</td>
	<td><span style="color:grey">79.5</span></td>
	<td><span style="color:grey">69.5</span></td>
	</tr>
	<tr>
	<td>DeepSeek-R1-Distill-Qwen-32B</td>
	<td><span style="color:grey">72.6</span></td>
	<td><span style="color:grey">49.6</span></td>
	</tr>
	<tr>
	<td>Skywork-OR1-32B</td>
	<td><span style="color:grey">82.2</span></td>
	<td><span style="color:grey">73.3</span></td>
	</tr>
	<tr>
	<td>AM-Thinking-v1</td>
	<td><span style="color:grey">85.3</span></td>
	<td><span style="color:grey">74.4</span></td>
	</tr>
	<tr>
	<td>OpenReasoning-Nemotron-32B</td>
	<td><span style="color:grey">89.2</span></td>
	<td><span style="color:grey">84.2</span></td>
	</tr>
	<tr>
	<td>PCL-Reasoner-v1</td>
	<td><span style="color:grey">85.7</span></td>
	<td><span style="color:grey">84.2</span></td>
	</tr>
	<tr>
	<td>PCL-Reasoner-v1.5</td>
	<td><span style="font-weight: bold;">90.9</span></td>
	<td><span style="font-weight: bold;">85.7</span></td>
	</tr>
	</table>


	## Citation

	```bibtex
	@article{PCL-Reasoner-v1.5,
	title={PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning},
	author={Yao Lu, Dengdong Fan, Jianzheng Nie, Fan Xu, Jie Chen, Bin Zhou, Yonghong Tian},
	journal={arXiv preprint arXiv:2601.14716},
	year={2026}
	}
	```