---
base_model:
- Kwai-Klear/Klear-Reasoner-8B-SFT
datasets:
- Kwai-Klear/KlearReasoner-MathSub-30K
- Kwai-Klear/KlearReasoner-CodeSub-15K
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# ✨ Klear-Reasoner-8B: Advancing Reasoning Capability via CE-GPPO

This repository contains `Klear-Reasoner-8B`, a reasoning model built on the innovations introduced in the paper **[CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712)**.

**CE-GPPO** is a novel algorithm that reintroduces gradients from tokens clipped by native PPO in a gentle, bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves a better exploration-exploitation trade-off, effectively mitigating entropy instability and consistently outperforming strong baselines across model scales on mathematical reasoning benchmarks.
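
To make the mechanism concrete, here is a minimal PyTorch sketch of the gradient-preserving-clipping idea. It is an illustration, not the paper's exact objective: the clipping range `eps` and the gradient-scaling coefficient `beta` are assumed hyperparameters.

```python
import torch

def gradient_preserving_clip_loss(logp_new, logp_old, advantages,
                                  eps=0.2, beta=0.05):
    """Sketch of a gradient-preserving clipped surrogate.

    Vanilla PPO zeroes the gradient for tokens whose importance ratio
    leaves [1 - eps, 1 + eps]; here the forward value is still clamped
    to the boundary, but a small, bounded gradient (scaled by beta) is
    re-introduced for those tokens.
    """
    ratio = torch.exp(logp_new - logp_old)              # importance ratio
    clamped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Straight-through-style trick: forward value == clamped ratio,
    # backward gradient == beta * d(ratio)/d(theta) for clipped tokens.
    preserved = clamped.detach() + beta * (ratio - ratio.detach())
    out_of_range = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    effective = torch.where(out_of_range, preserved, ratio)
    return -(effective * advantages).mean()             # minimize negative surrogate
```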

| Resource | Link |
|---|---|
| 📄 Paper | [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712) |
| 🧑‍💻 Code & Issues | [GitHub: Kwai-Klear/CE-GPPO](https://github.com/Kwai-Klear/CE-GPPO) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 📧 Contact | suzhenpeng13@163.com |

## 📌 Overview

<div align="center">
<img src="main_result.png" width="100%"/>

<sub>Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01–2025/02/01, avg@8), and V6 (2025/02/01–2025/05/01, avg@8).</sub>
</div>

Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA** performance on challenging **math and coding benchmarks**:

| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
|---|---|---|---|---|
| **Score** | **90.5%** | **83.2%** | **66.0%** | **58.1%** |

The model combines:
1. **Quality-centric long CoT SFT**, distilled from DeepSeek-R1-0528.
2. **Gradient-Preserving Clipping Policy Optimization (CE-GPPO)**, a novel RL method that **keeps gradients from clipped tokens** to improve both exploration and convergence.

---

## 📊 Benchmark Results (Pass@1)

| Model | AIME2024<br>avg@64 | AIME2025<br>avg@64 | HMMT2025<br>avg@64 | LCB V5<br>avg@8 | LCB V6<br>avg@8 |
|-------|--------------------|--------------------|--------------------|-----------------|-----------------|
| AReal-boba-RL-7B | 61.9 | 48.3 | 29.4 | 34.3 | 31.0† |
| MiMo-7B-RL | 68.2 | 55.4 | 35.7 | 57.8 | 49.3 |
| Skywork-OR1-7B | 70.2 | 54.6 | 35.7 | 47.6 | 42.7 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 42.9 | 57.2 | 52.1 |
| POLARIS-4B-Preview | 81.2 | _79.4_ | 58.7 | 58.5† | 53.0† |
| Qwen3-8B | 76.0 | 67.3 | 44.7† | 57.5 | 48.4† |
| DeepSeek-R1-0528-Distill-8B | _86.0_ | 76.3 | 61.5 | 61.0† | 51.6† |
| OpenReasoning-Nemotron-7B | 84.7 | 78.2 | 63.5 | _65.6_† | _56.3_† |
| Klear-Reasoner-8B-SFT | 75.6 | 70.1 | 57.6 | 58.5 | 49.6 |
| Klear-Reasoner-8B | 83.2 | 75.6 | 60.3 | 61.6 | 53.1 |
| *w/ 64K Inference Budget* | **90.5** | **83.2** | **70.8** | **66.0** | **58.1** |

> We report average `pass@1` results over _n_ samples (avg@_n_); all other evaluation settings follow the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95). The *w/ 64K Inference Budget* row expands the inference budget to 64K tokens using YaRN with a scaling factor of 2.5 (see the Evaluation section below).
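
As a reference for how avg@_n_ is computed, a small sketch (the array layout is an assumption: one row per problem, one column per sampled completion):

```python
import numpy as np

def avg_at_n(correct: np.ndarray) -> float:
    """avg@n: mean pass@1 over n samples per problem.

    `correct` is a boolean array of shape (num_problems, n), where
    correct[i, j] says whether sample j solved problem i.
    """
    return float(correct.mean(axis=1).mean())

# Example: 2 problems, 4 samples each -> (1.0 + 0.5) / 2 = 0.75
print(avg_at_n(np.array([[True, True, True, True],
                         [True, False, True, False]])))
```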

---
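
## 🚀 Quick Start

A minimal inference sketch with 🤗 Transformers (the declared `library_name`), using the sampling settings reported above; the prompt and `max_new_tokens` are illustrative, and we assume the tokenizer ships a chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwai-Klear/Klear-Reasoner-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Chat-formatted prompt (illustrative).
messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings follow the evaluation protocol above.
outputs = model.generate(
    inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```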

## 🧪 Training
### Configure the experimental environment
```bash
git clone https://github.com/Kwai-Klear/CE-GPPO
cd CE-GPPO
pip install -e .
pip install -r requirements.txt
```
For code, we use [Firejail](https://github.com/netblue30/firejail) as the **sandbox** environment. In addition, we implement multi-process control on top of [Pebble](https://github.com/noxdafox/pebble), so that resources are automatically reclaimed when a task times out. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
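
A minimal sketch of this judging setup, combining Pebble's per-task timeout with math_verify equivalence checking (the `judge` helper and the example pairs are illustrative):

```python
from concurrent.futures import TimeoutError

from math_verify import parse, verify
from pebble import ProcessPool

def judge(gold_expr: str, model_answer: str) -> bool:
    """Parse both sides and check mathematical equivalence."""
    return verify(parse(gold_expr), parse(model_answer))

pairs = [(r"$\frac{1}{2}$", "$0.5$"), ("$2$", "$3$")]

with ProcessPool(max_workers=4) as pool:
    futures = [pool.schedule(judge, args=pair, timeout=5) for pair in pairs]
    for future in futures:
        try:
            print(future.result())  # True / False
        except TimeoutError:
            # Pebble terminates the worker process, reclaiming its resources.
            print("timeout")
```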

### Download a pre-trained checkpoint & data
We trained our models from [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), using the [KlearReasoner-MathSub-30K](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) dataset for training and [AIME2024](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify.json) and [AIME2025](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify25.json) as validation sets.
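
To prefetch the base checkpoint and training data, a minimal sketch with `huggingface_hub` and `datasets` (the local directory and the `train` split are assumptions):

```python
from datasets import load_dataset
from huggingface_hub import snapshot_download

# Base policy checkpoint used as the RL starting point.
snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    local_dir="./ckpts/DeepSeek-R1-Distill-Qwen-7B",
)

# Math RL training data.
train = load_dataset("Kwai-Klear/KlearReasoner-MathSub-30K", split="train")
print(len(train))
```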

### Using Ray for Multi-Node Training
For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
#### Step 1: Start Ray on the Head Node (node0)

On the first node (typically called `node0`), run:

```bash
ray start --head --dashboard-host=0.0.0.0
```

Then get the IP address of the head node (the `MASTER_IP` used below):

```bash
MASTER_IP=$(hostname -I | awk '{print $1}')
```
#### Step 2: Connect Other Nodes (e.g., node1)

On each additional worker node (e.g., `node1`), run the following, replacing the IP with that of your head node:

```bash
ray start --address="$MASTER_IP:6379"
```
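
Before launching training, it can help to confirm from the head node that every machine has joined the cluster; a quick check with the Ray Python API:

```python
import ray

# Attach to the cluster started by `ray start` above.
ray.init(address="auto")

# ray.nodes() returns one record per node; count the live ones.
alive = [node for node in ray.nodes() if node["Alive"]]
print(f"{len(alive)} node(s) alive")
```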

### RL Training
Run the following script on the master node to start the training task:

```bash
bash recipe/dapo/perf_run_dapo_ours_math.sh  # For Math RL
bash recipe/dapo/perf_run_dapo_ours_code.sh  # For Code RL
```

In the startup script, you need to set the following variables:
```bash
YOUR_MODEL_PATH="<your_model_path>"
CKPTS_SAVE_DIR="<ckpts_save_path>"
YOUR_TRAIN_FILE="<train_data_path>"
YOUR_TEST_FILE="<test_data_path>"
```

### Evaluation
To expand the inference budget to 64K tokens, we adopt **the YaRN method with a scaling factor of 2.5**.
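
One way to apply this in 🤗 Transformers is to override the RoPE scaling configuration at load time. The sketch below uses the Qwen-style `rope_scaling` dict; the `original_max_position_embeddings` value is an assumption about the base context length:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Kwai-Klear/Klear-Reasoner-8B"

# YaRN long-context scaling with the factor 2.5 reported above.
cfg = AutoConfig.from_pretrained(model_id)
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.5,
    "original_max_position_embeddings": 32768,  # assumed base context
}
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=cfg, torch_dtype="auto", device_map="auto"
)
```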

The evaluation data for AIME24, AIME25, and HMMT2025 are available in our GitHub repository under the **benchmarks** directory. For LiveCodeBench, please download the data from the official website.

You can run the following commands to perform inference and evaluation:
```bash
git clone https://github.com/Kwai-Klear/CE-GPPO
cd CE-GPPO/benchmarks
python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./aime24.qs.jsonl
python judge_math.py <path_to_inference_results>
```

---
## 🤝 Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
      title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
      author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
      year={2025},
      eprint={2509.20712},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.20712},
}
```