Improve model card: Add pipeline tag, library name, and update CE-GPPO paper/code links
#3
by nielsr (HF Staff) · opened
README.md CHANGED
@@ -1,28 +1,31 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Kwai-Klear/Klear-Reasoner-8B-SFT
 datasets:
 - Kwai-Klear/KlearReasoner-MathSub-30K
 - Kwai-Klear/KlearReasoner-CodeSub-15K
 metrics:
 - accuracy
 ---
 
 
-
-We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**) that gently backpropagates gradients from clipped tokens.
 
 | Resource | Link |
 |---|---|
-
-
 | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
 | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
 | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
-| 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
 | 📧 Contact | suzhenpeng13@163.com |
 
 ## 🌟 Overview
@@ -40,8 +43,8 @@ Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA**
 | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |
 
 The model combines:
-1.
-2.
 
 ---
 
@@ -66,18 +69,21 @@ When we expand the inference budget to 64K and adopt the YaRN method with a scal
 
 > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
 
-
 ---
 
 ## 🧪 Training
 ### Configure the experimental environment
 ```bash
-git clone https://github.com/Kwai-
-cd
 pip install -r requirements.txt
 ```
 For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
 
 ### Using Ray for Multi-Node Training
 For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
 #### Step 1: Start Ray on the Head Node (node0)
@@ -124,22 +130,23 @@ For LiveCodeBench, please download the data from the official website.
 
 You can run the following commands to perform inference and evaluation:
 ```bash
-git clone https://github.com/Kwai-
-cd
 python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
 python judge_math.py <path_to_inference_results>
 ```
 
 ## 🤝 Citation
 If you find this work helpful, please cite our paper:
 ```bibtex
-@misc{
-  title={
-  author={Zhenpeng Su and Leiyu Pan and
   year={2025},
-  eprint={
   archivePrefix={arXiv},
   primaryClass={cs.LG},
-  url={https://arxiv.org/abs/
 }
 ```
 
@@ -1,28 +1,31 @@
 ---
 base_model:
 - Kwai-Klear/Klear-Reasoner-8B-SFT
 datasets:
 - Kwai-Klear/KlearReasoner-MathSub-30K
 - Kwai-Klear/KlearReasoner-CodeSub-15K
+language:
+- en
+license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: text-generation
+library_name: transformers
 ---
 
+# ✨ Klear-Reasoner-8B: Advancing Reasoning Capability via CE-GPPO
+
+This repository contains the `Klear-Reasoner-8B` model, a powerful reasoning model that implements innovations from the paper **[CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712)**.
 
+**CE-GPPO** introduces a novel algorithm that reintroduces, in a gentle and bounded manner, the gradients that native PPO discards for clipped tokens. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves an exploration-exploitation trade-off. The approach effectively mitigates entropy instability and consistently outperforms strong baselines across different model scales on mathematical reasoning benchmarks.
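
To make the gradient-preserving clipping idea concrete, the sketch below shows a PPO-style token loss in which a stop-gradient trick keeps the clipped value in the forward pass while a bounded gradient still reaches clipped tokens. This is a minimal illustration of the idea as described above, not the authors' implementation; the function name, the symmetric epsilon defaults, and the toy tensors are assumptions.

```python
import torch

def gradient_preserving_clip_loss(logp_new, logp_old, advantages,
                                  eps_low=0.2, eps_high=0.2):
    """Sketch: PPO-style loss whose clipped branch keeps a bounded gradient.

    Vanilla PPO's clip() yields zero gradient for tokens whose importance
    ratio leaves [1 - eps_low, 1 + eps_high]. Multiplying the ratio by the
    detached factor (clipped / ratio) leaves the forward value identical to
    the clipped ratio, while the gradient w.r.t. the new log-probs equals
    the clip bound itself: bounded and non-zero instead of zero.
    """
    ratio = torch.exp(logp_new - logp_old)             # importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    soft_clipped = ratio * (clipped / ratio).detach()  # forward value: clipped
    loss = -torch.min(ratio * advantages, soft_clipped * advantages)
    return loss.mean()

# Toy check: tokens outside the clip interval still receive gradient.
logp_old = torch.randn(8)
logp_new = (logp_old + torch.randn(8)).requires_grad_()
advantages = torch.randn(8)
gradient_preserving_clip_loss(logp_new, logp_old, advantages).backward()
print(logp_new.grad)
```

Inside the clipping interval the detached factor is 1, so the term reduces exactly to vanilla PPO; outside it, the gradient is scaled rather than zeroed, which is the exploration-preserving behavior the paragraph above describes.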
 
 | Resource | Link |
 |---|---|
+| 📄 Paper | [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712) |
+| 🧑‍💻 Code & Issues | [GitHub: Kwai-Klear/CE-GPPO](https://github.com/Kwai-Klear/CE-GPPO) |
 | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
 | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
 | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
 | 📧 Contact | suzhenpeng13@163.com |
 
 ## 🌟 Overview
@@ -40,8 +43,8 @@ Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA**
 | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |
 
 The model combines:
+1. **Quality-centric long CoT SFT** – distilled from DeepSeek-R1-0528.
+2. **Gradient-Preserving Clipping Policy Optimization (CE-GPPO)** – a novel RL method that **keeps gradients from clipped tokens** to boost exploration & convergence.
 
 ---
@@ -66,18 +69,21 @@ When we expand the inference budget to 64K and adopt the YaRN method with a scal
 
 > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
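
Spelled out, avg@_n_ samples _n_ generations per problem, scores each for correctness, and averages the per-problem pass rates; a small sketch (the helper name and data are hypothetical):

```python
def avg_at_n(correct: dict[str, list[bool]]) -> float:
    """avg@n: mean per-problem pass rate over n sampled generations."""
    per_problem = [sum(v) / len(v) for v in correct.values()]
    return sum(per_problem) / len(per_problem)

# Two problems, n = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_n({"p1": [True, True, False, True],
                "p2": [False, True, False, False]}))
```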
 
 ---
 
 ## 🧪 Training
 ### Configure the experimental environment
 ```bash
+git clone https://github.com/Kwai-Klear/CE-GPPO
+cd CE-GPPO
+pip install -e .
 pip install -r requirements.txt
 ```
 For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
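
As a minimal illustration of the timeout handling described above (the worker function and limits are made up, and the real pipeline additionally confines execution with Firejail), Pebble's `ProcessPool.schedule` accepts a per-task `timeout` and terminates the worker process when it expires:

```python
import time
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def run_case(seconds: float) -> bool:
    """Stand-in for executing one sandboxed test case."""
    time.sleep(seconds)
    return True

if __name__ == "__main__":
    with ProcessPool(max_workers=4) as pool:
        future = pool.schedule(run_case, args=(30,), timeout=5)
        try:
            passed = future.result()
        except TimeoutError:
            # Pebble kills the worker process on timeout, so its memory and
            # file handles are reclaimed automatically.
            passed = False
        print(passed)  # False: the 30 s task exceeded the 5 s limit
```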
 
+### Download a pre-trained checkpoint & data
+We trained our model based on [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), using the [KlearReasoner-MathSub-30K](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) dataset for training, with [AIME2024](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify.json) and [AIME2025](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify25.json) as the validation sets.
+
 ### Using Ray for Multi-Node Training
 For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
 #### Step 1: Start Ray on the Head Node (node0)
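
After `ray start` has been run on the head and worker nodes as outlined in these steps, a short Python check using Ray's public API can confirm that every machine has joined before the training script is launched (a sketch; adjust to your cluster):

```python
import ray

# Attach to the already-running cluster; this does not start a new one.
ray.init(address="auto")

for node in ray.nodes():
    status = "alive" if node["Alive"] else "dead"
    print(f"{node['NodeManagerAddress']}: {status}")
```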
@@ -124,22 +130,23 @@ For LiveCodeBench, please download the data from the official website.
 
 You can run the following commands to perform inference and evaluation:
 ```bash
+git clone https://github.com/Kwai-Klear/CE-GPPO
+cd CE-GPPO/benchmarks
 python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
 python judge_math.py <path_to_inference_results>
 ```
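
The judging step relies on the math_verify library mentioned earlier; its documented `parse`/`verify` entry points behave roughly like this (the answers here are made-up examples):

```python
from math_verify import parse, verify

# Parse the gold answer and a model prediction, then test equivalence.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${2,4} \\cup {1,3}$")
print(verify(gold, answer))  # True: the two set expressions are equivalent
```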
 
+---
 ## 🤝 Citation
 If you find this work helpful, please cite our paper:
 ```bibtex
+@misc{su2025cegppocontrollingentropygradientpreserving,
+  title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
+  author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
   year={2025},
+  eprint={2509.20712},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2509.20712},
 }
 ```