---
license: apache-2.0
language:
- en
base_model:
- Kwai-Klear/Klear-Reasoner-8B-SFT
datasets:
- Kwai-Klear/KlearReasoner-MathSub-30K
- Kwai-Klear/KlearReasoner-CodeSub-15K
metrics:
- accuracy
---
# ✨ Klear-Reasoner-8B
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving and achieves outstanding performance across multiple benchmarks. We investigate two key issues with the clipping mechanism in current RL methods: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these issues, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**), which gently backpropagates gradients from clipped tokens.
| Resource | Link |
|---|---|
| 📄 Preprint | [Paper](https://arxiv.org/pdf/2508.07629) |
| 🤗 Daily Paper | [Paper](https://huggingface.co/papers/2508.07629) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
| 📧 Contact | suzhenpeng13@163.com |
## 🚀 Overview
<div align="center">
<img src="main_result.png" width="100%"/>
<sub>Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01-2025/02/01, avg@8), and V6 (2025/02/01-2025/05/01, avg@8).</sub>
</div>
Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA** performance on challenging **math and coding benchmarks**:
| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
|---|---|---|---|---|
| **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |
The model combines:
1. **Quality-centric long CoT SFT**: distilled from DeepSeek-R1-0528.
2. **Gradient-Preserving Clipping Policy Optimization (GPPO)**: a novel RL method that **keeps gradients from clipped tokens** to boost exploration and convergence.
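To see what gradient preservation changes, compare the per-token gradient of a standard PPO-clip surrogate with a gradient-preserving variant. The following is a minimal pure-Python sketch of the idea, not the paper's exact GPPO objective; `beta` is a hypothetical damping constant introduced only for illustration:

```python
def ppo_token_grad(ratio, advantage, eps=0.2):
    """Per-token gradient (w.r.t. the importance ratio) of the standard
    PPO-clip surrogate min(r*A, clip(r, 1-eps, 1+eps)*A).

    Once the ratio is clipped on the side the advantage pushes toward,
    the gradient is exactly zero and the token stops contributing.
    """
    if advantage >= 0:
        return advantage if ratio <= 1 + eps else 0.0
    return advantage if ratio >= 1 - eps else 0.0


def gppo_token_grad(ratio, advantage, eps=0.2, beta=0.1):
    """Gradient-preserving variant: instead of dropping clipped tokens,
    pass back a damped gradient so exploration signals survive.
    (beta is a hypothetical damping constant for this sketch.)"""
    g = ppo_token_grad(ratio, advantage, eps)
    return g if g != 0.0 else beta * advantage
```

With a positive advantage and a ratio above `1 + eps`, standard clipping returns a zero gradient, while the gradient-preserving variant still backpropagates a small signal.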
---
### Evaluation
For the 64K inference budget, we expand the context window with the YaRN method using a scaling factor of 2.5. **Evaluation is coming soon, stay tuned.**
## 📊 Benchmark Results (Pass@1)
| Model | AIME2024<br>avg@64 | AIME2025<br>avg@64 | HMMT2025<br>avg@64 | LCB V5<br>avg@8 | LCB V6<br>avg@8 |
|-------|--------------------|--------------------|--------------------|-----------------|-----------------|
| AReal-boba-RL-7B | 61.9 | 48.3 | 29.4 | 34.3 | 31.0† |
| MiMo-7B-RL | 68.2 | 55.4 | 35.7 | 57.8 | 49.3 |
| Skywork-OR1-7B | 70.2 | 54.6 | 35.7 | 47.6 | 42.7 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 42.9 | 57.2 | 52.1 |
| POLARIS-4B-Preview | 81.2 | _79.4_ | 58.7 | 58.5† | 53.0† |
| Qwen3-8B | 76.0 | 67.3 | 44.7† | 57.5 | 48.4† |
| Deepseek-R1-0528-Distill-8B | _86.0_ | 76.3 | 61.5 | 61.0† | 51.6† |
| OpenReasoning-Nemotron-7B | 84.7 | 78.2 | 63.5 | _65.6_† | _56.3_† |
| Klear-Reasoner-8B-SFT | 75.6 | 70.1 | 57.6 | 58.5 | 49.6 |
| Klear-Reasoner-8B | 83.2 | 75.6 | 60.3 | 61.6 | 53.1 |
| *w/ 64K Inference Budget* | **90.5** | **83.2** | **70.8** | **66.0** | **58.1** |
> We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
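Concretely, avg@_n_ estimates pass@1 for each problem as the fraction of _n_ independent samples judged correct, then averages over the benchmark. A minimal sketch (function names are illustrative):

```python
def avg_at_n(correct):
    """pass@1 for one problem estimated from n independent samples:
    the fraction of samples judged correct."""
    return sum(correct) / len(correct)


def benchmark_avg_at_n(per_problem):
    """avg@n over a benchmark: the mean of per-problem pass@1 estimates."""
    return sum(avg_at_n(c) for c in per_problem) / len(per_problem)
```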
---
## 🧪 Training
### Configure the experimental environment
```bash
git clone https://github.com/suu990901/KlearReasoner
cd KlearReasoner
pip install -r requirements.txt
```
For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
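As a stripped-down illustration of the sandboxed-execution pattern, here is a standard-library stand-in for the repo's Firejail + Pebble setup (Firejail additionally restricts filesystem and network access, and Pebble reclaims worker-process resources on timeout; this sketch only enforces a hard timeout):

```python
import subprocess
import sys


def run_in_sandbox(code: str, timeout_s: float = 2.0):
    """Run untrusted Python in a child process with a hard timeout.

    Returns (stdout, returncode, timed_out). A real sandbox would also
    confine filesystem/network access, as Firejail does.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.stdout.strip(), proc.returncode, False
    except subprocess.TimeoutExpired:
        # Child exceeded the budget; subprocess kills it for us.
        return "", None, True
```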
### Using Ray for Multi-Node Training
For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
#### Step 1: Start Ray on the Head Node (node0)
On the first node (typically called `node0`), run:
```bash
ray start --head --dashboard-host=0.0.0.0
```
Get the IP address of the head node (run this on `node0`):
```bash
MASTER_IP=$(hostname -I | awk '{print $1}')
```
#### Step 2: Connect Other Nodes (e.g., node1)
On each additional worker node (e.g., `node1`), run the following, replacing the IP with that of your head node:
```bash
ray start --address="$MASTER_IP:6379"
```
### RL Training
Run one of the following scripts on the head node to start training:
```bash
bash recipe/dapo/perf_run_dapo_ours_math.sh # For Math RL
bash recipe/dapo/perf_run_dapo_ours_code.sh # For Code RL
```
In the startup script, you need to set the following variables:
```bash
YOUR_MODEL_PATH="<your_model_path>"
CKPTS_SAVE_DIR="<ckpts_save_path>"
YOUR_TRAIN_FILE="<train_data_path>"
YOUR_TEST_FILE="<test_data_path>"
```
### Evaluation
To expand the inference budget to 64K, we adopt **the YaRN method with a scaling factor of 2.5**.
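For Hugging Face-style models, YaRN extension is typically enabled via a `rope_scaling` entry in the model config. The exact keys and the base context length below are assumptions that vary by model and `transformers` version; consult your model's documentation before applying them:

```json
"rope_scaling": {
  "rope_type": "yarn",
  "factor": 2.5,
  "original_max_position_embeddings": 32768
}
```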
The evaluation data for AIME24, AIME25, and HMMT2025 are available in our GitHub repository under the **benchmarks directory**.
For LiveCodeBench, please download the data from the official website.
You can run the following commands to perform inference and evaluation:
```bash
git clone https://github.com/suu990901/KlearReasoner
cd KlearReasoner/benchmarks
python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./aime24.qs.jsonl
python judge_math.py <path_to_inference_results>
```
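The judging step relies on `math_verify` for robust mathematical equivalence checking. As a rough illustration of what such a judge does (a tiny stand-in, not `math_verify`'s actual logic), compare answers by exact string match and then by numeric value:

```python
from fractions import Fraction


def answers_match(pred: str, gold: str) -> bool:
    """Tiny stand-in for math_verify-style judging: strip whitespace and
    dollar-sign wrappers, try exact string match, then fall back to
    numeric comparison so '1/2' and '0.5' count as equivalent."""
    p, g = pred.strip().strip("$"), gold.strip().strip("$")
    if p == g:
        return True
    try:
        return Fraction(p) == Fraction(g)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers that differ as strings are judged wrong here;
        # the real library handles symbolic/LaTeX equivalence too.
        return False
```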
## 🤝 Citation
If you find this work helpful, please cite our paper:
```bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
year={2025},
eprint={2509.20712},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.20712},
}
```
```bibtex
@article{DBLP:journals/corr/abs-2508-07629,
author = {Zhenpeng Su and
Leiyu Pan and
Xue Bai and
Dening Liu and
Guanting Dong and
Jiaming Huang and
Wenping Hu and
Fuzheng Zhang and
Kun Gai and
Guorui Zhou},
title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
Clipping Policy Optimization},
journal = {CoRR},
volume = {abs/2508.07629},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2508.07629},
doi = {10.48550/ARXIV.2508.07629},
eprinttype = {arXiv},
eprint = {2508.07629},
timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```