Safetensors · English · qwen3

Improve model card: Add pipeline tag, library name, and update CE-GPPO paper/code links

#3 · opened by nielsr (HF Staff)

Files changed (1): README.md (+27 -20)
README.md CHANGED
@@ -1,28 +1,31 @@
  ---
- license: apache-2.0
- language:
- - en
  base_model:
  - Kwai-Klear/Klear-Reasoner-8B-SFT
  datasets:
  - Kwai-Klear/KlearReasoner-MathSub-30K
  - Kwai-Klear/KlearReasoner-CodeSub-15K
  metrics:
  - accuracy
  ---

- # ✨ Klear-Reasoner-8B
- We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues in current RL clipping mechanisms: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**), which gently backpropagates gradients from clipped tokens.

  | Resource | Link |
  |---|---|
- | 📝 Preprints | [Paper](https://arxiv.org/pdf/2508.07629) |
- | 🤗 Daily Paper | [Paper](https://huggingface.co/papers/2508.07629) |
  | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
  | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
  | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
- | 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
  | 📧 Contact | suzhenpeng13@163.com |

  ## 📌 Overview
@@ -40,8 +43,8 @@ Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA**
  | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |

  The model combines:
- 1. **Quality-centric long CoT SFT** – distilled from DeepSeek-R1-0528.
- 2. **Gradient-Preserving Clipping Policy Optimization (GPPO)** – a novel RL method that **keeps gradients from clipped tokens** to boost exploration & convergence.

  ---
@@ -66,18 +69,21 @@ When we expand the inference budget to 64K and adopt the YaRN method with a scal
  > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).

-
  ---

  ## 🧪 Training
  ### Configure the experimental environment
  ```bash
- git clone https://github.com/Kwai-Klear990901/Klear_Reasoner
- cd Klear_Reasoner
  pip install -r requirements.txt
  ```
  For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.

  ### Using Ray for Multi-Node Training
  For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
  #### Step 1: Start Ray on the Head Node (node0)
@@ -124,22 +130,23 @@ For LiveCodeBench, please download the data from the official website.

  You can run the following commands to perform inference and evaluation:
  ```bash
- git clone https://github.com/Kwai-Klear990901/KlearReasoner
- cd KlearReasoner/benchmarks
  python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
  python judge_math.py <path_to_inference_results>
  ```

  ## 🤝 Citation
  If you find this work helpful, please cite our paper:
  ```bibtex
- @misc{su2025klearreasoneradvancingreasoningcapability,
-       title={Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization},
-       author={Zhenpeng Su and Leiyu Pan and Xue Bai and Dening Liu and Guanting Dong and Jiaming Huang and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
        year={2025},
-       eprint={2508.07629},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
-       url={https://arxiv.org/abs/2508.07629},
  }
  ```
 
  ---
  base_model:
  - Kwai-Klear/Klear-Reasoner-8B-SFT
  datasets:
  - Kwai-Klear/KlearReasoner-MathSub-30K
  - Kwai-Klear/KlearReasoner-CodeSub-15K
+ language:
+ - en
+ license: apache-2.0
  metrics:
  - accuracy
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

+ # ✨ Klear-Reasoner-8B: Advancing Reasoning Capability via CE-GPPO
+
+ This repository contains the `Klear-Reasoner-8B` model, a powerful reasoning model that implements innovations from the paper **[CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712)**.

+ **CE-GPPO** introduces a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle, bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves an exploration-exploitation trade-off. This approach effectively mitigates entropy instability and consistently outperforms strong baselines across different model scales on mathematical reasoning benchmarks.
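As a rough intuition for the gradient-preserving idea, the toy sketch below contrasts the per-token gradient of the standard PPO clipped surrogate with a variant that keeps a small, bounded gradient for clipped tokens. This is an illustrative simplification, not the exact CE-GPPO objective from the paper; `EPS`, `BETA`, and the boundary cap are assumptions made for the example.

```python
# Toy contrast: standard PPO clipping vs. a gradient-preserving variant.
# Illustrative only -- not the paper's exact CE-GPPO formulation.

EPS = 0.2    # PPO clipping range (a typical default, assumed here)
BETA = 0.1   # hypothetical small coefficient for clipped tokens

def ppo_token_grad(ratio, advantage, eps=EPS):
    """d/d(log pi) of the clipped surrogate for one token.

    With r = exp(log pi - log pi_old), dr/d(log pi) = r, so the unclipped
    surrogate r*A has gradient r*A. When the clipped branch is active,
    the surrogate is constant in log pi and the gradient is exactly 0.
    """
    if advantage >= 0 and ratio > 1 + eps:
        return 0.0            # clipped: exploration signal dropped
    if advantage < 0 and ratio < 1 - eps:
        return 0.0            # clipped: suboptimal trajectory ignored
    return ratio * advantage  # inside the trust region

def gppo_token_grad(ratio, advantage, eps=EPS, beta=BETA):
    """Gradient-preserving variant: clipped tokens keep a small gradient,
    with magnitude capped at the clip boundary and scaled by beta."""
    g = ppo_token_grad(ratio, advantage, eps)
    if g != 0.0:
        return g
    boundary = (1 + eps) if advantage >= 0 else (1 - eps)
    return beta * boundary * advantage  # gentle, bounded signal

# A positive-advantage token pushed far outside the clip interval:
print(ppo_token_grad(1.8, 1.0))   # 0.0 -- the token contributes nothing
print(gppo_token_grad(1.8, 1.0))  # small positive gradient is retained
```

The point of the bounded coefficient is that clipped tokens still carry a controlled learning signal instead of being silently discarded, which is how entropy collapse/explosion gets moderated.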
 

  | Resource | Link |
  |---|---|
+ | 📄 Paper | [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712) |
+ | 🧑‍💻 Code & Issues | [GitHub: Kwai-Klear/CE-GPPO](https://github.com/Kwai-Klear/CE-GPPO) |
  | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
  | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
  | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
  | 📧 Contact | suzhenpeng13@163.com |

  ## 📌 Overview
 
  | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |

  The model combines:
+ 1. **Quality-centric long CoT SFT** – distilled from DeepSeek-R1-0528.
+ 2. **Gradient-Preserving Clipping Policy Optimization (CE-GPPO)** – a novel RL method that **keeps gradients from clipped tokens** to boost exploration & convergence.

  ---
 
 
  > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
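Concretely, avg@_n_ here is just the per-problem mean correctness over _n_ sampled generations, averaged across the benchmark. A minimal sketch (function and variable names are illustrative, not from the repo):

```python
# avg@n: estimate pass@1 by averaging 0/1 correctness over n samples
# per problem, then averaging across all problems in the benchmark.

def avg_at_n(results):
    """results: one list per problem of 0/1 correctness over n samples."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)

# Two problems, n=4 samples each:
print(avg_at_n([[1, 1, 0, 1], [0, 0, 1, 0]]))  # (0.75 + 0.25) / 2 = 0.5
```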

  ---

  ## 🧪 Training
  ### Configure the experimental environment
  ```bash
+ git clone https://github.com/Kwai-Klear/CE-GPPO
+ cd CE-GPPO
+ pip install -e .
  pip install -r requirements.txt
  ```
  For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.

+ ### Download a pre-trained checkpoint & data
+ We trained our model based on [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), using the [KlearReasoner-MathSub-30K](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) dataset for training, with [AIME2024](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify.json) and [AIME2025](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify25.json) as the validation sets.
+
  ### Using Ray for Multi-Node Training
  For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
  #### Step 1: Start Ray on the Head Node (node0)
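For reference, the standard Ray CLI invocations for this topology look like the following. The port is Ray's usual default and `<node0_ip>` is a placeholder; check the repo's own scripts for the values actually used.

```bash
# On the head node (node0):
ray start --head --port=6379

# On every worker node, join the cluster via the head node's address:
ray start --address='<node0_ip>:6379'

# On any node, confirm all nodes are registered before launching training:
ray status
```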
 

  You can run the following commands to perform inference and evaluation:
  ```bash
+ git clone https://github.com/Kwai-Klear/CE-GPPO
+ cd CE-GPPO/benchmarks
  python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
  python judge_math.py <path_to_inference_results>
  ```

+ ---
  ## 🤝 Citation
  If you find this work helpful, please cite our paper:
  ```bibtex
+ @misc{su2025cegppocontrollingentropygradientpreserving,
+       title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
+       author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
        year={2025},
+       eprint={2509.20712},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2509.20712},
  }
  ```