Safetensors
English
qwen3
Committed by Suu
Commit 46de3f8 · verified · 1 Parent(s): d30a73e

Update README.md

Files changed (1)
  1. README.md +46 -6
README.md CHANGED
@@ -11,6 +11,17 @@ metrics:
  - accuracy
  ---

+ ## 📣 Latest News
+ **[September 26, 2025]** 🔍 We further explored GPPO in depth and proposed **CE-GPPO**, focusing on the impact of ppo-clip tokens on entropy. 📄 The paper is available on [arXiv](https://arxiv.org/pdf/2509.20712) and [HuggingFace Daily](https://huggingface.co/papers/2509.20712).
+
+ **[August 12, 2025]** 🚀 We released the checkpoint for [KlearReasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B), along with the training data.
+
+ **[August 11, 2025]** 🔬 KlearReasoner-8B conducted preliminary exploration of GPPO.
+
+ **[August 11, 2025]** 🏆 We released KlearReasoner-8B, achieving SOTA performance among small-scale 7/8B models.
+
+ **[August 11, 2025]** 📢 KlearReasoner is available on [arXiv](https://arxiv.org/pdf/2508.07629) and [HuggingFace Daily](https://huggingface.co/papers/2508.07629).
+

  # ✨ Klear-Reasoner-8B
  We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**) that gently backpropagates gradients from clipped tokens.
@@ -133,13 +144,42 @@ python judge_math.py <path_to_inference_results>
  ## 🤝 Citation
  If you find this work helpful, please cite our paper:
  ```bibtex
- @misc{su2025klearreasoneradvancingreasoningcapability,
- title={Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization},
- author={Zhenpeng Su and Leiyu Pan and Xue Bai and Dening Liu and Guanting Dong and Jiaming Huang and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
+ @misc{su2025cegppocontrollingentropygradientpreserving,
+ title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
+ author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
  year={2025},
- eprint={2508.07629},
+ eprint={2509.20712},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
- url={https://arxiv.org/abs/2508.07629},
+ url={https://arxiv.org/abs/2509.20712},
+ }
+ ```
+
+
+ ```bibtex
+ @article{DBLP:journals/corr/abs-2508-07629,
+ author = {Zhenpeng Su and
+ Leiyu Pan and
+ Xue Bai and
+ Dening Liu and
+ Guanting Dong and
+ Jiaming Huang and
+ Wenping Hu and
+ Fuzheng Zhang and
+ Kun Gai and
+ Guorui Zhou},
+ title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
+ Clipping Policy Optimization},
+ journal = {CoRR},
+ volume = {abs/2508.07629},
+ year = {2025},
+ url = {https://doi.org/10.48550/arXiv.2508.07629},
+ doi = {10.48550/ARXIV.2508.07629},
+ eprinttype = {arXiv},
+ eprint = {2508.07629},
+ timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
+ biburl = {https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
+ bibsource = {dblp computer science bibliography, https://dblp.org}
  }
- ```
+ ```
+