Update README.md
README.md
@@ -11,6 +11,17 @@ metrics:
 - accuracy
 ---
 
+## Latest News
+**[September 26, 2025]** We further explored GPPO in depth and proposed **CE-GPPO**, focusing on the impact of ppo-clip tokens on entropy. The paper is available on [arXiv](https://arxiv.org/pdf/2509.20712) and [HuggingFace Daily](https://huggingface.co/papers/2509.20712).
+
+**[August 12, 2025]** We released the checkpoint for [KlearReasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B), along with the training data.
+
+**[August 11, 2025]** KlearReasoner-8B conducted preliminary exploration of GPPO.
+
+**[August 11, 2025]** We released KlearReasoner-8B, achieving SOTA performance among small-scale 7/8B models.
+
+**[August 11, 2025]** KlearReasoner is available on [arXiv](https://arxiv.org/pdf/2508.07629) and [HuggingFace Daily](https://huggingface.co/papers/2508.07629).
+
 
 # Klear-Reasoner-8B
 We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**) that gently backpropagates gradients from clipped tokens.
@@ -133,13 +144,42 @@ python judge_math.py <path_to_inference_results>
 ## Citation
 If you find this work helpful, please cite our paper:
 ```bibtex
-@misc{
-title={
-author={Zhenpeng Su and Leiyu Pan and
+@misc{su2025cegppocontrollingentropygradientpreserving,
+title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
+author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
 year={2025},
-eprint={
+eprint={2509.20712},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
-url={https://arxiv.org/abs/
+url={https://arxiv.org/abs/2509.20712},
+}
+```
+
+
+```bibtex
+@article{DBLP:journals/corr/abs-2508-07629,
+author = {Zhenpeng Su and
+Leiyu Pan and
+Xue Bai and
+Dening Liu and
+Guanting Dong and
+Jiaming Huang and
+Wenping Hu and
+Fuzheng Zhang and
+Kun Gai and
+Guorui Zhou},
+title = {Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving
+Clipping Policy Optimization},
+journal = {CoRR},
+volume = {abs/2508.07629},
+year = {2025},
+url = {https://doi.org/10.48550/arXiv.2508.07629},
+doi = {10.48550/ARXIV.2508.07629},
+eprinttype = {arXiv},
+eprint = {2508.07629},
+timestamp = {Sat, 13 Sep 2025 14:46:27 +0200},
+biburl = {https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
+bibsource = {dblp computer science bibliography, https://dblp.org}
 }
-```
+```
+