Improve model card: Add pipeline tag, library name, and update CE-GPPO paper/code links
#3
by nielsr (HF Staff) · opened
README.md CHANGED
@@ -1,28 +1,31 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Kwai-Klear/Klear-Reasoner-8B-SFT
 datasets:
 - Kwai-Klear/KlearReasoner-MathSub-30K
 - Kwai-Klear/KlearReasoner-CodeSub-15K
 metrics:
 - accuracy
 ---
 
 
-
-We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose **G**radient-**P**reserving clipping **P**olicy **O**ptimization (**GPPO**) that gently backpropagates gradients from clipped tokens.
 
 | Resource | Link |
 |---|---|
-
-
 | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
 | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
 | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
-| 🐛 Issues & Discussions | [GitHub Issues](https://github.com/suu990901/KlearReasoner/issues) |
 | 📧 Contact | suzhenpeng13@163.com |
 
 ## 🌟 Overview
@@ -40,8 +43,8 @@ Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA**
 | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |
 
 The model combines:
-1.
-2.
 
 ---
 
@@ -66,18 +69,21 @@ When we expand the inference budget to 64K and adopt the YaRN method with a scal
 
 > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
 
-
 ---
 
 ## 🧪 Training
 ### Configure the experimental environment
 ```bash
-git clone https://github.com/Kwai-
-cd
 pip install -r requirements.txt
 ```
 For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
 
 ### Using Ray for Multi-Node Training
 For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
 #### Step 1: Start Ray on the Head Node (node0)
@@ -124,22 +130,23 @@ For LiveCodeBench, please download the data from the official website.
 
 You can run the following commands to perform inference and evaluation:
 ```bash
-git clone https://github.com/Kwai-
-cd
 python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
 python judge_math.py <path_to_inference_results>
 ```
 
 ## 🤝 Citation
 If you find this work helpful, please cite our paper:
 ```bibtex
-@misc{
-  title={
-  author={Zhenpeng Su and Leiyu Pan and
   year={2025},
-  eprint={
   archivePrefix={arXiv},
   primaryClass={cs.LG},
-  url={https://arxiv.org/abs/
 }
 ```
 
@@ -1,28 +1,31 @@
 ---
 base_model:
 - Kwai-Klear/Klear-Reasoner-8B-SFT
 datasets:
 - Kwai-Klear/KlearReasoner-MathSub-30K
 - Kwai-Klear/KlearReasoner-CodeSub-15K
+language:
+- en
+license: apache-2.0
 metrics:
 - accuracy
+pipeline_tag: text-generation
+library_name: transformers
 ---
 
+# ✨ Klear-Reasoner-8B: Advancing Reasoning Capability via CE-GPPO
+
+This repository contains the `Klear-Reasoner-8B` model, a powerful reasoning model that implements innovations from the paper **[CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712)**.
 
+**CE-GPPO** introduces a novel algorithm that reintroduces, in a gentle and bounded manner, the gradients that native PPO discards for clipped tokens. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves an exploration-exploitation trade-off. The approach effectively mitigates entropy instability and consistently outperforms strong baselines across different model scales on mathematical reasoning benchmarks.
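
To make the gradient-preserving clipping idea concrete, the sketch below shows a PPO-style token loss in which a stop-gradient trick keeps the clipped value in the forward pass while a bounded gradient still reaches clipped tokens. This is a minimal illustration of the idea as described above, not the authors' implementation; the function name, the symmetric epsilon defaults, and the toy tensors are assumptions.

```python
import torch

def gradient_preserving_clip_loss(logp_new, logp_old, advantages,
                                  eps_low=0.2, eps_high=0.2):
    """Sketch: PPO-style loss whose clipped branch keeps a bounded gradient.

    Vanilla PPO's clip() yields zero gradient for tokens whose importance
    ratio leaves [1 - eps_low, 1 + eps_high]. Multiplying the ratio by the
    detached factor (clipped / ratio) leaves the forward value identical to
    the clipped ratio, while the gradient w.r.t. the new log-probs equals
    the clip bound itself: bounded and non-zero instead of zero.
    """
    ratio = torch.exp(logp_new - logp_old)             # importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    soft_clipped = ratio * (clipped / ratio).detach()  # forward value: clipped
    loss = -torch.min(ratio * advantages, soft_clipped * advantages)
    return loss.mean()

# Toy check: tokens outside the clip interval still receive gradient.
logp_old = torch.randn(8)
logp_new = (logp_old + torch.randn(8)).requires_grad_()
advantages = torch.randn(8)
gradient_preserving_clip_loss(logp_new, logp_old, advantages).backward()
print(logp_new.grad)
```

Inside the clipping interval the detached factor is 1, so the term reduces exactly to vanilla PPO; outside it, the gradient is scaled rather than zeroed, which is the exploration-preserving behavior the paragraph above describes.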
 
 | Resource | Link |
 |---|---|
+| 📄 Paper | [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://huggingface.co/papers/2509.20712) |
+| 🧑‍💻 Code & Issues | [GitHub: Kwai-Klear/CE-GPPO](https://github.com/Kwai-Klear/CE-GPPO) |
 | 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
 | 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
 | 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
 | 📧 Contact | suzhenpeng13@163.com |
 
 ## 🌟 Overview
@@ -40,8 +43,8 @@ Klear-Reasoner is an 8-billion-parameter reasoning model that achieves **SOTA**
 | **Score** | **90.5 %** | **83.2 %** | **66.0 %** | **58.1 %** |
 
 The model combines:
+1. **Quality-centric long CoT SFT** – distilled from DeepSeek-R1-0528.
+2. **Gradient-Preserving Clipping Policy Optimization (CE-GPPO)** – a novel RL method that **keeps gradients from clipped tokens** to boost exploration & convergence.
 
 ---
@@ -66,18 +69,21 @@ When we expand the inference budget to 64K and adopt the YaRN method with a scal
 
 > We report the average `pass@1` results (avg@_n_), with all other evaluation metrics following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
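
Spelled out, avg@_n_ samples _n_ generations per problem, scores each for correctness, and averages the per-problem pass rates; a small sketch (the helper name and data are hypothetical):

```python
def avg_at_n(correct: dict[str, list[bool]]) -> float:
    """avg@n: mean per-problem pass rate over n sampled generations."""
    per_problem = [sum(v) / len(v) for v in correct.values()]
    return sum(per_problem) / len(per_problem)

# Two problems, n = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_n({"p1": [True, True, False, True],
                "p2": [False, True, False, False]}))
```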
 
 ---
 
 ## 🧪 Training
 ### Configure the experimental environment
 ```bash
+git clone https://github.com/Kwai-Klear/CE-GPPO
+cd CE-GPPO
+pip install -e .
 pip install -r requirements.txt
 ```
 For the code, we use [Firejail](https://github.com/netblue30/firejail) for the **sandbox** environment. Additionally, we implemented multi-process control based on [Pebble](https://github.com/noxdafox/pebble), enabling automatic resource reclamation upon task timeout. For mathematics, we use [math_verify](https://github.com/huggingface/Math-Verify) for judging.
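
As a minimal illustration of the timeout handling described above (the worker function and limits are made up, and the real pipeline additionally confines execution with Firejail), Pebble's `ProcessPool.schedule` accepts a per-task `timeout` and terminates the worker process when it expires:

```python
import time
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def run_case(seconds: float) -> bool:
    """Stand-in for executing one sandboxed test case."""
    time.sleep(seconds)
    return True

if __name__ == "__main__":
    with ProcessPool(max_workers=4) as pool:
        future = pool.schedule(run_case, args=(30,), timeout=5)
        try:
            passed = future.result()
        except TimeoutError:
            # Pebble kills the worker process on timeout, so its memory and
            # file handles are reclaimed automatically.
            passed = False
        print(passed)  # False: the 30 s task exceeded the 5 s limit
```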
 
+### Download a pre-trained checkpoint & data
+We trained our model based on [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) and [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B), using the [KlearReasoner-MathSub-30K](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) dataset for training, with [AIME2024](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify.json) and [AIME2025](https://github.com/Kwai-Klear/CE-GPPO/blob/main/benchmarks/aime960_math_verify25.json) as the validation sets.
+
 ### Using Ray for Multi-Node Training
 For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
 #### Step 1: Start Ray on the Head Node (node0)
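
After `ray start` has been run on the head and worker nodes as outlined in these steps, a short Python check using Ray's public API can confirm that every machine has joined before the training script is launched (a sketch; adjust to your cluster):

```python
import ray

# Attach to the already-running cluster; this does not start a new one.
ray.init(address="auto")

for node in ray.nodes():
    status = "alive" if node["Alive"] else "dead"
    print(f"{node['NodeManagerAddress']}: {status}")
```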
@@ -124,22 +130,23 @@ For LiveCodeBench, please download the data from the official website.
 
 You can run the following commands to perform inference and evaluation:
 ```bash
+git clone https://github.com/Kwai-Klear/CE-GPPO
+cd CE-GPPO/benchmarks
 python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
 python judge_math.py <path_to_inference_results>
 ```
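
The judging step relies on the math_verify library mentioned earlier; its documented `parse`/`verify` entry points behave roughly like this (the answers here are made-up examples):

```python
from math_verify import parse, verify

# Parse the gold answer and a model prediction, then test equivalence.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${2,4} \\cup {1,3}$")
print(verify(gold, answer))  # True: the two set expressions are equivalent
```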
 
+---
 ## 🤝 Citation
 If you find this work helpful, please cite our paper:
 ```bibtex
+@misc{su2025cegppocontrollingentropygradientpreserving,
+  title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
+  author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
   year={2025},
+  eprint={2509.20712},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2509.20712},
 }
 ```