---
base_model:
  - Kwai-Klear/Klear-Reasoner-8B-SFT
datasets:
  - Kwai-Klear/KlearReasoner-MathSub-30K
  - Kwai-Klear/KlearReasoner-CodeSub-15K
language:
  - en
license: apache-2.0
metrics:
  - accuracy
pipeline_tag: text-generation
library_name: transformers
---

# ✨ Klear-Reasoner-8B: Advancing Reasoning Capability via CE-GPPO

This repository contains the Klear-Reasoner-8B model, a powerful reasoning model that implements the innovations from the paper [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://arxiv.org/abs/2509.20712).

CE-GPPO introduces a novel algorithm that reintroduces gradients from tokens clipped by vanilla PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves a better exploration-exploitation trade-off. This approach effectively mitigates entropy instability and consistently outperforms strong baselines across model scales on mathematical reasoning benchmarks.
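
To make the idea concrete, here is a minimal, unofficial PyTorch sketch of a gradient-preserving clipped surrogate. The `detach` trick, the clip thresholds, and the function name are illustrative assumptions; the official implementation lives in the CE-GPPO repository linked below.

```python
import torch

def gradient_preserving_clip_loss(logp_new, logp_old, advantages,
                                  eps_low=0.2, eps_high=0.28):
    """Illustrative sketch: the forward pass matches vanilla PPO-clip, but
    clipped tokens keep a bounded gradient instead of a zero gradient."""
    ratio = torch.exp(logp_new - logp_old)
    # ratio / ratio.detach() equals 1 in the forward pass, yet still carries
    # d(ratio)/d(theta) in the backward pass, so multiplying the *detached*
    # clipped value by it re-attaches a bounded gradient to clipped tokens.
    soft_clip = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach() \
                * (ratio / ratio.detach())
    # Same pessimistic min as vanilla PPO-clip.
    return -torch.min(ratio * advantages, soft_clip * advantages).mean()
```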

| Resource | Link |
| --- | --- |
| 📄 Paper | [CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning](https://arxiv.org/abs/2509.20712) |
| 🧑‍💻 Code & Issues | [GitHub: Kwai-Klear/CE-GPPO](https://github.com/Kwai-Klear/CE-GPPO) |
| 🤗 Model Hub | [Klear-Reasoner-8B](https://huggingface.co/Kwai-Klear/Klear-Reasoner-8B) |
| 🤗 Dataset Hub | [Math RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-MathSub-30K) |
| 🤗 Dataset Hub | [Code RL](https://huggingface.co/datasets/Kwai-Klear/KlearReasoner-CodeSub-15K) |
| 📧 Contact | suzhenpeng13@163.com |

πŸ“Œ Overview

*Figure: Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01–2025/02/01, avg@8), and V6 (2025/02/01–2025/05/01, avg@8).*

Klear-Reasoner is an 8-billion-parameter reasoning model that achieves SOTA performance on challenging math and coding benchmarks:

| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
| --- | --- | --- | --- | --- |
| Score | 90.5% | 83.2% | 66.0% | 58.1% |

The model combines:

  1. Quality-centric long CoT SFT – distilled from DeepSeek-R1-0528.
  2. Gradient-Preserving Clipping Policy Optimization (CE-GPPO) – a novel RL method that keeps gradients from clipped tokens to boost exploration & convergence.
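
For a quick start, here is a minimal `transformers` loading sketch. The repo id `Kwai-Klear/Klear-Reasoner-8B` is assumed from the Model Hub link above, and the sampling settings mirror the evaluation protocol reported below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwai-Klear/Klear-Reasoner-8B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user",
             "content": "Prove that the sum of two even numbers is even."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# temperature=0.6, top_p=0.95 follow the evaluation settings below.
outputs = model.generate(inputs, max_new_tokens=4096,
                         do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```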

## Evaluation

For the best results, we expand the inference budget to 64K tokens and adopt the YaRN method with a scaling factor of 2.5. More evaluation results are coming soon; stay tuned.

πŸ“Š Benchmark Results (Pass@1)

Model AIME2024
avg@64
AIME2025
avg@64
HMMT2025
avg@64
LCB V5
avg@8
LCB V6
avg@8
AReal-boba-RL-7B 61.9 48.3 29.4 34.3 31.0†
MiMo-7B-RL 68.2 55.4 35.7 57.8 49.3
Skywork-OR1-7B 70.2 54.6 35.7 47.6 42.7
AceReason-Nemotron-1.1-7B 72.6 64.8 42.9 57.2 52.1
POLARIS-4B-Preview 81.2 79.4 58.7 58.5† 53.0†
Qwen3-8B 76.0 67.3 44.7† 57.5 48.4†
Deepseek-R1-0528-Distill-8B 86.0 76.3 61.5 61.0† 51.6†
OpenReasoning-Nemotron-7B 84.7 78.2 63.5 _65.6_† _56.3_†
Klear-Reasoner-8B-SFT 75.6 70.1 57.6 58.5 49.6
Klear-Reasoner-8B 83.2 75.6 60.3 61.6 53.1
w/ 64K Inference Budget 90.5 83.2 70.8 66.0 58.1

We report average pass@1 results (avg@n); all other evaluation settings follow the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
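
For avg@n scoring, a vLLM sampling setup like the following sketch reproduces these decoding parameters (the repo id and the 32K token budget are assumptions):

```python
from vllm import LLM, SamplingParams

# Decoding parameters from the evaluation protocol above; n controls avg@n.
params = SamplingParams(n=64, temperature=0.6, top_p=0.95, max_tokens=32768)
llm = LLM(model="Kwai-Klear/Klear-Reasoner-8B")  # assumed repo id
outputs = llm.generate(["<your prompt here>"], params)
# avg@n pass@1 = fraction of the n completions judged correct, averaged
# over the benchmark's problems.
```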


πŸ§ͺ Training

### Configure the experimental environment

```bash
git clone https://github.com/Kwai-Klear/CE-GPPO
cd CE-GPPO
pip install -e .
pip install -r requirements.txt
```

For code tasks, we use Firejail as the sandbox environment. Additionally, we implement multi-process control on top of Pebble, enabling automatic resource reclamation when a task times out. For mathematics, we use math_verify for judging.
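
As an illustration of this timeout-guarded judging pattern, here is a minimal sketch (not the repository's actual judge code; the worker count and 10-second timeout are arbitrary choices):

```python
from concurrent.futures import TimeoutError

from pebble import ProcessPool
from math_verify import parse, verify

def judge_math(gold_str: str, answer_str: str) -> bool:
    # math_verify parses LaTeX/plain-text answers and checks equivalence.
    return verify(parse(gold_str), parse(answer_str))

with ProcessPool(max_workers=8) as pool:
    # Pebble kills the worker process and reclaims its resources on timeout.
    future = pool.schedule(judge_math, args=("$\\frac{1}{2}$", "0.5"), timeout=10)
    try:
        print(future.result())  # True if the answer matches the gold label
    except TimeoutError:
        print("judging timed out")
```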

### Download a pre-trained checkpoint & data

We trained our model based on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-1.5B, using the KlearReasoner-MathSub-30K dataset for training, with AIME2024 and AIME2025 as the validation sets.
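
One way to fetch a checkpoint and the training data is via the Hugging Face CLI (the local directories below are placeholders):

```bash
# Base model checkpoint (pick the one your experiment uses)
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir ./models/DeepSeek-R1-Distill-Qwen-7B
# RL training data
huggingface-cli download Kwai-Klear/KlearReasoner-MathSub-30K \
  --repo-type dataset --local-dir ./data/KlearReasoner-MathSub-30K
```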

### Using Ray for Multi-Node Training

For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:

#### Step 1: Start Ray on the Head Node (node0)

On the first node (typically called node0), run:

```bash
ray start --head --dashboard-host=0.0.0.0
```

Then get the IP address of the head node:

```bash
MASTER_IP=$(hostname -I | awk '{print $1}')
```

#### Step 2: Connect Other Nodes (e.g., node1)

On each additional worker node (e.g., node1), run the following, replacing the address with that of your head node:

```bash
ray start --address="$MASTER_IP:6379"
```
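
You can confirm that all nodes have joined the cluster before launching training:

```bash
ray status  # lists connected nodes and their available resources
```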

### RL Training

Run the following script on the head node to start the training task:

```bash
bash recipe/dapo/perf_run_dapo_ours_math.sh  # For Math RL
bash recipe/dapo/perf_run_dapo_ours_code.sh  # For Code RL
```

In the startup script, you need to set the following variables (a filled-in example follows below):

```bash
YOUR_MODEL_PATH="<your_model_path>"
CKPTS_SAVE_DIR="<ckpts_save_path>"
YOUR_TRAIN_FILE="<train_data_path>"
YOUR_TEST_FILE="<test_data_path>"
```
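
For example, a filled-in configuration might look like this (every path here is hypothetical; substitute your own):

```bash
YOUR_MODEL_PATH="/data/models/DeepSeek-R1-Distill-Qwen-7B"    # hypothetical path
CKPTS_SAVE_DIR="/data/ckpts/cegppo_math"                      # hypothetical path
YOUR_TRAIN_FILE="/data/data/KlearReasoner-MathSub-30K.jsonl"  # hypothetical path
YOUR_TEST_FILE="/data/data/aime2024.jsonl"                    # hypothetical path
```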

## Evaluation

For the 64K-budget results, we expand the inference budget to 64K tokens and adopt the YaRN method with a scaling factor of 2.5.
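
If you serve the model with vLLM, YaRN scaling can be enabled at load time. A sketch follows; the rope parameters assume a Qwen3-style config with a 32768-token native context and are illustrative, not an official recipe:

```bash
vllm serve Kwai-Klear/Klear-Reasoner-8B \
  --rope-scaling '{"rope_type":"yarn","factor":2.5,"original_max_position_embeddings":32768}' \
  --max-model-len 81920  # 32768 * 2.5
```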

The evaluation data for AIME24, AIME25, and HMMT2025 are available in our GitHub repository under the benchmarks directory. For LiveCodeBench, please download the data from the official website.

You can run the following commands to perform inference and evaluation:

```bash
git clone https://github.com/Kwai-Klear/CE-GPPO
cd CE-GPPO/benchmarks
python inference.py --model <KlearReasoner-8B_path> --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
python judge_math.py <path_to_inference_results>
```

## 🤝 Citation

If you find this work helpful, please cite our paper:

```bibtex
@misc{su2025cegppocontrollingentropygradientpreserving,
      title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
      author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
      year={2025},
      eprint={2509.20712},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.20712},
}
```