---
license: apache-2.0
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
---

# 🚀 GRPO-LEAD: Efficient Reasoning Enhancement for Mathematical Tasks

---

## 📚 Overview

**GRPO-LEAD** (**GRPO** with **L**ength-dependent rewards, **E**xplicit penalties, and **A**dvantage reweighting for **D**ifficulty) is an advanced reinforcement learning pipeline designed to fine-tune large language models (LLMs) for concise, accurate, and efficient reasoning in mathematical tasks.
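
As rough intuition only, the toy sketch below illustrates the three ideas in the name: a length-dependent reward favoring shorter correct answers, an explicit penalty for wrong answers, and difficulty-based advantage reweighting. Every function, constant, and shape here is our own illustrative assumption; the exact formulations are given in the paper (arXiv:2504.09696).

```python
# Toy illustration (NOT the paper's exact formulation) of GRPO-LEAD's components.
import math

def lead_style_reward(correct: bool, length: int, ref_length: float = 8000.0,
                      wrong_penalty: float = 1.0) -> float:
    """Reward correct answers more when they are shorter; penalize errors explicitly."""
    if correct:
        return math.exp(-length / ref_length)  # shorter solution => reward closer to 1
    return -wrong_penalty                      # explicit penalty for wrong answers

def reweight_advantage(advantage: float, pass_rate: float) -> float:
    """Scale a GRPO advantage by question difficulty (low pass rate => harder)."""
    difficulty_weight = 1.0 + (1.0 - pass_rate)  # in [1, 2]
    return advantage * difficulty_weight

print(lead_style_reward(correct=True, length=4000))  # ~0.61
print(reweight_advantage(0.5, pass_rate=0.2))        # 0.9
```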

---

## 📊 Performance Benchmarks

The following benchmarks were run on the AIME24 and AIME25 datasets with a 14k maximum token budget, temperature 0.6, min-p 0.01, and 32 samples per question.

| **Model**           | **AIME24 Cons@32** | **AIME24 Pass@1** | **AIME24 Avg. Length** | **AIME25 Cons@32** | **AIME25 Pass@1** | **AIME25 Avg. Length** |
|---------------------|--------------------|-------------------|------------------------|--------------------|-------------------|------------------------|
| **DeepSeek-Distilled-14B**    | 0.800              | 0.614             | 9182                   | 0.633              | 0.429             | 10046                  |
| **Light-R1-14B-DS**    | 0.833              | 0.641             | 9571                   | 0.767              | 0.505             | 10194                  |
| **LEAD-14B (ours)** | **0.867**          | **0.650**         | **8267**               | **0.767**          | **0.539**         | **8668**               |

Our GRPO-LEAD model matches or exceeds both baselines on consistency and accuracy while producing markedly shorter solutions, evidence of more efficient reasoning.
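
The evaluation settings above map directly onto standard sampling parameters. The sketch below uses vLLM, which is an assumption of ours (the card does not specify the inference stack), and the model path is a placeholder:

```python
# Hypothetical reproduction of the evaluation sampling setup
# (14k max tokens, temperature 0.6, min-p 0.01, 32 samples per question).
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/LEAD-14B")  # placeholder checkpoint path
params = SamplingParams(
    n=32,               # 32 samples per question (for Cons@32 / Pass@1)
    temperature=0.6,
    min_p=0.01,
    max_tokens=14_000,  # 14k maximum tokens
)
outputs = llm.generate(["<formatted prompt here>"], params)
answers = [o.text for o in outputs[0].outputs]  # the 32 sampled completions
```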

---

## ⚙️ Usage

For the best performance on mathematical problems, use the following prompt format:
```python
[
    {
        "role": "user",
        "content": question + "\nLet's think step by step and output the final answer within \\boxed{}."
    }
]
```
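
For instance, here is a minimal end-to-end sketch using Hugging Face `transformers`; the checkpoint id is a placeholder, and the generation settings mirror the evaluation section above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/LEAD-14B"  # placeholder: substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

question = "What is the sum of the first 100 positive integers?"
messages = [
    {
        "role": "user",
        "content": question + "\nLet's think step by step and output the final answer within \\boxed{}.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=14_000,
    do_sample=True,
    temperature=0.6,  # min_p=0.01 can be added on recent transformers versions
)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```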

---

## 📂 Code and Documentation

For complete details, codebase, and usage examples, please visit our GitHub repository:

[**📌 GitHub Repository**](https://github.com/aeroplanepaper/GRPO-LEAD)

---

## 📦 Dataset: GRPO-LEAD-SFTData

We release [**GRPO-LEAD-SFTData**](https://huggingface.co/datasets/PlanePaper/GRPO-LEAD-SFTData), a curated collection of **12,153** high-quality mathematical reasoning samples for supervised fine-tuning, generated with [**QwQ-32B**](https://huggingface.co/Qwen/QwQ-32B).
The samples are derived primarily from the [**DeepScaler**](https://github.com/agentica-project/rllm) dataset; we retain only examples with **difficulty > 1** to target challenging problem-solving scenarios. All entries follow a standardized SFT-ready format for seamless integration with [**LLaMA Factory**](https://github.com/hiyouga/LLaMA-Factory).

This dataset serves as the training data for GRPO-LEAD's supervised fine-tuning stage, strengthening the model's base capability for solving mathematical problems.
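
To inspect it, the dataset can be loaded with the `datasets` library; the `train` split name below is our assumption, so check the dataset card:

```python
from datasets import load_dataset

# Load the SFT dataset from the Hub ("train" split name assumed).
sft_data = load_dataset("PlanePaper/GRPO-LEAD-SFTData", split="train")
print(len(sft_data))  # expected: 12,153 samples
print(sft_data[0])    # inspect one SFT-ready record
```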

---
## 📖 Citation

If you find our work useful, please cite it as:

```bibtex
@misc{zhang2025grpoleaddifficultyawarereinforcementlearning,
      title={GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models}, 
      author={Jixiao Zhang and Chunsheng Zuo},
      year={2025},
      eprint={2504.09696},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.09696}, 
}
```
Enjoy exploring GRPO-LEAD! 🚀✨