# Qwen3-0.6B DGPO Training

**Difficulty-Aware Group Policy Optimization (DGPO)** - ICLR 2026

## What is DGPO?

DGPO is a reinforcement learning method that extends GRPO (Group Relative Policy Optimization) with difficulty-aware mechanisms. It comes from the **MathForge** paper, "Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO".

### Key Innovation

Standard GRPO has an implicit imbalance: **harder questions receive smaller policy updates**. DGPO fixes this with two mechanisms:

1. **Difficulty-Balanced Group Advantage Estimation**
   - Uses Mean Absolute Deviation (MAD) instead of the standard deviation
   - Normalizes advantages based on question difficulty

2. **Difficulty-Aware Question Weighting (DQW)**
   - Prioritizes harder questions during training
   - Uses softmax weighting with a temperature parameter
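
The two mechanisms above can be sketched in a few lines of plain Python. This is an illustrative reading, not the paper's exact formulation: it assumes a group's pass rate serves as the difficulty signal, and that a higher temperature puts more weight on harder questions.

```python
import math

def dgpo_advantages(rewards, eps=1e-6):
    """Difficulty-balanced group advantages (sketch): center the group's
    rewards and scale by Mean Absolute Deviation (MAD) instead of std."""
    mean = sum(rewards) / len(rewards)
    mad = sum(abs(r - mean) for r in rewards) / len(rewards)
    return [(r - mean) / (mad + eps) for r in rewards]

def dqw_weights(pass_rates, temp=2.0):
    """Difficulty-aware question weights (sketch): softmax over
    difficulty (1 - pass rate); higher temp favors harder questions."""
    logits = [temp * (1.0 - p) for p in pass_rates]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    n = len(exps)
    return [n * e / total for e in exps]  # rescaled so weights average to 1.0
```

For example, a group of rewards `[1, 0, 0, 1]` has MAD 0.5 and yields advantages close to `[1, -1, -1, 1]`, while a question solved only 10% of the time receives a larger DQW weight than one solved 90% of the time.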

## DGPO vs DPO vs GRPO

| Method | Type | Key Feature | Best For |
|--------|------|-------------|----------|
| **DPO** | Preference | Pairwise preferences | General alignment |
| **GRPO** | RLVR | Group-based rewards | Math reasoning |
| **DGPO** | RLVR | Difficulty-aware rewards | Hard math problems |

## Installation

```bash
# Install dependencies
pip install torch transformers trl peft datasets accelerate

# Clone MathForge (optional, for reference)
git clone https://github.com/AMAP-ML/MathForge.git
```

## Quick Start

### Basic DGPO Training

```bash
python train_dgpo_qwen3.py \
    --model_name Qwen/Qwen3-0.6B \
    --enable_dgpo \
    --enable_dgpo_dqw \
    --dgpo_dqw_temp 2.0
```

### With Custom Parameters

```bash
python train_dgpo_qwen3.py \
    --model_name Qwen/Qwen3-0.6B \
    --dataset_name DigitalLearningGmbH/MATH-lighteval \
    --enable_dgpo \
    --enable_dgpo_dqw \
    --dgpo_dqw_temp 2.0 \
    --num_generations 8 \
    --max_completion_length 1024 \
    --learning_rate 1e-6 \
    --num_train_epochs 3
```

### Using the Shell Script

```bash
chmod +x run_dgpo.sh
./run_dgpo.sh
```

## DGPO Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--enable_dgpo` | True | Enable the DGPO algorithm |
| `--enable_dgpo_dqw` | True | Enable Difficulty-Aware Question Weighting |
| `--dgpo_dqw_temp` | 2.0 | Temperature for DQW (higher = more focus on hard questions) |

## Training Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--num_generations` | 4 | Number of completions per prompt |
| `--max_completion_length` | 512 | Maximum tokens per completion |
| `--learning_rate` | 5e-7 | Learning rate (keep it small for DGPO) |
| `--num_train_epochs` | 1 | Number of training epochs |
| `--beta` | 0.0 | KL coefficient (0 = no reference model) |
| `--temperature` | 0.7 | Sampling temperature |
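
The flags in both tables map naturally onto `argparse`. The sketch below is a hypothetical reconstruction of the command-line interface of `train_dgpo_qwen3.py` using the documented defaults; the actual script may declare these differently.

```python
import argparse

# Hypothetical CLI matching the parameter tables above; the real
# train_dgpo_qwen3.py may define additional or different options.
def build_parser():
    p = argparse.ArgumentParser(description="DGPO training for Qwen3-0.6B (sketch)")
    p.add_argument("--model_name", default="Qwen/Qwen3-0.6B")
    p.add_argument("--dataset_name", default="DigitalLearningGmbH/MATH-lighteval")
    # DGPO-specific switches (default on, per the DGPO Parameters table)
    p.add_argument("--enable_dgpo", action="store_true", default=True)
    p.add_argument("--enable_dgpo_dqw", action="store_true", default=True)
    p.add_argument("--dgpo_dqw_temp", type=float, default=2.0)
    # Generic training knobs (defaults from the Training Parameters table)
    p.add_argument("--num_generations", type=int, default=4)
    p.add_argument("--max_completion_length", type=int, default=512)
    p.add_argument("--learning_rate", type=float, default=5e-7)
    p.add_argument("--num_train_epochs", type=int, default=1)
    p.add_argument("--beta", type=float, default=0.0)
    p.add_argument("--temperature", type=float, default=0.7)
    return p

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```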

## Performance

From the MathForge paper (Qwen2.5-Math-7B):

| Method | AIME24 | AIME25 | AMC23 | MATH500 | Avg. |
|--------|--------|--------|-------|---------|------|
| GRPO | 20.94 | 8.44 | 58.98 | 72.20 | 37.61 |
| DGPO | 23.85 | 10.21 | **61.02** | 74.25 | 39.79 |
| MathForge | **24.58** | **12.60** | 59.84 | **79.95** | **42.17** |

On average, DGPO improves over GRPO by +2.18 points (37.61 → 39.79).

## References

- **Paper**: [Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO](https://arxiv.org/abs/2601.20614)
- **Code**: [AMAP-ML/MathForge](https://github.com/AMAP-ML/MathForge)
- **Conference**: ICLR 2026

## Citation

```bibtex
@inproceedings{dai2026harder,
  title={Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation},
  author={Dai, Yanqi and Ji, Yuxiang and Zhang, Xiao and Wang, Yong and Chu, Xiangxiang and Lu, Zhiwu},
  booktitle={ICLR},
  year={2026}
}
```