dylan-marimo-io committed · verified
Commit b01d078 · 1 parent: acb0bbf

Update README.md

Files changed (1):
  1. README.md +70 -5
---
title: Reward Policy Intuition
emoji: 🏃📊
colorFrom: purple
colorTo: red
sdk: docker
pinned: true
license: mit
arxiv: 2601.05242
short_description: 'GRPO vs GDPO: Understanding Multi-Reward Policy Optimization'
---

# GRPO vs GDPO: Why Normalization Order Matters

An interactive visualization demonstrating **advantage collapse** in multi-reward reinforcement learning, and how GDPO fixes it.

Based on [NVIDIA's GDPO paper (arXiv:2601.05242)](https://arxiv.org/abs/2601.05242).

## The Problem

When training LLMs with multiple reward signals (correctness, format, style), GRPO normalizes the *combined* reward. This causes **advantage collapse**: smaller-scale rewards get washed out by larger-scale ones.

| Method | Normalization | Result |
|--------|---------------|--------|
| **GRPO** | Aggregate → Normalize | Small-scale signals lost |
| **GDPO** | Normalize → Aggregate | All signals preserved |
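The difference in the table can be sketched numerically. A minimal NumPy illustration with made-up rewards (the four-rollout group, the ~1-scale correctness bit, and the ~100-scale style score are all assumptions for illustration):

```python
import numpy as np

# Made-up rewards for a group of 4 rollouts: a binary correctness signal
# (scale ~1) and a style score on a ~100 scale. Values are illustrative.
correctness = np.array([1.0, 0.0, 1.0, 0.0])
style = np.array([90.0, 110.0, 95.0, 105.0])

# GRPO: aggregate first, then normalize the combined reward.
combined = correctness + style
grpo_adv = (combined - combined.mean()) / combined.std()

# GDPO: normalize each reward dimension independently, then aggregate.
def znorm(r):
    return (r - r.mean()) / r.std()

gdpo_adv = znorm(correctness) + znorm(style)

# Under GRPO the correct rollouts average a *negative* advantage here,
# because the larger-scale style score dominates the combined statistics.
print("GRPO:", grpo_adv.round(2))
print("GDPO:", gdpo_adv.round(2))
```

With these numbers, GRPO ranks rollouts almost entirely by style, while GDPO lets both signals contribute at equal scale.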
27
+
28
+ ## The Solution
29
+
30
+ GDPO normalizes each reward dimension *independently* (to mean=0, std=1) before combining them. This ensures every reward contributes proportionally to its
31
+ weight, regardless of original scale.
32
+
33
+ $$\text{GRPO: } A_j = \frac{\sum_i r_j^{(i)} - \mu}{\sigma} \quad \text{vs} \quad \text{GDPO: } A_j = \sum_i \frac{r_j^{(i)} - \mu^{(i)}}{\sigma^{(i)}}$$
34
+
35
+ ### Binary Rewards Widget
36
+ Based on the [Berkeley Function Calling Leaderboard (BFCL)](https://gorilla.cs.berkeley.edu/leaderboard.html) dataset. Toggle binary rewards for 12 rollouts:
37
+ - **Correctness**: Does the function call execute?
38
+ - **Style**: Are arguments formatted correctly?
39
+ - **Conciseness**: Free of redundant parameters?
40
+
41
+ See how GRPO assigns **identical advantages** to `[1,0,1]` and `[0,1,1]` (same total), while GDPO differentiates them.
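That observation can be reproduced in a few lines. A sketch with a hypothetical four-rollout group (not the app's 12); note that GDPO separates the two rollouts because the correctness column has different group statistics than the other columns, which is what per-dimension normalization exploits:

```python
import numpy as np

# Hypothetical group of 4 rollouts; columns: [correctness, style, conciseness].
# Chosen so correctness passes less often than the other two signals.
rewards = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
], dtype=float)

# GRPO: sum the dimensions per rollout, then normalize across the group.
totals = rewards.sum(axis=1)
grpo_adv = (totals - totals.mean()) / totals.std()

# GDPO: normalize each column across the group, then sum per rollout.
gdpo_adv = ((rewards - rewards.mean(axis=0)) / rewards.std(axis=0)).sum(axis=1)

# [1,0,1] and [0,1,1] share a total of 2, so GRPO cannot tell them apart,
# while GDPO credits the rarer correctness signal more.
print("GRPO:", grpo_adv.round(2))
print("GDPO:", gdpo_adv.round(2))
```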

### Training Convergence

Train a toy Bernoulli policy on 3 binary rewards:

- **GDPO**: All dimensions converge to p≈1 independently
- **GRPO**: All dimensions collapse onto the same trajectory
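A rough re-creation of that experiment, assuming a REINFORCE-style update and per-dimension reward scales of 1, 10, and 100 (the scales, group size, and learning rate are illustrative choices, not the app's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def train(use_gdpo, scales=(1.0, 10.0, 100.0), epochs=150, group=64, lr=0.3):
    # Toy setup: 3 independent Bernoulli action bits; reward dimension i
    # pays scales[i] whenever bit i is 1.
    theta = np.zeros(3)  # logits; p = sigmoid(theta) starts at 0.5
    history = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-theta))
        actions = (rng.random((group, 3)) < p).astype(float)
        rewards = actions * np.asarray(scales)  # shape (group, 3)
        if use_gdpo:
            # GDPO: normalize each reward dimension, then aggregate
            z = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-8)
            adv = z.sum(axis=1)
        else:
            # GRPO: aggregate, then normalize the combined reward
            total = rewards.sum(axis=1)
            adv = (total - total.mean()) / (total.std() + 1e-8)
        # REINFORCE gradient for independent Bernoullis: E[A * (a - p)]
        theta += lr * (adv[:, None] * (actions - p)).mean(axis=0)
        history.append(p.copy())
    return np.array(history)

gdpo_hist = train(use_gdpo=True)
grpo_hist = train(use_gdpo=False)
# Under GDPO every dimension climbs toward p = 1; under GRPO the
# smallest-scale dimension receives a much weaker learning signal.
print("GDPO final p:", gdpo_hist[-1].round(2))
print("GRPO final p:", grpo_hist[-1].round(2))
```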

## Key Visualizations

### Advantage Bar Chart

A side-by-side comparison of GRPO and GDPO advantages, sorted by GDPO rank. Detects and highlights advantage collapse when multiple rollouts receive identical GRPO advantages.

### Policy Convergence Plot

Shows probability trajectories over 150 training epochs. GDPO learns each reward dimension independently; GRPO can't distinguish which rewards matter.

## When to Use Each

| Use GDPO | Use GRPO |
|----------|----------|
| Multiple reward scales | Single reward |
| Binary + continuous rewards | Similar scales |
| All signals matter equally | One dominant reward |

## Implementation

It's a one-line change:

- **TRL**: `apply_gdpo: True`
- **VERL**: `adv_estimator: 'gdpo'`
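For VERL, the flag above would sit under the `algorithm` section of the trainer config. A hedged sketch (the exact file layout is an assumption, and the `'gdpo'` value presumes the NVlabs/GDPO patch is installed, since it is not a stock VERL estimator):

```yaml
# Hypothetical VERL config fragment; 'gdpo' assumes the NVlabs/GDPO patch.
algorithm:
  adv_estimator: 'gdpo'  # stock GRPO would be 'grpo'
```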

## References

- **GDPO Paper**: [NVIDIA, arXiv:2601.05242](https://arxiv.org/abs/2601.05242)
- **Code**: [github.com/NVlabs/GDPO](https://github.com/NVlabs/GDPO)
- **Dataset**: [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard.html)

---

Check out marimo at <https://github.com/marimo-team/marimo>