GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity
Abstract
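Three seemingly distinct methods for training reasoning language models (GRPO, Dr. GRPO, and DAPO) are shown to be three ways of handling one quantity: the standard deviation of the rewards within a prompt's group of sampled answers. For right-or-wrong rewards, that disagreement among sampled answers determines both whether a prompt contributes to learning and how large its update is.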
Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three act on a single number: the standard deviation of a prompt's rewards, which measures how much that prompt's sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks measures the disagreement: it is largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update, a result the paper calls the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. The paper confirms this on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
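The three operations described in the abstract can be sketched on a single group of right-or-wrong rewards. This is a minimal illustration, not any of the methods' reference implementations: the function name `group_advantages`, the `method` strings, and the `eps` stabilizer are assumptions made for the example, the population standard deviation is used throughout, and everything else in the real algorithms (clipping, token-level losses, length handling) is left out.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, method="grpo", eps=1e-6):
    """Per-sample advantages for one prompt's group of rewards.

    Hypothetical helper for illustration only. Returns None when the
    DAPO-style filter discards the group.
    """
    mean = fmean(rewards)
    sigma = pstdev(rewards)  # population std: the "one number" in question
    centered = [r - mean for r in rewards]

    if method == "grpo":
        # GRPO: divide the centered rewards by the group std.
        return [c / (sigma + eps) for c in centered]
    if method == "dr_grpo":
        # Dr. GRPO: keep the centering, drop the division.
        return centered
    if method == "dapo":
        # DAPO (dynamic sampling): discard groups whose std is zero.
        if sigma == 0.0:
            return None
        return [c / (sigma + eps) for c in centered]
    raise ValueError(f"unknown method: {method}")


split = [1, 1, 0, 0]      # answers disagree: sigma = 0.5
unanimous = [1, 1, 1, 1]  # answers agree: sigma = 0

print(group_advantages(split, "dr_grpo"))    # magnitudes equal sigma
print(group_advantages(split, "grpo"))       # magnitudes rescaled to about 1
print(group_advantages(unanimous, "grpo"))   # all zeros: the group is silent
print(group_advantages(unanimous, "dapo"))   # None: the group is dropped
```

In this sketch the unanimous group contributes nothing under any of the three settings; the settings differ only in whether a split group's advantages are rescaled and whether silent groups are kept in the batch.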
Community
Three of the hottest RL methods for reasoning LLMs are secretly the same trick.
The shared knob is the group's reward standard deviation. For right/wrong rewards it isn't a normalizer sitting next to the gradient, it IS the gradient's size: sigma = sqrt(k(G-k))/G, where k of the G sampled answers are correct. A group split evenly between right and wrong teaches the most; a unanimous group teaches nothing and goes silent. GRPO divides by sigma, Dr. GRPO drops the division, DAPO throws away the groups where it's zero. Out of the identity fall closed forms for the two knobs you actually set: how many samples per prompt you need, and the silent-group rate DAPO discards. Verified on Big-Math and a live GRPO run.
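The formula sigma = sqrt(k(G-k))/G can be checked numerically, and the two knobs mentioned above can be illustrated under a simple model. The sketch below assumes each of the G samples is independently correct with probability p; under that assumption a group is silent (unanimous) with probability p^G + (1-p)^G. The paper's own closed forms are not reproduced on this page, so the helper names and the sizing rule here are assumptions for illustration rather than the authors' expressions.

```python
import math
from statistics import pstdev


def sigma_closed_form(k, G):
    """Population std of a group with k correct answers out of G (0/1 rewards)."""
    return math.sqrt(k * (G - k)) / G


def silent_group_rate(p, G):
    """Probability that all G samples agree, ASSUMING i.i.d. Bernoulli(p) answers."""
    return p**G + (1 - p) ** G


def min_group_size(p, target_active):
    """Smallest G whose non-silent probability reaches target_active.

    Hypothetical sizing rule for illustration, under the same i.i.d. assumption.
    """
    if not 0 < p < 1 or not 0 < target_active < 1:
        raise ValueError("p and target_active must lie strictly between 0 and 1")
    G = 1
    while 1 - silent_group_rate(p, G) < target_active:
        G += 1
    return G


# The closed form agrees with the directly computed population std.
for G in (2, 4, 8, 16):
    for k in range(G + 1):
        rewards = [1] * k + [0] * (G - k)
        assert math.isclose(sigma_closed_form(k, G), pstdev(rewards), abs_tol=1e-12)

print(sigma_closed_form(4, 8))      # 0.5, the maximum: an even split
print(sigma_closed_form(8, 8))      # 0.0: a unanimous group
print(silent_group_rate(0.5, 8))    # 0.0078125
print(min_group_size(0.5, 0.99))    # 8
```

Under this model, prompts the policy almost always solves or almost always fails need far more samples before a group is likely to contain any disagreement, which is consistent with the comment's point that the group size and the discarded-group rate are tied to the same quantity.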