arxiv:2607.00152

GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

Published on Jun 30

Abstract

Three seemingly distinct training methods for language models are shown to be variations of a single approach based on standard deviation adjustment, with the disagreement among sampled answers determining learning effectiveness and update magnitude.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: the standard deviation of the rewards in a group, which measures how much a prompt's sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks is largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update. This is the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. This paper confirms the intuition on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
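The three operations described in the abstract can be sketched on a single group of right-or-wrong rewards. This is a minimal illustration, not the paper's code: the function name `group_advantages`, the `method` strings, and the `eps` stabilizer are assumptions made here, and the population standard deviation is used (some implementations use the sample standard deviation instead).

```python
from statistics import mean, pstdev

def group_advantages(rewards, method="grpo", eps=1e-6):
    """Per-sample advantages for one prompt's group of rewards.

    Hypothetical helper for illustration only. Returns None when the
    group is discarded.
    """
    rewards = [float(r) for r in rewards]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    centered = [r - mu for r in rewards]

    if method == "grpo":
        # GRPO: divide the centered rewards by the group std.
        return [c / (sigma + eps) for c in centered]
    if method == "dr_grpo":
        # Dr. GRPO: drop the division, keep only the centering.
        return centered
    if method == "dapo":
        # DAPO-style filtering: discard groups where the std is zero,
        # i.e. every answer is right or every answer is wrong.
        if sigma == 0.0:
            return None
        return [c / (sigma + eps) for c in centered]
    raise ValueError(f"unknown method: {method}")

split = [1, 1, 0, 0]      # evenly split group, std = 0.5
unanimous = [1, 1, 1, 1]  # unanimous group, std = 0

print(group_advantages(split, "dr_grpo"))    # [0.5, 0.5, -0.5, -0.5]
print(group_advantages(unanimous, "grpo"))   # all zeros: nothing to learn
print(group_advantages(unanimous, "dapo"))   # None: group discarded
```

In this sketch the unanimous group contributes zero advantage under GRPO and Dr. GRPO and is dropped outright by the DAPO-style filter, which is the "silent group" behavior the abstract describes.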

Community

bay-yearick-lab

Paper submitter about 19 hours ago

Three of the hottest RL methods for reasoning LLMs are secretly the same trick.

The shared knob is the group's reward standard deviation. For right/wrong rewards it isn't a normalizer sitting next to the gradient; it IS the gradient's size: sigma = sqrt(k(G-k))/G, where k of the G sampled answers are correct. A group split evenly between right and wrong teaches the most; a unanimous group teaches nothing and goes silent. GRPO divides by sigma, Dr. GRPO drops the division, DAPO throws away the groups where it's zero. Out of the identity fall closed forms for the two knobs you actually set: how many samples per prompt you need, and the silent-group rate DAPO discards. Verified on Big-Math and a live GRPO run.
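As a quick check, the formula sigma = sqrt(k(G-k))/G can be compared against a direct standard-deviation computation on 0/1 rewards. The sketch below also computes the chance that a group is unanimous, assuming each of the G answers is independently correct with probability p. That independence model and the helper names `binary_group_std` and `silent_group_prob` are assumptions made here for illustration, and the expression is not necessarily the closed form the paper derives.

```python
from math import sqrt
from statistics import pstdev

def binary_group_std(k, G):
    """Population std of G binary rewards of which k equal 1."""
    return sqrt(k * (G - k)) / G

def silent_group_prob(p, G):
    """Probability that all G answers agree (all right or all wrong),
    assuming independent answers with per-answer success rate p."""
    return p ** G + (1 - p) ** G

G = 8
for k in range(G + 1):
    rewards = [1.0] * k + [0.0] * (G - k)
    # The closed form matches the directly computed population std.
    assert abs(pstdev(rewards) - binary_group_std(k, G)) < 1e-9

print(binary_group_std(4, 8))     # 0.5, the maximum, at an even split
print(binary_group_std(0, 8))     # 0.0, a unanimous (silent) group
print(silent_group_prob(0.5, 8))  # 0.0078125
```

Under this independence assumption, prompts with p near 0 or 1 produce silent groups most of the time, while prompts near p = 0.5 almost never do.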
