arxiv:2602.09000

iGRPO: Self-Feedback-Driven LLM Reasoning

Published on Feb 9 · Submitted by Ali on Feb 11
AI-generated summary

Iterative Group Relative Policy Optimization enhances mathematical reasoning in large language models through a two-stage process that combines exploratory drafting with draft-conditioned refinement, achieving state-of-the-art results on competition math benchmarks.

Abstract

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.

Community

Paper submitter

Let's discuss Self-Feedback for RL Reasoning (iGRPO)

Motivation.
Current RL methods for reasoning (GRPO, DAPO, etc.) treat each generation as a one-shot attempt. The model samples, gets a reward, updates, and moves on. But humans almost never solve hard problems in one pass. We draft, re-read, spot mistakes, and refine. Existing RL pipelines don't capture this loop. Some recent methods try to close the gap with critique generation or self-verification, but these ask the model to learn auxiliary behaviors (writing critiques, producing verification rationales) that are only indirectly tied to the actual outcome reward. We wanted something simpler: what if the model's own best attempt is the feedback, and we just train it to beat that attempt?

What we built.
iGRPO is a two-stage extension of GRPO that adds self-conditioning through the model's own drafts.

  • Stage 1 (Exploratory Draft Generation): Sample multiple candidate solutions from the current policy. Score them with the same scalar reward you're already using. Pick the best one.
  • Stage 2 (Conditioned Refinement): Append that best draft to the original prompt and sample a new group of completions. Apply the standard GRPO-style clipped surrogate update only on these Stage 2 outputs.
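For reference, the Stage 2 update uses the group-relative clipped surrogate in the form GRPO is usually written in (standard notation; the paper may differ in details such as token-level normalization or whether a KL penalty is used). The only change in iGRPO is that the conditioning context is the original prompt $q$ concatenated with the best Stage 1 draft $d^{*}$ rather than $q$ alone:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}, \qquad r_{i,t}(\theta) = \frac{\pi_\theta\big(o_{i,t} \mid q \oplus d^{*}, o_{i,<t}\big)}{\pi_{\theta_{\mathrm{old}}}\big(o_{i,t} \mid q \oplus d^{*}, o_{i,<t}\big)},$$

$$\mathcal{J}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_i,\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

where $\{o_1, \dots, o_G\}$ are the Stage 2 refinements sampled from $\pi_{\theta_{\mathrm{old}}}(\cdot \mid q \oplus d^{*})$ and $R_i$ is the scalar reward of refinement $i$.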

[Figure: iGRPO two-stage training diagram, from the paper]

No critic networks, no reward models, no verification rationales, no generated critiques. The best draft is the only feedback signal, and it comes for free from Stage 1 exploration.
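To make the loop concrete, here is a minimal Python sketch of one iGRPO step as we read the method. The callables `sample`, `reward`, and `grpo_update`, the prompt template, and the draft/group counts are placeholders made up for illustration, not the paper's implementation.

```python
from typing import Callable, List

def igrpo_step(
    sample: Callable[[str], str],        # draws one completion from the current policy
    reward: Callable[[str], float],      # scalar outcome reward (e.g. 1.0 if the answer checks out)
    grpo_update: Callable[[str, List[str], List[float]], None],  # group-relative clipped update
    prompt: str,
    num_drafts: int = 4,
    group_size: int = 8,
) -> None:
    """One iGRPO training step (illustrative sketch, not the authors' code)."""
    # Stage 1: exploratory draft generation. Sample candidates from the current
    # policy, score them with the same scalar reward used for optimization,
    # and keep the highest-reward draft.
    drafts = [sample(prompt) for _ in range(num_drafts)]
    best_draft = max(drafts, key=reward)

    # Stage 2: conditioned refinement. Append the best draft to the original
    # prompt and sample a fresh group of completions.
    conditioned = (
        f"{prompt}\n\nPrevious best attempt:\n{best_draft}\n\n"
        "Improve on this attempt and give a final solution."
    )
    refinements = [sample(conditioned) for _ in range(group_size)]
    rewards = [reward(o) for o in refinements]

    # The GRPO-style clipped surrogate update is applied only to these
    # Stage 2 outputs; Stage 1 drafts are used purely for selection.
    grpo_update(conditioned, refinements, rewards)
```

Under the matched-budget comparisons reported below, the Stage 1 drafts count against the same total per-prompt completion budget as the Stage 2 group.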

The important part: as the policy improves across training, its Stage 1 drafts get stronger, so Stage 2 sees better conditioning, so the policy improves even more. We formally prove this monotonic improvement property under binary rewards: the expected quality of the selected draft increases as the policy's success probability increases. The model doesn't learn to copy the draft. It learns a refinement function that compounds across training.
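To see the intuition behind the monotonicity claim in the simplest setting (binary reward, $G$ i.i.d. Stage 1 drafts each correct with probability $p$; the paper's formal statement may be more general):

$$\mathbb{E}\Big[\max_{1 \le i \le G} R_i\Big] = \Pr(\text{at least one draft is correct}) = 1 - (1 - p)^{G},$$

which is strictly increasing in $p$ and never below $p$ itself, so as the policy improves, the draft that Stage 2 conditions on improves at least as fast.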

How it differs from critique/verification approaches.
Methods like Self-Verification and Critique-GRPO require the model to produce extra text (verification steps, natural-language critiques) and then condition on that. This means the model has to allocate capacity to an auxiliary skill that isn't directly optimized by the outcome reward. iGRPO sidesteps this entirely. The conditioning signal is a full solution attempt scored by the same reward used for optimization. There's no ambiguity about what "good feedback" looks like, because it's literally the model's highest-reward output.

Key results.

Controlled comparisons (matched rollout budgets, same total completions per prompt):

  • Nemotron-H-8B-Base-8K: iGRPO reaches 45.04% average vs. 41.08% for GRPO (+3.96), and beats Self-Verification (42.86%) and Critique-GRPO (43.39%).
  • DeepSeek-R1-Distill-Qwen-7B: iGRPO at 69.87% vs. GRPO at 68.29%, with gains concentrated on multi-step benchmarks like AIME24 (56.30%) and AMC (95.00%).
  • OpenMath-Nemotron-7B: Even with an already strong 74.83% base, iGRPO pushes to 76.07%.
  • At 14B scale, gains persist: DeepSeek-R1-Distill-Qwen-14B goes from 71.29% (GRPO) to 73.02% (iGRPO); OpenMath-Nemotron-14B from 76.73% to 78.00%.

Stronger base + harder data:

  • OpenReasoning-Nemotron-7B trained on AceReason-Math with iGRPO achieves 85.62% on AIME24 and 79.64% on AIME25, with transfer gains on GPQA (+1.84) and MMLU-Pro (+0.91).

The refinement wrapper generalizes beyond GRPO:

  • Applying the same two-stage mechanism to DAPO and GSPO yields +1.19 and +1.11 average improvements respectively, under matched budgets. The gains come from the refinement interface, not GRPO-specific details.

Richer rewards help:

  • Swapping the binary outcome checker for a GPT-5 generative judge improves the average from 69.87% to 70.81% (+0.94), with the largest lifts on AIME24/25 and Minerva, consistent with partial credit keeping near-miss traces alive through Stage 1 selection.
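A toy illustration of that last point (drafts and scores below are made up): when every Stage 1 draft misses the final answer, a binary checker scores them all 0 and the selection step carries no signal, whereas a graded judge can still promote the near-miss trace into the Stage 2 conditioning context.

```python
drafts = ["wrong setup", "right method, arithmetic slip at the end", "off-topic"]

binary_scores = [0.0, 0.0, 0.0]   # exact-match outcome reward: every draft counts as wrong
judge_scores = [0.2, 0.8, 0.1]    # hypothetical graded judge scores in [0, 1]

# Binary reward: max() ties at 0, so the "best" draft is an arbitrary pick.
best_binary = drafts[binary_scores.index(max(binary_scores))]   # "wrong setup"

# Graded reward: the near-miss survives Stage 1 selection.
best_graded = drafts[judge_scores.index(max(judge_scores))]     # the arithmetic-slip draft

print(best_binary)
print(best_graded)
```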

Learning dynamics:

  • iGRPO delays premature entropy collapse. Both methods start at ~2.45 nats, but GRPO drops to 0.60 by 10% of training while iGRPO decays more gradually (0.80 at 15%). This sustained mid-training exploration lets the model recover from near-miss reasoning traces before converging.

Overhead:

  • Peak memory is essentially identical (~54.93 GB for both). Throughput drops from 0.41 to 0.34 samples/sec. Total training time increases by ~13% (83.3 → 94.1 GPU hours). No extra GPUs, no extra memory.

In short, iGRPO adds a self-feedback refinement loop to group-based RL that uses the model's own best draft as conditioning. It's simple, adds minimal overhead, generalizes across optimizers, and consistently improves reasoning across model families and scales.
