arxiv:2602.03143

Self-Hinting Language Models Enhance Reinforcement Learning

Published on Feb 3 · Submitted by Baohao Liao on Feb 5
AI-generated summary

SAGE is an on-policy reinforcement learning framework that enhances GRPO by injecting self-hints during training to increase outcome diversity under sparse rewards, improving alignment of large language models.

Abstract

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt x, the model samples a compact hint h (e.g., a plan or decomposition) and then generates a solution τ conditioned on (x, h). Crucially, the task reward R(x, τ) is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set h = ∅ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
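The advantage-collapse mechanism described above is easy to see numerically. Below is a minimal sketch (not taken from the paper's code) of group-relative standardization: when every rollout in a group earns the same sparse terminal reward, the standardized advantages are exactly zero and the policy gradient vanishes; once hints diversify outcomes, the advantages become nonzero again.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style):
    A_i = (r_i - mean) / (std + eps)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse terminal reward, all rollouts fail: advantages collapse to zero.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # -> [0.0, 0.0, 0.0, 0.0]

# Hints diversify outcomes, some rollouts now succeed: nonzero advantages.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ≈ [1, -1, 1, -1]
```

Note that the verifier reward itself is untouched; only the distribution of rewards within the group changes, which is what keeps the standardized advantages from degenerating.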

Community

RL for LLMs often stalls under sparse rewards — especially with GRPO, where whole rollout groups get identical 0 rewards and learning just… dies.

💡 SAGE fixes this with a simple but powerful idea:
👉 Let the model give itself hints during training.

How it works:

  • The model samples a compact hint (plan / decomposition) before solving
  • Rewards stay unchanged (same verifier, same objective)
  • Hints only reshape sampling, preventing advantage collapse
  • At test time? No hints at all. Clean deployment.
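The steps above can be sketched end to end with toy stand-in functions (hypothetical, not the paper's implementation): each rollout samples its own hint, the solution is conditioned on (prompt, hint), and the verifier reward never looks at the hint.

```python
def sample_hint(i):
    """Toy stand-in for the model sampling a compact self-hint."""
    return "decompose" if i % 2 == 0 else "direct"

def generate(prompt, hint):
    """Toy policy: only the 'decompose' hint leads to a correct solution."""
    return "correct" if hint == "decompose" else "wrong"

def verifier(prompt, solution):
    """Terminal verifier reward R(x, tau): independent of the hint."""
    return 1.0 if solution == "correct" else 0.0

# One rollout group of size 4: hints reshape sampling, not the reward.
rewards = []
for i in range(4):
    h = sample_hint(i)
    tau = generate("x", h)
    rewards.append(verifier("x", tau))
print(rewards)  # -> [1.0, 0.0, 1.0, 0.0]: within-group diversity, nonzero advantages

# At test time, deploy without any hint: generate("x", hint=None)
```

The key design choice this illustrates: because the verifier ignores the hint, the objective being optimized is identical to plain GRPO; hints only change which trajectories get sampled during training.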

🔥 Why it matters:

  • Turns dead-end prompts into useful learning signals
  • Acts as an adaptive curriculum driven by the model itself
  • Stays fully on-policy (no external teachers required)

📊 Results across 6 benchmarks & 3 LLMs (average gains over GRPO):

  • +2.0 on Llama-3.2-3B
  • +1.2 on Qwen2.5-7B
  • +1.3 on Qwen3-4B

Sometimes the best teacher is… yourself 😌

Code: https://github.com/BaohaoLiao/SAGE

