Abstract
Experience replay for LLM post-training can balance staleness-induced variance, sample diversity, and computational cost while maintaining performance and policy entropy.
While Experience Replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance, and in some cases even improving it, while preserving policy entropy.
Community
Experience replay can cut LLM RL training compute by up to ~40%, without hurting final accuracy and sometimes even improving it.
Experience replay (reusing past rollouts) is a staple of classical RL, but it is still underexplored in LLM post-training, where the default is to stay as on-policy as possible.
In modern LLM RL pipelines, rollout generation can be >80% of total GPU time. Reusing rollouts even a little can save a lot of compute.
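A back-of-envelope calculation shows why even modest reuse pays off (the numbers below are illustrative, not measurements from the paper): if generation takes fraction g of total GPU time and each rollout is reused r times on average, the generation cost shrinks by a factor of r.

```python
def compute_savings(gen_fraction: float, reuse_ratio: float) -> float:
    """Fraction of total compute saved when each rollout is reused
    `reuse_ratio` times, assuming generation is `gen_fraction` of the total."""
    return gen_fraction * (1.0 - 1.0 / reuse_ratio)

# e.g. with generation at 80% of GPU time and each rollout used twice:
print(f"{compute_savings(0.8, 2.0):.0%}")  # prints "40%"
```

This simple model ignores any effect of staleness on learning speed; it only shows how quickly the dominant generation cost amortizes.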
We studied a minimal, easy-to-drop-in replay buffer for async RL:
- inference workers continuously push trajectories into a FIFO buffer
- trainers sample uniformly from the buffer (sampling doesn't remove items)
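The design above can be sketched in a few lines. This is an illustrative sketch, not the paper's code; names and details are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer: oldest trajectories are evicted when
    the buffer is full, and sampling does not remove items, so a
    trajectory can be reused many times before it ages out."""

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)  # deque drops the oldest item when full

    def push(self, trajectory) -> None:
        """Called by inference workers as rollouts finish."""
        self._buf.append(trajectory)

    def sample(self, batch_size: int) -> list:
        """Called by trainers: uniform sampling with replacement over the
        current contents; items stay in the buffer afterwards."""
        return random.choices(list(self._buf), k=batch_size)
```

In a real async setup, pushes and samples come from different processes, so the buffer would need a lock or a queue-backed store; the sketch keeps only the FIFO-plus-uniform-sampling logic.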
Main result: replay can slightly hurt performance per gradient step, but improves performance per unit of compute.
On MATH with Qwen2.5-7B, a well-chosen buffer reaches the same accuracy with up to ~40% less compute.
We also see a “slow-but-stable” effect: larger buffers learn more slowly, but training becomes more stable and can sometimes reach higher peak accuracy.
Replay can also help preserve output diversity → better pass@k for k>1.
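For context, pass@k is usually computed with the standard unbiased estimator (from n sampled completions with c correct); a more diverse policy can raise it for k>1 even at the same pass@1:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate from n samples with c correct:
    1 - C(n-c, k) / C(n, k), the chance at least one of k draws is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```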
Intuition: replay changes the effective training distribution. Mixing in older samples makes it more diverse than the purely on-policy distribution, which helps stabilize training.
We also explored extensions beyond uniform replay:
- alternative losses beyond GRPO
- alternative sampling (e.g., biasing toward positive/correct trajectories)
Early results look promising.
Theory: SGD with replay can converge faster as a function of compute by optimizing the trade-off between:
- expensive rollout generation
- staleness-induced variance
- sample correlations / diversity
It connects practical knobs (buffer size and replay ratio) to those costs.
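The trade-off can be illustrated with a toy cost model (all numbers and the linear staleness penalty below are my assumptions, not the paper's theory): amortizing expensive generation over more reuses lowers the cost per step, but staler samples add variance and require more steps, so compute-to-target is minimized at an interior replay ratio.

```python
def compute_to_target(r: float, c_gen: float = 4.0, c_train: float = 1.0,
                      base_steps: float = 1000.0,
                      staleness_penalty: float = 0.1) -> float:
    """Toy model: total compute to reach a fixed target accuracy
    when each rollout is reused r times."""
    # Staleness-induced variance: more steps needed as r grows.
    steps = base_steps * (1.0 + staleness_penalty * (r - 1.0))
    # Generation cost amortized over r reuses, plus fixed training cost.
    cost_per_step = c_gen / r + c_train
    return steps * cost_per_step

# The optimum sits strictly between "fully on-policy" (r=1) and heavy reuse:
best_r = min(range(1, 9), key=compute_to_target)
```

Under these made-up constants the minimum lands at an intermediate replay ratio, cheaper than both r=1 and aggressive reuse, which is the qualitative shape of the claimed trade-off.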
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts (2026)
- Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning (2026)
- Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs (2026)
- LLMs Can Learn to Reason Via Off-Policy RL (2026)
- GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control (2026)
- QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch (2026)
- GIPO: Gaussian Importance Sampling Policy Optimization (2026)