One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
Abstract
Reinforcement learning with carefully designed single training samples can significantly enhance the reasoning abilities of large language models across multiple disciplines, outperforming traditional approaches that rely on large datasets.
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing RL attempts on LLMs usually rely on thousands of high-quality samples or more. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing a single training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample can produce significant RL performance improvements across multiple domains, including physics, chemistry, and biology; (2) the math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training with individual, naturally occurring samples. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, which we dub sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
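The abstract does not specify the training algorithm, so the following is only a hedged illustration of what one-sample RL can look like in practice: a GRPO-style step that samples the single training prompt many times, scores each completion with a verifiable reward, and normalizes rewards within the group to get advantages for a policy-gradient update. All names (`sample_fn`, `reward_fn`, `group_size`) and the toy reward are hypothetical, not taken from the paper.

```python
import random
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against the group mean/std (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def one_sample_rl_step(prompt, sample_fn, reward_fn, group_size=8):
    """One RL step over a SINGLE training prompt: draw `group_size`
    completions, score them, and return (completion, advantage) pairs
    that a policy-gradient update would consume."""
    completions = [sample_fn(prompt) for _ in range(group_size)]
    rewards = [reward_fn(prompt, c) for c in completions]
    return list(zip(completions, group_relative_advantages(rewards)))

# Toy usage: a fake "model" and a verifiable reward (correct answer "4").
random.seed(0)
fake_sample = lambda p: random.choice(["3", "4", "5"])
fake_reward = lambda p, c: 1.0 if c == "4" else 0.0
pairs = one_sample_rl_step("What is 2 + 2?", fake_sample, fake_reward)
```

Note the design point this sketch makes concrete: with only one prompt, the gradient signal comes entirely from the within-group spread of rewards, which is why the choice of that single sample matters so much.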
Community
This work discusses the potential of lifting broader reasoning ability by learning from one high-quality sample. In polymath learning, sample quality can be assessed through the lens of salient math skills and categories. The model trained on the polymath sample outperforms one trained on a dataset a thousand times larger on multidisciplinary reasoning tasks, indicating that deliberate selection and synthesis of training samples can unlock reasoning capabilities more efficiently than simply scaling data volume.
The following related papers were recommended by the Semantic Scholar API:
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs (2025)
- Tailored Primitive Initialization is the Secret Key to Reinforcement Learning (2025)
- AIR: Post-training Data Selection for Reasoning via Attention Head Influence (2025)
- Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning (2026)
- OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe (2025)
- Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability (2025)