pipeline_tag: text-generation
library_name: transformers
license: cc-by-nc-4.0
tags:
- agentic-reasoning
- tool-use
- LLM
- Qwen
Demystifying Reinforcement Learning in Agentic Reasoning
Paper: Demystifying Reinforcement Learning in Agentic Reasoning Project Page: Open-AgentRL Collection
π― About This Repository
This repository contains the DemyAgent-4B model weights, a 4B-sized agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained using our GRPO-TCR recipe with 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.
π Introduction
In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:
- π― Data Quality Matters: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
- β‘ Training Efficiency: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
- π§ Reasoning Strategy: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.
π¦ Resources
| Type | Name | Link |
|---|---|---|
| π Dataset | 3K Agentic SFT Data | π€ HuggingFace |
| π Dataset | 30K Agentic RL Data | π€ HuggingFace |
| π€ Model | Qwen2.5-7B-RA-SFT | π€ HuggingFace |
| π€ Model | Qwen3-4B-RA-SFT | π€ HuggingFace |
| π€ Model | DemyAgent-4B | π€ HuggingFace |
Note:
- Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 using our 3K Agentic SFT Data
- DemyAgent-4B is trained through Agentic RL with our 30K Agentic RL data using the GRPO-TCR recipe
π Performance
We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.
Benchmark Results
| MATH | Science | Code | ||
|---|---|---|---|---|
| Method | AIME2024 | AIME2025 | GPQA-Diamond | LiveCodeBench-v6 |
| Self-Contained Reasoning | ||||
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| Agentic Reasoning | ||||
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| DemyAgent-4B (Ours) | 72.6 | 70.0 | 58.5 | 26.8 |
Key Highlights
β¨ Despite having only 4B parameters, DemyAgent-4B achieves:
- π₯ State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
- π₯ Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
- π Competitive performance against 14B-32B models with 4-8Γ fewer parameters
- π‘ Superior efficiency compared to long-CoT models through deliberative tool use
π Citation
@article{yu2025demystify,
title={Demystifying Reinforcement Learning in Agentic Reasoning},
author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
journal={arXiv preprint arXiv:2510.11701},
year={2025}
}
π Acknowledgements
This work aims to explore more efficient paradigms for Agentic RL. Our implementation builds upon the excellent codebases of VeRL and ReTool. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.