---
pipeline_tag: text-generation
library_name: transformers
license: cc-by-nc-4.0
tags:
  - agentic-reasoning
  - tool-use
  - LLM
  - Qwen
---

# Demystifying Reinforcement Learning in Agentic Reasoning

**Paper:** [Demystifying Reinforcement Learning in Agentic Reasoning](https://arxiv.org/abs/2510.11701)
**Project Page:** Open-AgentRL Collection

## 🎯 About This Repository

This repository contains the model weights of DemyAgent-4B, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained with our GRPO-TCR recipe on 30K high-quality agentic RL samples, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.
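A minimal inference sketch with 🤗 Transformers is shown below. The repo id is a placeholder (substitute the actual DemyAgent-4B path on the Hub), and the question and generation settings are illustrative, not the evaluation setup used in the paper:

```python
# Minimal inference sketch for DemyAgent-4B.
# Assumption: MODEL_ID is a placeholder; replace it with the real Hub repo id.
MODEL_ID = "DemyAgent-4B"


def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the chat format expected by the chat template."""
    return [{"role": "user", "content": question}]


def generate(question: str, model_id: str = MODEL_ID, max_new_tokens: int = 512) -> str:
    """Load the model and generate an answer for a single question."""
    # Imported here so the module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    # Apply the model's chat template and tokenize in one step.
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is the sum of the first 100 positive integers?"))
```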

## 🌟 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

- 🎯 **Data Quality Matters:** Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
- ⚡ **Training Efficiency:** Exploration-friendly techniques such as reward clipping and entropy maintenance boost training efficiency
- 🧠 **Reasoning Strategy:** Deliberative reasoning with selective tool calls surpasses frequent tool invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

## 📦 Resources

| Type | Name | Link |
|------|------|------|
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤖 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤖 Model | DemyAgent-4B | 🤗 HuggingFace |

Note:

- Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507, respectively, using our 3K Agentic SFT Data
- DemyAgent-4B is trained through agentic RL on our 30K Agentic RL Data using the GRPO-TCR recipe

πŸ† Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.

### Benchmark Results

AIME2024 and AIME2025 cover math, GPQA-Diamond covers science, and LiveCodeBench-v6 covers code generation.

| Method | AIME2024 | AIME2025 | GPQA-Diamond | LiveCodeBench-v6 |
|--------|----------|----------|--------------|------------------|
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| **DemyAgent-4B (Ours)** | 72.6 | 70.0 | 58.5 | 26.8 |

### Key Highlights

✨ Despite having only 4B parameters, DemyAgent-4B achieves:

- 🥇 State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
- 🥈 Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
- 🚀 Competitive performance against 14B–32B models with 4–8× fewer parameters
- 💡 Superior efficiency compared to long-CoT models through deliberative tool use

πŸ“ Citation

```bibtex
@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}
```

πŸ™ Acknowledgements

This work aims to explore more efficient paradigms for Agentic RL. Our implementation builds upon the excellent codebases of VeRL and ReTool. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.