DemyAgent-4B / README.md

nielsr HF Staff

Improve model card: Add metadata, update paper link and add project page link

a30b6e9 verified 6 months ago

6.76 kB

pipeline_tag: text-generation
library_name: transformers
license: cc-by-nc-4.0
tags:
  - agentic-reasoning
  - tool-use
  - LLM
  - Qwen

Demystifying Reinforcement Learning in Agentic Reasoning

Paper: Demystifying Reinforcement Learning in Agentic Reasoning Project Page: Open-AgentRL Collection

🎯 About This Repository

This repository contains the DemyAgent-4B model weights, a 4B-sized agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained using our GRPO-TCR recipe with 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.

🌟 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

🎯 Data Quality Matters: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
⚡ Training Efficiency: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
🧠 Reasoning Strategy: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

📦 Resources

Type	Name	Link
📊 Dataset	3K Agentic SFT Data	🤗 HuggingFace
📊 Dataset	30K Agentic RL Data	🤗 HuggingFace
🤖 Model	Qwen2.5-7B-RA-SFT	🤗 HuggingFace
🤖 Model	Qwen3-4B-RA-SFT	🤗 HuggingFace
🤖 Model	DemyAgent-4B	🤗 HuggingFace

Note:

Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 using our 3K Agentic SFT Data

DemyAgent-4B is trained through Agentic RL with our 30K Agentic RL data using the GRPO-TCR recipe

🏆 Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.

Benchmark Results

	MATH		Science	Code
Method	AIME2024	AIME2025	GPQA-Diamond	LiveCodeBench-v6
Self-Contained Reasoning
Qwen2.5-7B-Instruct	16.7	10.0	31.3	15.2
Qwen3-4B-Instruct-2507	63.3	47.4	52.0	35.1
Qwen2.5-72B-Instruct	18.9	15.0	49.0	-
DeepSeek-V3	39.2	28.8	59.1	16.1
DeepSeek-R1-Distill-32B	70.0	46.7	59.6	-
DeepSeek-R1-Zero (671B)	71.0	53.5	59.6	-
Agentic Reasoning
Qwen2.5-7B-Instruct	4.8	5.6	25.5	12.2
Qwen3-4B-Instruct-2507	17.9	16.3	44.3	23.0
ToRL-7B	43.3	30.0	-	-
ReTool-32B	72.5	54.3	-	-
Tool-Star-3B	20.0	16.7	-	-
ARPO-7B	30.0	30.0	53.0	18.3
rStar2-Agent-14B	80.6	69.8	60.9	-
DemyAgent-4B (Ours)	72.6	70.0	58.5	26.8

Key Highlights

✨ Despite having only 4B parameters, DemyAgent-4B achieves:

🥇 State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
🥈 Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
🚀 Competitive performance against 14B-32B models with 4-8× fewer parameters
💡 Superior efficiency compared to long-CoT models through deliberative tool use

📝 Citation

@article{yu2025demystify,
  title={Demystifying Reinforcement Learning in Agentic Reasoning},
  author={Yu, Zhaochen and Yang, Ling and Zou, Jiaru and Yan, Shuicheng and Wang, Mengdi},
  journal={arXiv preprint arXiv:2510.11701},
  year={2025}
}

🙏 Acknowledgements

This work aims to explore more efficient paradigms for Agentic RL. Our implementation builds upon the excellent codebases of VeRL and ReTool. We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.