# AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
## Model Details
### Model Description
AgentFlow is a **trainable, in-the-flow agentic framework** that coordinates four specialized modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. This system addresses the limitations of prevailing tool-augmented approaches that often scale poorly with long horizons and diverse tools, and generalize weakly to new scenarios.
To enable effective planning and tool use, AgentFlow introduces **Flow-based Group Refined Policy Optimization (Flow-GRPO)**, a novel algorithm that tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. The model leverages a **Qwen-2.5-7B-Instruct backbone**.
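The reward broadcasting and group normalization described above can be sketched in a few lines. This is an illustrative sketch, not the released training code; the function name and data layout are assumptions.

```python
import statistics

def flow_grpo_advantages(trajectory_rewards, turns_per_trajectory):
    """Sketch of the Flow-GRPO advantage idea: broadcast each trajectory-level
    outcome reward to every turn of that trajectory, then normalize across the
    sampled group. Names and interfaces are illustrative."""
    mean = statistics.mean(trajectory_rewards)
    std = statistics.pstdev(trajectory_rewards) or 1.0  # guard against zero std
    advantages = []
    for reward, n_turns in zip(trajectory_rewards, turns_per_trajectory):
        adv = (reward - mean) / std
        advantages.append([adv] * n_turns)  # same advantage at every turn
    return advantages

# a group of 4 rollouts with binary outcome rewards and varying turn counts
advs = flow_grpo_advantages([1.0, 0.0, 1.0, 0.0], [3, 2, 4, 3])
```

Because every turn of a trajectory shares one verifiable outcome signal, each local planner decision is credited (or penalized) in proportion to global success, which is what makes the per-turn updates tractable.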
- **Developed by:** Zhuofeng Li, Haoxiang Zhang, Pan Lu, and others.
- **Model type:** Large Language Model with Agentic Capabilities
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** Qwen/Qwen2.5-7B-Instruct
### Model Sources
- **Repository:** https://github.com/lupantech/AgentFlow
- **Paper:** https://huggingface.co/papers/2510.05592
- **Project Page:** https://agentflow.stanford.edu/
- **Demo:** https://huggingface.co/spaces/AgentFlow/agentflow
## Uses
### Direct Use
AgentFlow is intended for researchers and developers working on advanced AI agents and large language models that require dynamic planning and effective utilization of external tools. It is particularly suitable for:
* Complex reasoning tasks that demand multi-turn interaction and robust credit assignment.
* Developing systems capable of autonomous skill discovery and practice in live environments.
* Benchmarking and advancing the state-of-the-art in agentic LLM research.
### Out-of-Scope Use
The model is not intended for:
* Deployment in high-stakes, safety-critical applications without extensive additional fine-tuning, validation, and human oversight.
* Generating content that is harmful, unethical, or violates privacy.
* Tasks outside the scope of text-based reasoning and tool use without further adaptation or integration with other modalities.
## Bias, Risks, and Limitations
AgentFlow, like other large language models, may exhibit biases present in its training data or the tools it integrates. Potential risks and limitations include:
* **Hallucination:** The model might generate factually incorrect or nonsensical outputs, especially in complex scenarios or when tool outputs are ambiguous.
* **Tool Misuse/Over-reliance:** Incorrectly invoking tools, misinterpreting tool outputs, or failing to identify appropriate tools for a given task.
* **Generalization Gaps:** While designed for generalization, performance might degrade on tasks significantly different from its training distribution.
* **Long-horizon Challenges:** Although designed to address long horizons, extremely long and complex tasks may still pose challenges for effective planning and execution.
* **API Key Dependency:** The system's functionality heavily relies on external API keys (e.g., Google, OpenAI, DashScope), which might incur costs or introduce external dependencies.
### Recommendations
Users of AgentFlow should:
* Be aware of the potential for biases and hallucinations inherited from the underlying LLM and training data.
* Carefully validate outputs, especially for critical applications.
* Thoroughly test the system's behavior in specific deployment contexts.
* Refer to the [AgentFlow GitHub repository](https://github.com/lupantech/AgentFlow) for detailed setup, configuration, and best practices to mitigate risks.
## How to Get Started with the Model
AgentFlow provides a modular agentic system with **four specialized modules** (planner, executor, verifier, generator) that coordinate through **evolving memory** and a **toolkit** over **multiple turns** to solve complex reasoning tasks.
To quickly experience the system in action, follow the installation and environment setup instructions on the [AgentFlow GitHub repository](https://github.com/lupantech/AgentFlow). Once your environment variables and API keys are configured, you can use the following Python code snippet for inference:
```python
# Import the solver
from agentflow.agentflow.solver import construct_solver
# Set the LLM engine name (e.g., "dashscope" or "together")
llm_engine_name = "dashscope"
# Construct the solver
solver = construct_solver(llm_engine_name=llm_engine_name)
# Solve the user query
output = solver.solve("What is the capital of France?")
print(output["direct_output"])
```
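Under the hood, the four modules coordinate over multiple turns through a shared, evolving memory. The following sketch shows the general shape of that loop; all module interfaces and names here are hypothetical placeholders, not the actual AgentFlow API.

```python
def run_agentflow_loop(query, planner, executor, verifier, generator,
                       toolkit, max_turns=5):
    """Illustrative multi-turn loop: the planner picks an action, the executor
    invokes a tool, the verifier checks progress, and the memory grows each
    turn. Every interface here is a hypothetical sketch."""
    memory = []  # evolving memory shared across turns
    for turn in range(max_turns):
        action = planner(query, memory, toolkit)   # choose a tool and subgoal
        result = executor(action, toolkit)         # invoke the chosen tool
        memory.append({"turn": turn, "action": action, "result": result})
        if verifier(query, memory):                # enough evidence to answer?
            break
    return generator(query, memory)                # compose the final answer

# toy usage with stub modules standing in for the trained components
answer = run_agentflow_loop(
    "What is the capital of France?",
    planner=lambda q, m, t: "search",
    executor=lambda a, t: "Paris",
    verifier=lambda q, m: True,
    generator=lambda q, m: m[-1]["result"],
    toolkit=["search"],
)
```

In the real system the planner is the trained Qwen2.5-7B-Instruct policy and the other modules are fixed; only the planner is optimized in the flow.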
## Training Details
### Training Data
AgentFlow is trained on a mixed dataset for diverse reasoning tasks:
* **NQ (Natural Questions)**: Used for agentic search tasks. (Link: [https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets))
* **DeepMath-103K**: Used for mathematical reasoning tasks. (Link: [https://huggingface.co/datasets/zwhe99/DeepMath-103K](https://huggingface.co/datasets/zwhe99/DeepMath-103K))
Detailed scripts for dataset preparation (`get_train_data.py`, `aime24_data.py`) are available in the [GitHub repository](https://github.com/lupantech/AgentFlow/tree/main/data).
### Training Procedure
AgentFlow employs **Flow-based Group Refined Policy Optimization (Flow-GRPO)**, which directly optimizes the planner agent within the multi-turn interaction loop in an online fashion. This method converts multi-turn optimization into a sequence of tractable single-turn policy updates.
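Once the trajectory-level reward has been broadcast to each turn, every turn can be treated as a single-turn policy update. A minimal PPO-style clipped surrogate over one turn might look like the following; this is a sketch for intuition, and the exact Flow-GRPO objective is given in the paper.

```python
import math

def single_turn_policy_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one turn's planner decision, using the
    trajectory-level advantage broadcast to this turn (illustrative sketch;
    names and the clipping constant are assumptions)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    # maximize the surrogate, so minimize its negative
    return -min(ratio * advantage, clipped * advantage)
```

With the policy unchanged (`logp_new == logp_old`) the ratio is 1 and the loss reduces to minus the advantage; the clipping bounds how far a single update can move the planner on any one turn.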
#### Training Hyperparameters
All training hyperparameters (model settings, tools, RL parameters, resources) are configurable via `train/config.yaml` in the GitHub repository.
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
AgentFlow was evaluated across ten benchmarks covering various domains:
* Search tasks
* Agentic reasoning tasks
* Mathematical tasks
* Scientific tasks
#### Metrics
The primary metric for evaluation is **accuracy**.
### Results
AgentFlow, utilizing a 7B-scale backbone (Qwen-2.5-7B-Instruct), demonstrates significant performance gains over top-performing baselines across multiple benchmarks:
- **+14.9%** average accuracy gain on search tasks.
- **+14.0%** average accuracy gain on agentic reasoning tasks.
- **+14.5%** average accuracy gain on mathematical tasks.
- **+4.1%** average accuracy gain on scientific tasks.
Notably, AgentFlow even surpassed larger proprietary models like GPT-4o on these benchmarks. Further analysis indicates improved planning, enhanced tool-calling reliability, and positive scaling trends with model size and reasoning turns.
For a more in-depth understanding of the evaluation protocols and detailed results, please refer to the [paper](https://huggingface.co/papers/2510.05592) and the [project page](https://agentflow.stanford.edu/).
## Acknowledgements
We thank the following open-source projects:
- [verl](https://github.com/volcengine/verl) for the excellent RL framework design.
- [vLLM](https://github.com/vllm-project/vllm) for fast LLM inference support.
- [verl-tool](https://github.com/TIGER-AI-Lab/verl-tool) and [agent-lightning](https://github.com/microsoft/agent-lightning) for their early-stage exploration of agentic RL training.
We thank [Lambda](https://lambda.ai/careers) for GPU support!
## Citation
```bibtex
@article{li2025intheflow,
  title   = {In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
  author  = {Li, Zhuofeng and Zhang, Haoxiang and Han, Seungju and Liu, Sheng and Xie, Jianwen and Zhang, Yu and Choi, Yejin and Zou, James and Lu, Pan},
  journal = {arXiv preprint arXiv:2510.05592},
  year    = {2025}
}
```