---
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
language: en
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
  - llm
  - agent
  - tool-use
  - planning
  - qwen2.5
  - reinforcement-learning
---

AgentFlow

AgentFlow: In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Model Details

Model Description

AgentFlow is a trainable, in-the-flow agentic framework that coordinates four specialized modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. This system addresses the limitations of prevailing tool-augmented approaches that often scale poorly with long horizons and diverse tools, and generalize weakly to new scenarios.

To enable effective planning and tool use, AgentFlow introduces Flow-based Group Refined Policy Optimization (Flow-GRPO), a novel algorithm that tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. The model leverages a Qwen-2.5-7B-Instruct backbone.
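
A rough sketch of the group-normalized, broadcast-advantage idea described above (function and variable names are our own illustration, not the authors' implementation):

```python
# Illustrative sketch of Flow-GRPO's core idea (names are our own, not the
# authors' implementation): broadcast one verifiable trajectory-level reward
# to every turn, normalized within a group of rollouts for the same query.
from statistics import mean, pstdev

def group_refined_advantages(group_rewards, turns_per_trajectory, eps=1e-6):
    """Compute a group-normalized advantage per trajectory and broadcast
    the same value to each of that trajectory's turns."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [
        [(reward - mu) / (sigma + eps)] * n_turns
        for reward, n_turns in zip(group_rewards, turns_per_trajectory)
    ]
```

Because every turn in a trajectory shares the same outcome-derived advantage, local planner decisions are credited (or penalized) according to global success, which is the credit-assignment mechanism the paragraph above describes.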

  • Developed by: Zhuofeng Li, Haoxiang Zhang, Pan Lu, and others.
  • Model type: Large Language Model with Agentic Capabilities
  • Language(s) (NLP): English
  • License: Apache-2.0
  • Finetuned from model: Qwen/Qwen2.5-7B-Instruct

Uses

Direct Use

AgentFlow is intended for researchers and developers working on advanced AI agents and large language models that require dynamic planning and effective utilization of external tools. It is particularly suitable for:

  • Complex reasoning tasks that demand multi-turn interaction and robust credit assignment.
  • Developing systems capable of autonomous skill discovery and practice in live environments.
  • Benchmarking and advancing the state-of-the-art in agentic LLM research.

Out-of-Scope Use

The model is not intended for:

  • Deployment in high-stakes, safety-critical applications without extensive additional fine-tuning, validation, and human oversight.
  • Generating content that is harmful, unethical, or violates privacy.
  • Tasks outside the scope of text-based reasoning and tool use without further adaptation or integration with other modalities.

Bias, Risks, and Limitations

AgentFlow, like other large language models, may exhibit biases present in its training data or the tools it integrates. Potential risks and limitations include:

  • Hallucination: The model might generate factually incorrect or nonsensical outputs, especially in complex scenarios or when tool outputs are ambiguous.
  • Tool Misuse/Over-reliance: Incorrectly invoking tools, misinterpreting tool outputs, or failing to identify appropriate tools for a given task.
  • Generalization Gaps: While designed for generalization, performance might degrade on tasks significantly different from its training distribution.
  • Long-horizon Challenges: Although designed to address long horizons, extremely long and complex tasks may still pose challenges for effective planning and execution.
  • API Key Dependency: The system's functionality heavily relies on external API keys (e.g., Google, OpenAI, DashScope), which might incur costs or introduce external dependencies.

Recommendations

Users of AgentFlow should:

  • Be aware of the potential for biases and hallucinations inherited from the underlying LLM and training data.
  • Carefully validate outputs, especially for critical applications.
  • Thoroughly test the system's behavior in specific deployment contexts.
  • Refer to the AgentFlow GitHub repository for detailed setup, configuration, and best practices to mitigate risks.

How to Get Started with the Model

AgentFlow provides a modular agentic system with four specialized modules (planner, executor, verifier, generator) that coordinate through evolving memory and a toolkit over multiple turns to solve complex reasoning tasks.

To quickly experience the system in action, follow the installation and environment setup instructions on the AgentFlow GitHub repository. Once your environment variables and API keys are configured, you can use the following Python code snippet for inference:

# Import the solver
from agentflow.agentflow.solver import construct_solver

# Set the LLM engine name (e.g., "dashscope" or "together")
llm_engine_name = "dashscope"

# Construct the solver
solver = construct_solver(llm_engine_name=llm_engine_name)

# Solve the user query
output = solver.solve("What is the capital of France?")
print(output["direct_output"])
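
The multi-turn coordination that this solver runs can be pictured with a minimal loop sketch; the module interfaces below are assumptions for illustration, not AgentFlow's actual API:

```python
# Illustrative sketch of the planner-executor-verifier-generator loop;
# module signatures and names are assumptions, not AgentFlow's actual API.
def agentic_loop(query, planner, executor, verifier, generator, max_turns=5):
    memory = []  # evolving memory shared across turns
    for _ in range(max_turns):
        action = planner(query, memory)       # choose the next tool/sub-goal
        observation = executor(action)        # invoke the selected tool
        memory.append((action, observation))  # grow the shared memory
        if verifier(query, memory):           # verifiable stopping check
            break
    return generator(query, memory)           # final answer from memory
```

The planner is the only module optimized by Flow-GRPO; the executor, verifier, and generator are coordinated through the evolving memory.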

Training Details

Training Data

AgentFlow is trained on a mixed dataset spanning diverse reasoning tasks.

Detailed scripts for dataset preparation (get_train_data.py, aime24_data.py) are available in the GitHub repository.

Training Procedure

AgentFlow employs Flow-based Group Refined Policy Optimization (Flow-GRPO), which directly optimizes the planner agent within the multi-turn interaction loop in an online fashion. This method converts multi-turn optimization into a sequence of tractable single-turn policy updates.
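
Conceptually, each multi-turn trajectory is decomposed into per-turn, single-turn update examples that all carry the same trajectory-level outcome. A hedged sketch (the data layout is assumed for illustration):

```python
# Hypothetical illustration of converting one multi-turn trajectory into
# independent single-turn policy-update examples (layout assumed, not the
# authors' implementation).
def flatten_trajectory(states, actions, trajectory_reward):
    """Each (state, action) pair becomes a single-turn update example that
    carries the shared, verifiable trajectory-level outcome."""
    return [
        {"state": s, "action": a, "reward": trajectory_reward}
        for s, a in zip(states, actions)
    ]
```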

Training Hyperparameters

All training hyperparameters (model settings, tools, RL parameters, resources) are configurable via train/config.yaml in the GitHub repository.

Evaluation

Testing Data, Factors & Metrics

Testing Data

AgentFlow was evaluated across ten benchmarks covering various domains:

  • Search tasks
  • Agentic reasoning tasks
  • Mathematical tasks
  • Scientific tasks

Metrics

The primary metric for evaluation is accuracy.

Results

AgentFlow, utilizing a 7B-scale backbone (Qwen-2.5-7B-Instruct), demonstrates significant performance gains over top-performing baselines across multiple benchmarks:

  • +14.9% average accuracy gain on search tasks.
  • +14.0% average accuracy gain on agentic reasoning tasks.
  • +14.5% average accuracy gain on mathematical tasks.
  • +4.1% average accuracy gain on scientific tasks.

Notably, AgentFlow even surpassed larger proprietary models like GPT-4o on these benchmarks. Further analysis indicates improved planning, enhanced tool-calling reliability, and positive scaling trends with model size and reasoning turns.

(Figures: main results tables 1 and 2, and tool-call analysis.)

For a more in-depth understanding of the evaluation protocols and detailed results, please refer to the paper and the project page.

Acknowledgements

We thank the following open-source projects:

  • verl for the excellent RL framework design.
  • VLLM for fast LLM inference support.
  • Ver-Tool and agent-lightning for their early-stage exploration in agentic RL training.

We thank Lambda for GPU support!

Citation

@article{li2025intheflow,
    title = {In-the-Flow Agentic System Optimization for Effective Planning and Tool Use},
    author = {Li, Zhuofeng and Zhang, Haoxiang and Han, Seungju and Liu, Sheng and Xie, Jianwen and Zhang, Yu and Choi, Yejin and Zou, James and Lu, Pan},
    journal = {arXiv preprint arXiv:2510.05592},
    year = {2025}
}