---
base_model:
  - Qwen/Qwen2.5-7B-Instruct
---

Reproduces the core idea of AgentFlow: extending single-step LLM inference into a multi-turn Planner → Executor → Verifier agent loop, applying RL signals (GRPO) to the Planner's generation trajectory. This allows the model to improve its tool-use and reasoning capabilities without requiring manually annotated intermediate steps.
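The two training ingredients named above can be sketched in a few lines. This is a minimal illustration, not the repo's implementation: GRPO normalizes each rollout's reward against its sampling group (so no value network is needed), and a loss mask restricts the objective to Planner-generated tokens.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each rollout's
    reward by the mean/std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

def masked_token_loss(token_losses, loss_mask):
    """Average per-token loss over Planner tokens only (loss_mask == 1);
    tool and Verifier tokens contribute no gradient."""
    kept = [l * m for l, m in zip(token_losses, loss_mask)]
    return sum(kept) / max(sum(loss_mask), 1)
```

Rollouts that beat their group mean get positive advantages, the rest negative; the masked loss is what lets the Planner learn from the full trajectory without training on tool output.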

Our code hub: https://github.com/LMIS-ORG/slime-agentic?tab=readme-ov-file

## Architecture

```
Input question
  │
  ▼
Planner.plan()              ← Analyze the problem and devise a solution strategy (loss_mask=1)
  │
  └─► for step in range(max_steps):
        │
        ├─ Planner.generate_next_step()    ← Select next tool and sub-goal (loss_mask=1)
        ├─ Executor.generate_tool_command()
        │  + execute_command()             ← Invoke tool (excluded from sequence)
        ├─ Verifier.verificate_context()   ← Decide whether to continue (excluded)
        └─ Memory.add_action()             ← Record execution result
  │
  ▼
Planner.generate_final_output()   ← Summarize results and produce final answer (loss_mask=0)
  │
  ▼
Rewarder.compute_reward()         ← LLM-as-Judge: compare model answer with ground truth
```
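The loop above can be sketched as a single rollout function. The class and method names simply mirror the diagram; the actual interfaces live in the linked repo.

```python
def run_episode(question, planner, executor, verifier, memory, rewarder,
                ground_truth, max_steps=10):
    """One Planner -> Executor -> Verifier rollout.
    Only Planner outputs are trained on (loss_mask=1)."""
    plan = planner.plan(question)                             # loss_mask=1
    for step in range(max_steps):
        sub_goal = planner.generate_next_step(plan, memory)   # loss_mask=1
        command = executor.generate_tool_command(sub_goal)
        result = executor.execute_command(command)            # excluded from sequence
        memory.add_action(sub_goal, command, result)
        if verifier.verificate_context(question, memory) == "STOP":
            break
    answer = planner.generate_final_output(question, memory)  # loss_mask=0
    reward = rewarder.compute_reward(answer, ground_truth)    # LLM-as-Judge
    return answer, reward
```

The reward computed at the end is the only supervision signal: it is turned into a group-relative advantage and propagated back through the masked Planner tokens.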

## Tools (tools/)

| Tool | Description |
|------|-------------|
| `base_generator` | General-purpose text generation tool; answers sub-tasks directly via LLM |
| `python_coder` | Python code generation and execution tool for math computation and algorithmic problem solving |

## Results

| Model | Dataset | Baseline | AgentFlow (Ours) | Improvement |
|-------|---------|----------|------------------|-------------|
| Qwen2.5-7B-Instruct | AIME 2024 | 10.0% | 30.0% | +20.0% |

Note: Due to limited training resources, the AgentFlow model was trained for only 100 steps.