RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents


Model Overview

The Qwen3-4B-RODS model is a Large Language Model (LLM) fine-tuned for complex, multi-turn Function Calling (FC) and agentic tool-use tasks. Built on the Qwen3-4B-Instruct base model, it was trained with the RODS (Reward-Driven Online Data Synthesis) framework combined with GRPO reinforcement learning.

RODS closes the loop between RL training and data generation: it repurposes the progress reward variance as a zero-cost capability boundary detector, continuously synthesizes structurally isomorphic training data at the agent's learning frontier, and manages a dynamic replay buffer that co-evolves with the policy. Starting from only 400 human-annotated seeds, RODS raises the BFCLv3 multi-turn overall score of the 4B base model from 22.13 to 56.00.

  • Base Model: Qwen3-4B-Instruct
  • Size: 4 Billion parameters
  • Key Capability: Advanced Multi-Turn Function Calling and Agentic Tool-Use

Evaluation Results

The model was evaluated on the Berkeley Function-Calling Leaderboard (BFCL).

BFCLv3 Multi-Turn Performance

| Model | Size | Multi-Turn (Overall) | Base | Miss Func | Miss Param | Long Context |
|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct (Base) | 4B | 22.13 | 26.50 | 21.00 | 15.50 | 25.50 |
| Qwen3-4B + RODS (ours) | 4B | 56.00 | 68.00 | 59.00 | 44.00 | 53.00 |
| Claude-Sonnet-4-5-20250929 | - | 61.38 | 69.00 | 65.00 | 52.50 | 59.00 |
| Grok-4-1-fast-reasoning | - | 58.88 | 70.50 | 59.50 | 43.00 | 62.50 |
| Kimi-K2-Instruct | 1043B | 50.63 | 62.00 | 41.00 | 44.50 | 55.00 |
| Qwen3-32B | 32B | 47.88 | 56.00 | 52.50 | 40.00 | 43.00 |
| DeepSeek-V3.2-Exp | 671B | 44.88 | 55.00 | 49.00 | 27.00 | 48.50 |
| GPT-4o-2024-11-20 | - | 42.50 | 55.50 | 34.50 | 29.00 | 51.00 |
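The reported Overall scores are consistent with an unweighted mean of the four category scores (Base, Miss Func, Miss Param, Long Context). This is an observation from the numbers in the table, not a statement about the official BFCL scoring code. A minimal check for the two 4B rows, where `category_scores` and `overall` are illustrative names:

```python
# Category scores copied from the table: (Base, Miss Func, Miss Param, Long Context)
category_scores = {
    "Qwen3-4B-Instruct (Base)": (26.50, 21.00, 15.50, 25.50),
    "Qwen3-4B + RODS (ours)": (68.00, 59.00, 44.00, 53.00),
}


def overall(scores):
    """Unweighted mean of the per-category scores."""
    return sum(scores) / len(scores)


base = overall(category_scores["Qwen3-4B-Instruct (Base)"])
rods = overall(category_scores["Qwen3-4B + RODS (ours)"])
print(f"base={base:.3f} rods={rods:.3f} gain={rods - base:.3f}")
```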

Training Data and Framework

RODS Framework

RODS is a closed-loop RL-data synthesis framework with three co-evolving modules:

  1. Reward-Based Boundary Detection: Uses GRPO rollout reward variance as a zero-cost probe to identify tasks at the agent's capability boundary, where gradient signal is richest.
  2. Skill-Aligned Synthesis Pipeline: A multi-agent pipeline (Planner → Executor → Rewriter → Critic) generates structurally isomorphic variants that preserve API topology and dependency depth while introducing novel narratives and environment states.
  3. Dynamic Replay Buffer Management: A dual-control lifecycle with staged injection and multi-layer retirement keeps the training pool anchored at the shifting capability boundary.
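The interplay between boundary detection and buffer management can be illustrated with a toy sketch. It is not the RODS implementation: the variance threshold, the `patience` rule, and every name below (`is_boundary_task`, `ReplayBuffer`, `inject`, `update`) are assumptions made for illustration, and the single retirement rule is far simpler than the staged injection and multi-layer retirement described above. Rewards are assumed to lie in [0, 1].

```python
from dataclasses import dataclass, field
from statistics import pvariance


def is_boundary_task(rewards, min_variance=0.01):
    """Hypothetical rule: a task sits at the capability boundary when its
    rollout rewards disagree, i.e. their variance exceeds a small threshold."""
    return pvariance(rewards) >= min_variance


@dataclass
class ReplayBuffer:
    """Toy pool of synthesized tasks with a cap and a simple retirement rule."""
    max_generated: int = 400  # mirrors the "up to 400 generated" figure in this card
    patience: int = 3         # hypothetical: low-variance rounds tolerated before retirement
    generated: dict = field(default_factory=dict)  # task_id -> consecutive low-variance rounds

    def inject(self, task_id):
        """Add a synthesized task if there is room; return True when it was added."""
        if task_id in self.generated or len(self.generated) >= self.max_generated:
            return False
        self.generated[task_id] = 0
        return True

    def update(self, task_id, rewards):
        """Reset the counter for boundary tasks; retire tasks that stay uninformative."""
        if task_id not in self.generated:
            return
        if is_boundary_task(rewards):
            self.generated[task_id] = 0
            return
        self.generated[task_id] += 1
        if self.generated[task_id] >= self.patience:
            del self.generated[task_id]


mixed = [1.0] * 8 + [0.0] * 8   # rollouts disagree: informative task
solved = [1.0] * 16             # every rollout succeeds: nothing left to learn

buffer = ReplayBuffer()
buffer.inject("synthetic-001")
for _ in range(buffer.patience):
    buffer.update("synthetic-001", solved)
print(is_boundary_task(mixed), is_boundary_task(solved), len(buffer.generated))
```

The variance comes for free because the rollout rewards are already computed during GRPO training, which is the sense in which the probe is zero-cost.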

Training Details

  • Method: GRPO (Group Relative Policy Optimization)
  • Rollouts: K=16 per prompt
  • Training stages:
    1. Format training (100 Base samples, format reward)
    2. Base reasoning (100 Base samples, progress reward)
    3. Full expansion (400 samples + dynamic synthesis, progress reward)
  • Synthesis backbone: Qwen3-32B via vLLM
  • Hardware: 8x A100 (training) + 8x A100 (synthesis)
  • Active training pool: ~800 samples (400 seeds + up to 400 generated)
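For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation in its commonly described form: each of the K rollouts for a prompt is scored, and its advantage is its reward standardized against the group's mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration. The exact normalization used to train this model is not specified in this card.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout reward against its own group (one prompt, K rollouts)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout receives the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


K = 16
partly_solved = [1.0] * 4 + [0.0] * (K - 4)  # 4 of 16 rollouts succeed
always_solved = [1.0] * K                    # all rollouts succeed

print(group_relative_advantages(partly_solved)[:1])  # advantage of one successful rollout
print(set(group_relative_advantages(always_solved)))  # a single value: no learning signal
```

When all K rewards are equal, every advantage is zero and the prompt contributes no policy gradient, which is why tasks with non-zero reward variance are the informative ones.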

Usage

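from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face Hub repository ID of the released checkpoint
model_name = "RuishanFang/Qwen3-4B-RODS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on the available devices (requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")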

For tool-use inference, follow the Qwen3 function calling format. The model expects tools to be provided in the system prompt and generates structured <tool_call> responses.
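As a rough illustration of handling the model's output, the sketch below extracts tool calls from generated text. It assumes the usual Qwen convention in which each `<tool_call>` block wraps a JSON object with `name` and `arguments` keys. The helper `parse_tool_calls`, the sample string, and the `get_weather` tool are hypothetical and not part of this repository.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in the generated text."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the whole turn
        if isinstance(call, dict) and "name" in call:
            calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


# Hypothetical model output containing one call to a made-up tool
sample = (
    "Let me check the weather.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Hangzhou"}}\n</tool_call>'
)
print(parse_tool_calls(sample))
```

On the input side, recent versions of transformers accept tool schemas through the `tools` argument of `tokenizer.apply_chat_template`, which lets the model's chat template render them into the prompt. Whether this checkpoint's template matches the upstream Qwen3 template should be verified before relying on it.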


Related Projects and Citation

This work is part of AWorld, an open-source project from InclusionAI.

If you use RODS in your research, please cite:

@article{fang2026rods,
  title={RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents},
  author={Fang, Ruishan and Lu, Siyuan and Zhuang, Chenyi and Lin, Tao},
  journal={arXiv preprint arXiv:2606.19047},
  year={2026}
}

Contact

For inquiries, please contact:

  • fangruishan@westlake.edu.cn