# AReaL-SEA-235B-A22B — Interactive Tool-Using Agent

AReaL-SEA-235B-A22B is a multi-turn interactive tool-using agent fine-tuned from Qwen3-235B-A22B-Thinking-2507 via supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards.
- Paper: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
- Training Framework: AReaL — A Large-Scale Asynchronous Reinforcement Learning System
- Benchmark: τ²-bench
## Highlights
- Achieves 81.3% average pass^1 across all three τ²-bench domains, surpassing GPT-5 (80.0%) and Qwen3-Max-Thinking (80.7%).
- Trained entirely on self-evolving synthetic data — no human annotation required.
- End-to-end post-training (SFT → RL) powered by AReaL, using fully asynchronous GRPO with trajectory-level group-relative advantages and dynamic filtering.
## Performance

Mixed-domain training results on τ²-bench (trained on combined data from all three domains):
| Domain | pass^1 | pass^2 | pass^3 | pass^4 | pass@4 |
|---|---|---|---|---|---|
| Airline | 71.0 | 68.0 | 66.5 | 66.0 | 80.0 |
| Retail | 79.0 | 67.5 | 63.5 | 57.9 | 95.6 |
| Telecom | 93.0 | 88.6 | 81.6 | 81.6 | 100.0 |
| Average | 81.3 | 74.7 | 70.5 | 68.5 | 91.9 |
### Comparison with Frontier Models
| Model | Airline p^1 | Retail p^1 | Telecom p^1 | Avg p^1 |
|---|---|---|---|---|
| AReaL-SEA-235B-A22B | 71.0 | 79.0 | 93.0 | 81.3 |
| Gemini 3.0 Pro | 73.0 | 85.3 | 98.0 | 85.4 |
| Claude-Sonnet-4.5 | 70.0 | 86.2 | 98.0 | 84.7 |
| GPT-5 | 62.5 | 81.6 | 95.8 | 80.0 |
| Qwen3-Max-Thinking | 71.0 | 75.4 | 95.8 | 80.7 |
| Deepseek-v3.2 | 63.8 | 81.1 | 96.2 | 80.4 |
## Training

### Method
- Synthetic Data Generation: A hierarchical self-evolving multi-agent framework generates multi-turn tool-use dialogues with executable per-instance verification functions, covering three domains: Airline, Retail, and Telecom.
- Supervised Fine-Tuning (SFT): The base model is first fine-tuned on the synthetic dialogues.
- Reinforcement Learning (GRPO): The SFT checkpoint is further trained via GRPO with trajectory-level group-relative advantages, dynamic filtering, and verifier-based outcome rewards. A fine-tuned user model ensures stable rollouts.
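The core of the GRPO step can be sketched in a few lines. This is a minimal illustration of trajectory-level group-relative advantages plus dynamic filtering, not the actual AReaL implementation; the function names and the "drop groups whose rewards are all identical" filtering rule are assumptions for illustration.

```python
# Sketch: trajectory-level group-relative advantages with dynamic
# filtering, GRPO-style. Each group holds the outcome rewards of
# several rollouts of the same task.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """One advantage per trajectory: (r_i - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dynamic_filter(groups):
    """Drop groups whose rewards are all identical (no learning signal)."""
    return [g for g in groups if max(g) != min(g)]

groups = [
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: kept
    [1.0, 1.0, 1.0, 1.0],  # all-success group: filtered out
    [0.0, 0.0, 0.0, 0.0],  # all-failure group: filtered out
]
kept = dynamic_filter(groups)
advs = [group_relative_advantages(g) for g in kept]
```

With binary verifier rewards, all-success and all-failure groups carry zero gradient signal, which is why filtering them out keeps the effective batch informative.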
### Infrastructure
All RL training is conducted using the AReaL framework on 80 H200 GPUs (10 nodes). AReaL's fully asynchronous pipeline decouples rollout generation from policy training, maximizing GPU utilization for large-scale multi-turn agentic RL.
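The decoupling idea can be shown with a toy producer/consumer split; the real AReaL system spans many GPUs and processes, so this two-thread sketch only illustrates the shape of the asynchrony, with all names made up.

```python
# Toy sketch of an asynchronous rollout/training split: one thread
# generates trajectories while another consumes them for updates,
# so neither side idles waiting for the other.
import queue
import threading

rollout_q = queue.Queue(maxsize=8)  # bounded buffer between the two sides
NUM_ROLLOUTS = 16
trained = []

def rollout_worker():
    # Stand-in for trajectory generation with the current policy.
    for i in range(NUM_ROLLOUTS):
        rollout_q.put({"traj_id": i, "reward": i % 2})
    rollout_q.put(None)  # sentinel: no more rollouts

def trainer():
    # Consumes trajectories as they arrive instead of per-batch syncing.
    while True:
        item = rollout_q.get()
        if item is None:
            break
        trained.append(item["traj_id"])

t1 = threading.Thread(target=rollout_worker)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start()
t1.join(); t2.join()
```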
### Hyperparameters

| | SFT | RL |
|---|---|---|
| Batch Size | 128 | 256 (16×16) |
| Learning Rate | 1e-5 | 1e-5 |
| Epochs / Steps | 10 epochs | — |
| Max Context Length | 32,768 | 32,768 |
| Max Gen Tokens / Turn | — | 8,192 |
| Temperature | — | 1.0 |
## Training Data
This repo includes the synthetic training data:
| File | Description | Samples |
|---|---|---|
| `sft_merge.jsonl` | SFT training data (all 3 domains) | 33,531 |
| `rl_merge.jsonl` | RL training data with verification functions | 1,982 |
| `tau2_rl_database/` | Environment database states for RL rollouts | — |
### Data Format

Each sample in `rl_merge.jsonl` contains:

- `id`: Unique task identifier (e.g., `airline_1`, `telecom_1`)
- `user_scenario`: User persona, instructions, known information, and behavioral guidance
- `evaluation_criteria`: Ground-truth action sequences and assertion-based verification functions
- `db_path`: Path to the corresponding environment database
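Loading a sample is plain JSONL parsing. The field names below come from the format above; the sample values and the database filename are invented for illustration.

```python
# Sketch: read one rl_merge.jsonl record and access its fields.
# The values here are made-up placeholders, not real training data.
import json

line = json.dumps({
    "id": "airline_1",
    "user_scenario": {"persona": "frequent flyer", "instructions": "..."},
    "evaluation_criteria": {"actions": [], "assertions": []},
    "db_path": "tau2_rl_database/airline_1.sqlite",  # hypothetical path
})

sample = json.loads(line)
print(sample["id"], sample["db_path"])
```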
## Usage
The model can be used as a drop-in replacement for any Qwen3-235B-A22B-compatible inference setup. For τ²-bench evaluation:
- Follow the τ²-bench evaluation protocol.
- Use GPT-4.1 as the user simulator for fair comparison.
- Report pass^k metrics (all k attempts must succeed).
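The two metrics in the tables can be sketched with the usual combinatorial estimators: pass^k is the chance that k sampled attempts all succeed, pass@k the chance that at least one does. These are the common conventions for such metrics; confirm the exact formulas against the τ²-bench protocol before reporting numbers.

```python
# Sketch: pass^k and pass@k estimators from n trials with c successes.
from math import comb

def pass_hat_k(n, c, k):
    """P(all k sampled attempts succeed) = C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)

def pass_at_k(n, c, k):
    """P(at least one of k attempts succeeds) = 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 trials of one task, 3 of which succeeded.
print(pass_hat_k(4, 3, 2))  # 0.5
print(pass_at_k(4, 3, 2))   # 1.0
```

As k grows, pass^k tightens (every attempt must succeed) while pass@k loosens, which matches the spread between the pass^4 and pass@4 columns above.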
For integration with the AReaL training framework, refer to the Tau2 Customer Service example.
## Citation

```bibtex
@article{gao2025sea,
  title={From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents},
  author={Gao, Jiaxuan and Chen, Jiaao and He, Chuyi and Wang, Wei-Chen and Xu, Shusheng and Wang, Hanrui and Jin, Di and Wu, Yi},
  journal={arXiv preprint arXiv:2601.22607},
  year={2025}
}

@article{fu2025areal,
  title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author={Fu, Wei and Gao, Jiaxuan and Shen, Xujie and Zhu, Chen and Mei, Zhiyu and He, Chuyi and Xu, Shusheng and Wei, Guo and Mei, Jun and Wang, Jiashu and Yang, Tongkai and Yuan, Binhang and Wu, Yi},
  journal={arXiv preprint arXiv:2505.24298},
  year={2025}
}
```