# AReaL-SEA-235B-A22B — Interactive Tool-Using Agent

AReaL-SEA-235B-A22B is a multi-turn interactive tool-using agent fine-tuned from Qwen3-235B-A22B-Thinking-2507 via supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards.
- Paper: From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
- Training Framework: AReaL — A Large-Scale Asynchronous Reinforcement Learning System
- Benchmark: τ²-bench
## Highlights
- Achieves 81.3% average pass^1 across all three τ²-bench domains, surpassing GPT-5 (80.0%) and Qwen3-Max-Thinking (80.7%).
- Trained entirely on self-evolving synthetic data — no human annotation required.
- End-to-end post-training (SFT → RL) powered by AReaL, using fully asynchronous GRPO with trajectory-level group-relative advantages and dynamic filtering.
## Performance

Mixed-domain training results on τ²-bench (trained on combined data from all three domains):
| Domain | pass^1 | pass^2 | pass^3 | pass^4 | pass@4 |
|---|---|---|---|---|---|
| Airline | 71.0 | 68.0 | 66.5 | 66.0 | 80.0 |
| Retail | 79.0 | 67.5 | 63.5 | 57.9 | 95.6 |
| Telecom | 93.0 | 88.6 | 81.6 | 81.6 | 100.0 |
| Average | 81.3 | 74.7 | 70.5 | 68.5 | 91.9 |
### Comparison with Frontier Models
| Model | Airline p^1 | Retail p^1 | Telecom p^1 | Avg p^1 |
|---|---|---|---|---|
| AReaL-SEA-235B-A22B | 71.0 | 79.0 | 93.0 | 81.3 |
| Gemini 3.0 Pro | 73.0 | 85.3 | 98.0 | 85.4 |
| Claude-Sonnet-4.5 | 70.0 | 86.2 | 98.0 | 84.7 |
| GPT-5 | 62.5 | 81.6 | 95.8 | 80.0 |
| Qwen3-Max-Thinking | 71.0 | 75.4 | 95.8 | 80.7 |
| Deepseek-v3.2 | 63.8 | 81.1 | 96.2 | 80.4 |
## Training

### Method
- Synthetic Data Generation: A hierarchical self-evolving multi-agent framework generates multi-turn tool-use dialogues with executable per-instance verification functions, covering three domains: Airline, Retail, and Telecom.
- Supervised Fine-Tuning (SFT): The base model is first fine-tuned on the synthetic dialogues.
- Reinforcement Learning (GRPO): The SFT checkpoint is further trained via GRPO with trajectory-level group-relative advantages, dynamic filtering, and verifier-based outcome rewards. A fine-tuned user model ensures stable rollouts.
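The core of the GRPO step can be sketched in a few lines. This is a minimal illustration of trajectory-level group-relative advantages plus dynamic filtering, not the actual AReaL implementation; the function names and the "drop groups whose rewards are all identical" filtering rule are assumptions for illustration.

```python
# Sketch: trajectory-level group-relative advantages with dynamic
# filtering, GRPO-style. Each group holds the outcome rewards of
# several rollouts of the same task.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """One advantage per trajectory: (r_i - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dynamic_filter(groups):
    """Drop groups whose rewards are all identical (no learning signal)."""
    return [g for g in groups if max(g) != min(g)]

groups = [
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: kept
    [1.0, 1.0, 1.0, 1.0],  # all-success group: filtered out
    [0.0, 0.0, 0.0, 0.0],  # all-failure group: filtered out
]
kept = dynamic_filter(groups)
advs = [group_relative_advantages(g) for g in kept]
```

With binary verifier rewards, all-success and all-failure groups carry zero gradient signal, which is why filtering them out keeps the effective batch informative.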
### Infrastructure
All RL training is conducted using the AReaL framework on 80 H200 GPUs (10 nodes). AReaL's fully asynchronous pipeline decouples rollout generation from policy training, maximizing GPU utilization for large-scale multi-turn agentic RL.
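The decoupling idea can be shown with a toy producer/consumer split; the real AReaL system spans many GPUs and processes, so this two-thread sketch only illustrates the shape of the asynchrony, with all names made up.

```python
# Toy sketch of an asynchronous rollout/training split: one thread
# generates trajectories while another consumes them for updates,
# so neither side idles waiting for the other.
import queue
import threading

rollout_q = queue.Queue(maxsize=8)  # bounded buffer between the two sides
NUM_ROLLOUTS = 16
trained = []

def rollout_worker():
    # Stand-in for trajectory generation with the current policy.
    for i in range(NUM_ROLLOUTS):
        rollout_q.put({"traj_id": i, "reward": i % 2})
    rollout_q.put(None)  # sentinel: no more rollouts

def trainer():
    # Consumes trajectories as they arrive instead of per-batch syncing.
    while True:
        item = rollout_q.get()
        if item is None:
            break
        trained.append(item["traj_id"])

t1 = threading.Thread(target=rollout_worker)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start()
t1.join(); t2.join()
```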
### Hyperparameters

| | SFT | RL |
|---|---|---|
| Batch Size | 128 | 256 (16×16) |
| Learning Rate | 1e-5 | 1e-5 |
| Epochs / Steps | 10 epochs | — |
| Max Context Length | 32,768 | 32,768 |
| Max Gen Tokens / Turn | — | 8,192 |
| Temperature | — | 1.0 |
## Training Data
This repo includes the synthetic training data:
| File | Description | Samples |
|---|---|---|
| `sft_merge.jsonl` | SFT training data (all 3 domains) | 33,531 |
| `rl_merge.jsonl` | RL training data with verification functions | 1,982 |
| `tau2_rl_database/` | Environment database states for RL rollouts | — |
### Data Format

Each sample in `rl_merge.jsonl` contains:

- `id`: Unique task identifier (e.g., `airline_1`, `telecom_1`)
- `user_scenario`: User persona, instructions, known information, and behavioral guidance
- `evaluation_criteria`: Ground-truth action sequences and assertion-based verification functions
- `db_path`: Path to the corresponding environment database
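Loading a sample is plain JSONL parsing. The field names below come from the format above; the sample values and the database filename are invented for illustration.

```python
# Sketch: read one rl_merge.jsonl record and access its fields.
# The values here are made-up placeholders, not real training data.
import json

line = json.dumps({
    "id": "airline_1",
    "user_scenario": {"persona": "frequent flyer", "instructions": "..."},
    "evaluation_criteria": {"actions": [], "assertions": []},
    "db_path": "tau2_rl_database/airline_1.sqlite",  # hypothetical path
})

sample = json.loads(line)
print(sample["id"], sample["db_path"])
```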
## Usage
The model can be used as a drop-in replacement for any Qwen3-235B-A22B-compatible inference setup. For τ²-bench evaluation:
- Follow the τ²-bench evaluation protocol.
- Use GPT-4.1 as the user simulator for fair comparison.
- Report pass^k metrics (all k attempts must succeed).
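The two metrics in the tables can be sketched with the usual combinatorial estimators: pass^k is the chance that k sampled attempts all succeed, pass@k the chance that at least one does. These are the common conventions for such metrics; confirm the exact formulas against the τ²-bench protocol before reporting numbers.

```python
# Sketch: pass^k and pass@k estimators from n trials with c successes.
from math import comb

def pass_hat_k(n, c, k):
    """P(all k sampled attempts succeed) = C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)

def pass_at_k(n, c, k):
    """P(at least one of k attempts succeeds) = 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 trials of one task, 3 of which succeeded.
print(pass_hat_k(4, 3, 2))  # 0.5
print(pass_at_k(4, 3, 2))   # 1.0
```

As k grows, pass^k tightens (every attempt must succeed) while pass@k loosens, which matches the spread between the pass^4 and pass@4 columns above.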
For integration with the AReaL training framework, refer to the Tau2 Customer Service example.
## Citation

```bibtex
@article{gao2025sea,
  title={From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents},
  author={Gao, Jiaxuan and Chen, Jiaao and He, Chuyi and Wang, Wei-Chen and Xu, Shusheng and Wang, Hanrui and Jin, Di and Wu, Yi},
  journal={arXiv preprint arXiv:2601.22607},
  year={2025}
}

@article{fu2025areal,
  title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author={Fu, Wei and Gao, Jiaxuan and Shen, Xujie and Zhu, Chen and Mei, Zhiyu and He, Chuyi and Xu, Shusheng and Wei, Guo and Mei, Jun and Wang, Jiashu and Yang, Tongkai and Yuan, Binhang and Wu, Yi},
  journal={arXiv preprint arXiv:2505.24298},
  year={2025}
}
```