AReaL-SEA-235B-A22B — Interactive Tool-Using Agent

AReaL-SEA-235B-A22B is a multi-turn interactive tool-using agent fine-tuned from Qwen3-235B-A22B-Thinking-2507 via supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR).

Highlights

  • Achieves 81.3% average pass^1 across all three τ²-bench domains, surpassing GPT-5 (80.0%) and Qwen3-Max-Thinking (80.7%).
  • Trained entirely on self-evolving synthetic data — no human annotation required.
  • End-to-end post-training (SFT → RL) powered by AReaL, using fully asynchronous GRPO with trajectory-level group-relative advantages and dynamic filtering.

Performance

Mix training results on τ²-bench (trained on combined data from all three domains):

| Domain  | pass^1 | pass^2 | pass^3 | pass^4 | pass@4 |
|---------|--------|--------|--------|--------|--------|
| Airline | 71.0   | 68.0   | 66.5   | 66.0   | 80.0   |
| Retail  | 79.0   | 67.5   | 63.5   | 57.9   | 95.6   |
| Telecom | 93.0   | 88.6   | 81.6   | 81.6   | 100.0  |
| Average | 81.3   | 74.7   | 70.5   | 68.5   | 91.9   |
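pass^k (from the τ-bench evaluation protocol) is the fraction of tasks for which all k sampled attempts succeed, while pass@4 requires at least one of 4 attempts to succeed. With n trials per task and c successes, an unbiased per-task estimate of pass^k is C(c, k) / C(n, k), averaged over tasks. A minimal sketch of that estimator (function names are ours, not from τ²-bench):

```python
from math import comb

def pass_hat_k(successes: int, n_trials: int, k: int) -> float:
    """Unbiased estimate of P(all k sampled attempts succeed),
    given `successes` passing attempts out of `n_trials`."""
    if k > n_trials:
        raise ValueError("k must not exceed the number of trials")
    return comb(successes, k) / comb(n_trials, k)

def domain_pass_hat_k(per_task_successes: list[int], n_trials: int, k: int) -> float:
    """Average pass^k over a domain (one success count per task)."""
    return sum(pass_hat_k(c, n_trials, k) for c in per_task_successes) / len(per_task_successes)

# Toy example: 3 tasks, 4 trials each, with 4, 2, and 0 passing trials.
print(domain_pass_hat_k([4, 2, 0], n_trials=4, k=1))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Note that pass^k decreases with k (all attempts must pass), while pass@k increases with k, which matches the trends in the table above.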

Comparison with Frontier Models

| Model               | Airline pass^1 | Retail pass^1 | Telecom pass^1 | Avg pass^1 |
|---------------------|----------------|---------------|----------------|------------|
| AReaL-SEA-235B-A22B | 71.0           | 79.0          | 93.0           | 81.3       |
| Gemini 3.0 Pro      | 73.0           | 85.3          | 98.0           | 85.4       |
| Claude-Sonnet-4.5   | 70.0           | 86.2          | 98.0           | 84.7       |
| GPT-5               | 62.5           | 81.6          | 95.8           | 80.0       |
| Qwen3-Max-Thinking  | 71.0           | 75.4          | 95.8           | 80.7       |
| Deepseek-v3.2       | 63.8           | 81.1          | 96.2           | 80.4       |

Training

Method

  1. Synthetic Data Generation: A hierarchical self-evolving multi-agent framework generates multi-turn tool-use dialogues with executable per-instance verification functions, covering three domains: Airline, Retail, and Telecom.
  2. Supervised Fine-Tuning (SFT): The base model is first fine-tuned on the synthetic dialogues.
  3. Reinforcement Learning (GRPO): The SFT checkpoint is further trained via GRPO with trajectory-level group-relative advantages, dynamic filtering, and verifier-based outcome rewards. A fine-tuned user model ensures stable rollouts.

Infrastructure

All RL training is conducted using the AReaL framework on 80 H200 GPUs (10 nodes). AReaL's fully asynchronous pipeline decouples rollout generation from policy training, maximizing GPU utilization for large-scale multi-turn agentic RL.
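The decoupling behaves like a producer-consumer pipeline: rollout workers push finished trajectories into a queue while the trainer consumes a batch as soon as one is available, so neither side idles waiting for the other. A toy single-process illustration of this pattern (the real AReaL system distributes it across nodes):

```python
import queue
import threading

traj_queue: "queue.Queue[list[int]]" = queue.Queue(maxsize=8)
N_TRAJS, BATCH = 12, 4
trained_batches = []

def rollout_worker():
    # Stand-in for generation: each "trajectory" is a list of token ids,
    # pushed as soon as it finishes, with no synchronization barrier.
    for i in range(N_TRAJS):
        traj_queue.put([i] * 3)

def trainer():
    # Consumes trajectories the moment a batch is ready, without
    # waiting for all rollouts to finish (asynchronous training).
    for _ in range(N_TRAJS // BATCH):
        batch = [traj_queue.get() for _ in range(BATCH)]
        trained_batches.append(batch)  # stand-in for a policy update

t1 = threading.Thread(target=rollout_worker)
t2 = threading.Thread(target=trainer)
t1.start(); t2.start(); t1.join(); t2.join()
print(len(trained_batches))  # 3 batches consumed while rollouts streamed in
```

In multi-turn agentic RL, rollouts have highly variable length (different numbers of tool calls and user turns), which is exactly where this asynchrony pays off over lock-step generate-then-train loops.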

Hyperparameters

| Hyperparameter        | SFT       | RL          |
|-----------------------|-----------|-------------|
| Batch Size            | 128       | 256 (16×16) |
| Learning Rate         | 1e-5      | 1e-5        |
| Epochs / Steps        | 10 epochs | —           |
| Max Context Length    | 32,768    | 32,768      |
| Max Gen Tokens / Turn | —         | 8,192       |
| Temperature           | —         | 1.0         |

Training Data

This repo includes the synthetic training data:

| File                | Description                                  | Samples |
|---------------------|----------------------------------------------|---------|
| `sft_merge.jsonl`   | SFT training data (all 3 domains)            | 33,531  |
| `rl_merge.jsonl`    | RL training data with verification functions | 1,982   |
| `tau2_rl_database/` | Environment database states for RL rollouts  | —       |

Data Format

Each sample in rl_merge.jsonl contains:

  • id: Unique task identifier (e.g., airline_1, telecom_1)
  • user_scenario: User persona, instructions, known information, and behavioral guidance
  • evaluation_criteria: Ground-truth action sequences and assertion-based verification functions
  • db_path: Path to the corresponding environment database
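Samples can be read as standard JSON Lines. The record below is fabricated to mirror the field list above (its values and nested keys are illustrative, not copied from the dataset):

```python
import io
import json

# Fabricated record following the documented schema of rl_merge.jsonl.
fake_jsonl = io.StringIO(json.dumps({
    "id": "airline_1",
    "user_scenario": {"persona": "...", "instructions": "...",
                      "known_information": "...", "behavioral_guidance": "..."},
    "evaluation_criteria": {"actions": ["..."], "verification_functions": ["..."]},
    "db_path": "tau2_rl_database/airline_1.json",
}) + "\n")

for line in fake_jsonl:
    sample = json.loads(line)
    # Every RL sample carries the four documented top-level fields.
    assert {"id", "user_scenario", "evaluation_criteria", "db_path"} <= sample.keys()
    print(sample["id"], "->", sample["db_path"])
```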

Usage

The model can be used as a drop-in replacement for any Qwen3-235B-A22B-compatible inference setup. For τ²-bench evaluation:

```
# Follow the τ²-bench evaluation protocol
# Use GPT-4.1 as the user simulator for fair comparison
# Report pass^k metrics (a task passes at k only if all k attempts succeed)
```

For integration with the AReaL training framework, refer to the Tau2 Customer Service example.

Citation

```bibtex
@article{gao2025sea,
  title={From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents},
  author={Gao, Jiaxuan and Chen, Jiaao and He, Chuyi and Wang, Wei-Chen and Xu, Shusheng and Wang, Hanrui and Jin, Di and Wu, Yi},
  journal={arXiv preprint arXiv:2601.22607},
  year={2025}
}

@article{fu2025areal,
  title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
  author={Fu, Wei and Gao, Jiaxuan and Shen, Xujie and Zhu, Chen and Mei, Zhiyu and He, Chuyi and Xu, Shusheng and Wei, Guo and Mei, Jun and Wang, Jiashu and Yang, Tongkai and Yuan, Binhang and Wu, Yi},
  journal={arXiv preprint arXiv:2505.24298},
  year={2025}
}
```