This repository contains the STT-Agent-RL model, obtained through online RL training that starts from STT-Agent-SFT.
Below is the overall Pass@1 performance of STT-Agent compared to other frontier models:
| Model | Easy | Medium | Hard | Impossible | Overall | Avg. Calls |
|---|---|---|---|---|---|---|
| Qwen-3-4B (baseline) | 18.31 | 9.46 | 2.82 | 10.00 | 10.57 | 7.63 |
| STT-Agent (w/o refine) | 28.17 | 16.92 | 11.86 | 47.01 | 23.10 | 32.70 |
| STT-Agent (with refine) | 26.76 | 17.41 | 13.56 | 61.11 | 25.11 | 15.30 |
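Trajectory refinement raises overall Pass@1 from 23.10 to 25.11 and cuts the average number of API calls per task from 32.70 to 15.30. The largest gain is on the Impossible tier (47.01 to 61.11), while Easy-tier accuracy dips slightly (28.17 to 26.76).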
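The effect of refinement can be quantified directly from the table. The short calculation below is only arithmetic on the rows with and without refinement; the variable names are illustrative and not part of any released code.

```python
# Overall Pass@1 and average API calls, copied from the table above.
without_refine = {"overall": 23.10, "avg_calls": 32.70}
with_refine = {"overall": 25.11, "avg_calls": 15.30}

# Absolute change in overall Pass@1, in percentage points.
pass_at_1_gain = with_refine["overall"] - without_refine["overall"]

# Relative reduction in the average number of API calls per task.
call_reduction = 1 - with_refine["avg_calls"] / without_refine["avg_calls"]

print(f"Overall Pass@1 gain: {pass_at_1_gain:+.2f} points")
print(f"API-call reduction: {call_reduction:.1%}")
```

In other words, the refined model needs roughly half as many calls for a gain of about two points in overall Pass@1.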
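```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this repository's Hugging Face model ID.
model_name = "{model_name}"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example tool-use prompt
prompt = "User: Book the cheapest flight from PVG to CDG.\n"
inputs = tokenizer(prompt, return_tensors="pt")
# Set an explicit generation budget; without it, output may be cut off early.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```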
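The decoded string still has to be turned into an actual tool invocation. This model card does not document STT-Agent's output format, so the sketch below assumes the Hermes-style `<tool_call>{...}</tool_call>` JSON wrapper used by Qwen3 chat templates. The `parse_tool_calls` helper, the `search_flights` tool name, and its arguments are all hypothetical.

```python
import json
import re

# Assumed format: one JSON object per <tool_call>...</tool_call> block.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return a list of (name, arguments) pairs found in generated text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON.
        calls.append((payload.get("name"), payload.get("arguments", {})))
    return calls


# Hypothetical model output containing a single tool call.
sample = (
    "I will look up fares first.\n"
    "<tool_call>\n"
    '{"name": "search_flights", "arguments": {"origin": "PVG", "destination": "CDG"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(sample))
```

If the model emits a different wrapper, only the regular expression needs to change; the rest of the loop stays the same.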
- Base model: Qwen-3-4B-Base
- SFT: 2,212 refined trajectories
- RL strategy: REINFORCE++
- Compute: 4× NVIDIA H200 GPUs

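REINFORCE++ is a critic-free policy-gradient method; one of its characteristic steps is standardizing advantages across the whole batch instead of learning a value baseline. The sketch below illustrates only that normalization step on made-up binary rewards. It omits the KL penalty and the clipped policy update, and it is not taken from the STT-Agent training code; the function name and reward values are assumptions.

```python
from statistics import mean, pstdev


def normalized_advantages(rewards, eps=1e-8):
    """Standardize per-trajectory rewards over the whole batch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical batch: 1.0 = task solved, 0.0 = task failed.
batch_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = normalized_advantages(batch_rewards)
print(advantages)
```

Successful trajectories receive positive advantages and failed ones negative, so the policy gradient pushes probability toward the former without a separate critic network.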
@misc{hui2026sttarenarealisticenvironmenttoolusing,
title={STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics},
author={Tingfeng Hui and Hao Xu and Pengyu Zhu and Hongsheng Xin and Kun Zhan and Sen Su and Chunxiao Liu and Ning Miao},
year={2026},
eprint={2605.18548},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.18548},
}