ToolCUA-8B
ToolCUA-8B is an end-to-end computer-use agent for orchestrating GUI actions and structured tool calls. It learns when to continue through GUI interaction, when to invoke tools, and when to switch back, enabling shorter and more reliable desktop task trajectories.
Method
ToolCUA uses a staged training pipeline for GUI-Tool path selection:
- Scale interleaved GUI-Tool trajectories from existing GUI-only data via trajectory-aware tool synthesis.
- Apply Tool-Bootstrapped GUI RFT to learn tool-calling behavior and calibrate local switching decisions.
- Optimize with Online Agentic RL in a GUI-Tool environment using a Tool-Efficient Path Reward for appropriate tool use and shorter execution paths.
Results
On feasible OSWorld-MCP tasks, ToolCUA-8B reaches 46.85% overall accuracy, 24.32% Tool Invocation Rate (TIR), and 14.93 Average Completion Steps (ACS). Compared with Qwen3-VL-8B-Instruct, it improves accuracy by +18.62, improves TIR by +15.91, and reduces ACS by 4.41 steps.
vLLM Serve
We recommend vLLM for deployment. Use vllm>=0.12.0 and enable --trust-remote-code.
MAX_IMAGE=${MAX_IMAGE:-5}
IMAGE_LIMIT_ARGS='{"image": '"$MAX_IMAGE"'}'
PIXEL_ARGS='{"size": {"longest_edge": 3072000, "shortest_edge": 65536}}'
vllm serve X-PLUG/ToolCUA-8B \
--trust-remote-code \
--max-model-len 32768 \
--mm-processor-kwargs "$PIXEL_ARGS" \
--limit-mm-per-prompt "$IMAGE_LIMIT_ARGS" \
--tensor-parallel-size 1 \
--allowed-local-media-path '/' \
--port 4243 \
--gpu-memory-utilization 0.85 \
--mm-processor-cache-gb 0 \
--no-enable-prefix-caching \
--enforce-eager \
--max-logprobs 50
Citation
@article{hu2026toolcua,
title={ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents},
author={Hu, Xuhao and Zhang, Xi and Xu, Haiyang and Qiao, Kyle and Yang, Jingyi and Huang, Xuanjing and Shao, Jing and Yan, Ming and Ye, Jieping},
journal={arXiv preprint arXiv:2605.12481},
year={2026}
}
- Downloads last month
- 78
Model tree for mPLUG/ToolCUA-8B
Paper for mPLUG/ToolCUA-8B
Paper โข 2605.12481 โข Published โข 23