Instructions for using Knowurknot/UI-TARS-1.5-7B with libraries, notebooks, and local apps. The sections below cover each option.
- Libraries
- Transformers
How to use Knowurknot/UI-TARS-1.5-7B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Knowurknot/UI-TARS-1.5-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Knowurknot/UI-TARS-1.5-7B")
model = AutoModelForImageTextToText.from_pretrained("Knowurknot/UI-TARS-1.5-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Knowurknot/UI-TARS-1.5-7B with vLLM:
Install from pip and serve the model:
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Knowurknot/UI-TARS-1.5-7B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Knowurknot/UI-TARS-1.5-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker:
```sh
docker model run hf.co/Knowurknot/UI-TARS-1.5-7B
```
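Since the vLLM server speaks the OpenAI chat-completions protocol, the curl call above translates directly to the official `openai` Python client. A minimal sketch, assuming the default vLLM port from the commands above (the API key is an arbitrary placeholder; vLLM does not check it unless configured to; for the SGLang server below, swap the port to 30000):

```python
from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Knowurknot/UI-TARS-1.5-7B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```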
- SGLang
How to use Knowurknot/UI-TARS-1.5-7B with SGLang:
Install from pip and serve the model:
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Knowurknot/UI-TARS-1.5-7B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Knowurknot/UI-TARS-1.5-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images:
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Knowurknot/UI-TARS-1.5-7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Knowurknot/UI-TARS-1.5-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use Knowurknot/UI-TARS-1.5-7B with Docker Model Runner:
```sh
docker model run hf.co/Knowurknot/UI-TARS-1.5-7B
```
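Docker Model Runner can also expose an OpenAI-compatible endpoint once host-side TCP access is enabled in Docker Desktop. The port and path below follow Docker's documented defaults at the time of writing and should be treated as assumptions to verify against your local setup:

```python
from openai import OpenAI

# Assumption: host-side TCP access to Docker Model Runner is enabled
# (Docker Desktop settings), and the documented default port/path apply.
client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf.co/Knowurknot/UI-TARS-1.5-7B",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
print(response.choices[0].message.content)
```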
# UI-TARS-1.5 Model

We shared the latest progress of UI-TARS-1.5, a model that excels at playing games and performing GUI tasks, in our blog.
## Introduction

UI-TARS-1.5 is an open-source multimodal agent built upon a powerful vision-language model. It can effectively perform diverse tasks within virtual worlds.
Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason through its thoughts before taking action, significantly enhancing its performance and adaptability, particularly in inference-time scaling. Our new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.
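The exact observation and action formats are specified in the repositories below. Purely as an illustration, here is a minimal sketch of how a controller might separate the model's reasoning from its action call, assuming a hypothetical `Thought: .../Action: ...` response layout (the `click` action string is likewise a stand-in, not the verbatim spec):

```python
import re

# Hypothetical response layout: free-form reasoning after "Thought:",
# then a single action call after "Action:". The real grammar is
# defined in the UI-TARS repository linked below.
response = (
    "Thought: The login button is in the top-right corner of the screenshot.\n"
    "Action: click(start_box='(892,45)')"
)

match = re.search(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.+)", response, re.DOTALL
)
if match:
    print("reasoning:", match.group("thought"))
    print("action to execute:", match.group("action"))
```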
Code: https://github.com/bytedance/UI-TARS
Application: https://github.com/bytedance/UI-TARS-desktop
## Performance

### Online Benchmark Evaluation
| Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|---|
| Computer Use | OSWorld (100 steps) | 42.5 | 36.4 | 28 | 38.1 (200 steps) |
| | Windows Agent Arena (50 steps) | 42.1 | - | - | 29.8 |
| Browser Use | WebVoyager | 84.8 | 87 | 84.1 | 87 |
| | Online-Mind2web | 75.8 | 71 | 62.9 | 71 |
| Phone Use | Android World | 64.2 | - | - | 59.5 |
### Grounding Capability Evaluation
| Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
|---|---|---|---|---|
| ScreenSpot-V2 | 94.2 | 87.9 | 87.6 | 91.6 |
| ScreenSpotPro | 61.6 | 23.4 | 27.7 | 43.6 |
### Poki Game
| Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
| Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
| UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
### Minecraft
| Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought |
|---|---|---|---|---|---|---|
| Mine Blocks | (oak_log) | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 |
| | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
| | (white_bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
| | 200 Tasks Avg. | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
| Kill Mobs | (mooshroom) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
| | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
| | (chicken) | 0.1 | 0.0 | 0.4 | 0.5 | 0.6 |
| | 100 Tasks Avg. | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |
### Model Scale Comparison

This table compares the performance of different UI-TARS model scales on the OSWorld and ScreenSpotPro benchmarks.
| Benchmark Type | Benchmark | UI-TARS-72B-DPO | UI-TARS-1.5-7B | UI-TARS-1.5 |
|---|---|---|---|---|
| Computer Use | OSWorld | 24.6 | 27.5 | 42.5 |
| GUI Grounding | ScreenSpotPro | 38.1 | 49.6 | 61.6 |
The released UI-TARS-1.5-7B focuses primarily on enhancing general computer-use capabilities and is not specifically optimized for game-based scenarios, where the larger UI-TARS-1.5 still holds a significant advantage.
## What's next
We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at TARS@bytedance.com.
## Citation

If you find our paper and model useful in your research, please cite:
```bibtex
@article{qin2025ui,
  title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
  journal={arXiv preprint arXiv:2501.12326},
  year={2025}
}
```