Add link to paper (#3), opened by nielsr (HF Staff)
README.md
CHANGED
|
@@ -9,7 +9,6 @@ tags:
|
|
| 9 |
library_name: transformers
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
| 13 |
# UI-TARS-72B-DPO
|
| 14 |
[UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) |
|
| 15 |
[UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) |
|
|
@@ -69,35 +68,6 @@ Code: https://github.com/bytedance/UI-TARS
|
|
| 69 |
| **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
|
| 70 |
|
| 71 |
|
| 72 |
-
- **ScreenSpot**
|
| 73 |
-
|
| 74 |
-
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
| 75 |
-
|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
|
| 76 |
-
| **Agent Framework** | | | | | | | |
|
| 77 |
-
| GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
|
| 78 |
-
| GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
|
| 79 |
-
| GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
|
| 80 |
-
| GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
|
| 81 |
-
| GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
|
| 82 |
-
| **Agent Model** | | | | | | | |
|
| 83 |
-
| GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
|
| 84 |
-
| GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
|
| 85 |
-
| CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
|
| 86 |
-
| SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
|
| 87 |
-
| Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
|
| 88 |
-
| UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
|
| 89 |
-
| Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
|
| 90 |
-
| OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
|
| 91 |
-
| Claude Computer Use | - | - | - | - | - | - | **83.0** |
|
| 92 |
-
| Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
|
| 93 |
-
| Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
|
| 94 |
-
| Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
|
| 95 |
-
| **Our Model** | | | | | | | |
|
| 96 |
-
| **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
|
| 97 |
-
| **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
|
| 98 |
-
| **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
|
| 99 |
-
|
| 100 |
-
|
| 101 |
- **ScreenSpot v2**
|
| 102 |
|
| 103 |
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
|
@@ -116,49 +86,6 @@ Code: https://github.com/bytedance/UI-TARS
|
|
| 116 |
| **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
|
| 117 |
|
| 118 |
|
| 119 |
-
**Offline Agent Capability Evaluation**
|
| 120 |
-
- **Multimodal Mind2Web**
|
| 121 |
-
|
| 122 |
-
| Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
|
| 123 |
-
|--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
|
| 124 |
-
| **Agent Framework** | | | | | | | | | |
|
| 125 |
-
| GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
|
| 126 |
-
| GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
|
| 127 |
-
| GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
|
| 128 |
-
| GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
|
| 129 |
-
| **Agent Model** | | | | | | | | | |
|
| 130 |
-
| GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
|
| 131 |
-
| GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
|
| 132 |
-
| GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
|
| 133 |
-
| GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
|
| 134 |
-
| Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
|
| 135 |
-
| Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
|
| 136 |
-
| CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
|
| 137 |
-
| Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
|
| 138 |
-
| **Our Model** | | | | | | | | | |
|
| 139 |
-
| **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
|
| 140 |
-
| **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
|
| 141 |
-
| **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
- **Android Control and GUI Odyssey**
|
| 145 |
-
|
| 146 |
-
| Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
|
| 147 |
-
|---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
|
| 148 |
-
| Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
|
| 149 |
-
| GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
|
| 150 |
-
| SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
|
| 151 |
-
| InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
|
| 152 |
-
| Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
|
| 153 |
-
| Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
|
| 154 |
-
| OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
|
| 155 |
-
| OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
|
| 156 |
-
| Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
|
| 157 |
-
| Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
|
| 158 |
-
| **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
|
| 159 |
-
| **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
|
| 160 |
-
| **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
|
| 161 |
-
|
| 162 |
**Online Agent Capability Evaluation**
|
| 163 |
|
| 164 |
| Method | OSWorld (Online) | AndroidWorld (Online) |
|
|
@@ -182,6 +109,273 @@ Code: https://github.com/bytedance/UI-TARS
|
|
| 182 |
| **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
|
| 183 |
| **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
|
| 184 |
|
| 185 |
|
| 186 |
## Citation
|
| 187 |
If you find our paper and model useful in your research, please consider citing us.
|
|
|
|
| 9 |
library_name: transformers
|
| 10 |
---
|
| 11 |
|
|
|
|
| 12 |
# UI-TARS-72B-DPO
|
| 13 |
[UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) |
|
| 14 |
[UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) |
|
|
|
|
| 68 |
| **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
|
| 69 |
|
| 70 |
|
| 71 |
- **ScreenSpot v2**
|
| 72 |
|
| 73 |
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|
|
|
|
| 86 |
| **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
|
| 87 |
|
| 88 |
|
| 89 |
**Online Agent Capability Evaluation**
|
| 90 |
|
| 91 |
| Method | OSWorld (Online) | AndroidWorld (Online) |
|
|
|
|
| 109 |
| **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
|
| 110 |
| **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
|
| 111 |
|
| 112 |
+
## Deployment
|
| 113 |
+
|
| 114 |
+
### Cloud Deployment
|
| 115 |
+
We recommend using Hugging Face Inference Endpoints for fast deployment.
|
| 116 |
+
We provide two guides for reference:
|
| 117 |
+
|
| 118 |
+
English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
|
| 119 |
+
|
| 120 |
+
Chinese version: [GUI Model Deployment Tutorial](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
|
| 121 |
+
|
| 122 |
+
### Local Deployment [Transformers]
|
| 123 |
+
Local deployment with Transformers follows the same procedure as Qwen2-VL; see this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for details.
|
| 124 |
+
|
| 125 |
+
### Local Deployment [vLLM]
|
| 126 |
+
We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
|
| 127 |
+
```bash
|
| 128 |
+
pip install -U transformers
|
| 129 |
+
VLLM_VERSION=0.6.6
|
| 130 |
+
CUDA_VERSION=cu124
|
| 131 |
+
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
|
| 132 |
+
|
| 133 |
+
```
|
| 134 |
+
#### Download the Model
|
| 135 |
+
We provide three model sizes on Hugging Face: **2B**, **7B**, and **72B**. For the best performance, we recommend the **7B-DPO** or **72B-DPO** model, depending on your GPU configuration:
|
| 136 |
+
|
| 137 |
+
- [2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)
|
| 138 |
+
- [7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)
|
| 139 |
+
- [7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
|
| 140 |
+
- [72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)
|
| 141 |
+
- [72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
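
As a minimal sketch of fetching one of the checkpoints listed above, the snippet below maps a size and variant to its repository ID and downloads it with `huggingface_hub.snapshot_download`. The helper names `repo_id` and `download` are hypothetical and not part of the UI-TARS project; only the repository IDs come from the list above.

```python
# Repository IDs taken from the model list above.
REPOS = {
    ("2B", "SFT"): "bytedance-research/UI-TARS-2B-SFT",
    ("7B", "SFT"): "bytedance-research/UI-TARS-7B-SFT",
    ("7B", "DPO"): "bytedance-research/UI-TARS-7B-DPO",
    ("72B", "SFT"): "bytedance-research/UI-TARS-72B-SFT",
    ("72B", "DPO"): "bytedance-research/UI-TARS-72B-DPO",
}


def repo_id(size, variant):
    """Return the Hub repository ID for a listed size/variant pair (hypothetical helper)."""
    try:
        return REPOS[(size.upper(), variant.upper())]
    except KeyError:
        raise ValueError(f"No listed checkpoint for {size}-{variant}") from None


def download(size, variant, local_dir=None):
    """Download the checkpoint and return its local path (hypothetical helper)."""
    # Imported lazily so the mapping above can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(size, variant), local_dir=local_dir)
```

Calling `download("7B", "DPO")` would then return a local directory that can be passed as the model path when starting the server.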
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
#### Start an OpenAI API Service
|
| 145 |
+
Run the command below to start an OpenAI-compatible API service:
|
| 146 |
+
|
| 147 |
+
```bash
|
| 148 |
+
python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
You can then call the chat API as shown below, passing the GUI prompt (choose the mobile or computer template) together with a base64-encoded local image; see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details. The same service can also be used from [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop):
|
| 152 |
+
```python
|
| 153 |
+
import base64
|
| 154 |
+
from openai import OpenAI
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
instruction = "search for today's weather"
|
| 158 |
+
screenshot_path = "screenshot.png"
|
| 159 |
+
client = OpenAI(
|
| 160 |
+
base_url="http://127.0.0.1:8000/v1",
|
| 161 |
+
api_key="empty",
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
# Below is the prompt for computer
|
| 165 |
+
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
|
| 166 |
+
|
| 167 |
+
## Output Format
|
| 168 |
+
```\nThought: ...
|
| 169 |
+
Action: ...\n```
|
| 170 |
+
|
| 171 |
+
## Action Space
|
| 172 |
+
|
| 173 |
+
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 174 |
+
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 175 |
+
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 176 |
+
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
|
| 177 |
+
hotkey(key='')
|
| 178 |
+
type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
|
| 180 |
+
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
|
| 181 |
+
wait() #Sleep for 5s and take a screenshot to check for any changes.
|
| 182 |
+
finished()
|
| 183 |
+
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
|
| 184 |
+
|
| 185 |
+
|
| 186 |
+
## Note
|
| 187 |
+
- Use Chinese in `Thought` part.
|
| 188 |
+
- Summarize your next action (with its target element) in one sentence in `Thought` part.
|
| 189 |
+
|
| 190 |
+
## User Instruction
|
| 191 |
+
"""
|
| 192 |
+
|
| 193 |
+
with open(screenshot_path, "rb") as image_file:
|
| 194 |
+
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
|
| 195 |
+
response = client.chat.completions.create(
|
| 196 |
+
model="ui-tars",
|
| 197 |
+
messages=[
|
| 198 |
+
{
|
| 199 |
+
"role": "user",
|
| 200 |
+
"content": [
|
| 201 |
+
{"type": "text", "text": prompt + instruction},
|
| 202 |
+
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
|
| 203 |
+
],
|
| 204 |
+
},
|
| 205 |
+
],
|
| 206 |
+
frequency_penalty=1,
|
| 207 |
+
max_tokens=128,
|
| 208 |
+
)
|
| 209 |
+
print(response.choices[0].message.content)
|
| 210 |
+
```
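
The reply printed by the script above is plain text. The following is a minimal sketch of splitting it into its parts, assuming the model follows the `Thought: ... Action: ...` layout requested in the prompt and writes points as `(x,y)` integer pairs. The function name `parse_response` and the sample reply are illustrative assumptions, not output captured from the model.

```python
import re


def parse_response(text):
    """Split a 'Thought: ... Action: ...' reply into thought, action, name and points."""
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", text, re.DOTALL)
    if match is None:
        return None  # the reply did not follow the expected layout
    thought = match.group(1)
    action = match.group(2).strip()
    # The action name is the identifier before the first opening parenthesis.
    name_match = re.match(r"(\w+)\(", action)
    # Collect every "(x,y)" integer pair, e.g. start_box and end_box of a drag.
    points = [(int(x), int(y)) for x, y in re.findall(r"\((\d+),\s*(\d+)\)", action)]
    return {
        "thought": thought,
        "action": action,
        "name": name_match.group(1) if name_match else None,
        "points": points,
    }


# Hypothetical reply, written to match the output format in the prompt.
sample = "Thought: Click the search box.\nAction: click(start_box='<|box_start|>(235,512)<|box_end|>')"
parsed = parse_response(sample)
```

Real replies may deviate from this layout, so callers should handle a `None` result or an empty `points` list.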
|
| 211 |
+
|
| 212 |
+
For single-step grounding tasks, or for inference on grounding datasets such as SeeClick, refer to the following script:
|
| 213 |
+
```python
|
| 214 |
+
import base64
|
| 215 |
+
from openai import OpenAI
|
| 216 |
+
|
| 217 |
+
|
| 218 |
+
instruction = "search for today's weather"
|
| 219 |
+
screenshot_path = "screenshot.png"
|
| 220 |
+
client = OpenAI(
|
| 221 |
+
base_url="http://127.0.0.1:8000/v1",
|
| 222 |
+
api_key="empty",
|
| 223 |
+
)
|
| 224 |
+
|
| 225 |
+
# Below is the prompt for grounding
|
| 226 |
+
prompt = r"""Output only the coordinate of one point in your response. What element matches the following task: """
|
| 227 |
+
|
| 228 |
+
with open(screenshot_path, "rb") as image_file:
|
| 229 |
+
encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
|
| 230 |
+
response = client.chat.completions.create(
|
| 231 |
+
model="ui-tars",
|
| 232 |
+
messages=[
|
| 233 |
+
{
|
| 234 |
+
"role": "user",
|
| 235 |
+
"content": [
|
| 236 |
+
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
|
| 237 |
+
{"type": "text", "text": prompt + instruction}
|
| 238 |
+
],
|
| 239 |
+
},
|
| 240 |
+
],
|
| 241 |
+
frequency_penalty=1,
|
| 242 |
+
max_tokens=128,
|
| 243 |
+
)
|
| 244 |
+
print(response.choices[0].message.content)
|
| 245 |
+
```
|
| 246 |
+
|
| 247 |
+
### Prompt Templates
|
| 248 |
+
We currently provide two prompt templates for stable operation and performance: one for mobile scenarios and one for personal-computer scenarios.
|
| 249 |
+
- Prompt template for mobile:
|
| 250 |
+
```python
|
| 251 |
+
## Below is the prompt for mobile
|
| 252 |
+
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
|
| 253 |
+
|
| 254 |
+
## Output Format
|
| 255 |
+
```\nThought: ...
|
| 256 |
+
Action: ...\n```
|
| 257 |
+
|
| 258 |
+
## Action Space
|
| 259 |
+
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 260 |
+
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
|
| 261 |
+
type(content='')
|
| 262 |
+
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
|
| 263 |
+
press_home()
|
| 264 |
+
press_back()
|
| 265 |
+
finished(content='') # Submit the task regardless of whether it succeeds or fails.
|
| 266 |
+
|
| 267 |
+
## Note
|
| 268 |
+
- Use English in `Thought` part.
|
| 269 |
+
|
| 270 |
+
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
|
| 271 |
+
|
| 272 |
+
## User Instruction
|
| 273 |
+
"""
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
- Prompt template for computer:
|
| 277 |
+
```python
|
| 278 |
+
## Below is the prompt for computer
|
| 279 |
+
prompt = r"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
|
| 280 |
+
|
| 281 |
+
## Output Format
|
| 282 |
+
```\nThought: ...
|
| 283 |
+
Action: ...\n```
|
| 284 |
+
|
| 285 |
+
## Action Space
|
| 286 |
+
|
| 287 |
+
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 288 |
+
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 289 |
+
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
|
| 290 |
+
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
|
| 291 |
+
hotkey(key='')
|
| 292 |
+
type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
|
| 294 |
+
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
|
| 295 |
+
wait() #Sleep for 5s and take a screenshot to check for any changes.
|
| 296 |
+
finished()
|
| 297 |
+
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
|
| 298 |
+
|
| 299 |
+
|
| 300 |
+
## Note
|
| 301 |
+
- Use Chinese in `Thought` part.
|
| 302 |
+
- Summarize your next action (with its target element) in one sentence in `Thought` part.
|
| 303 |
+
|
| 304 |
+
## User Instruction
|
| 305 |
+
"""
|
| 306 |
+
```
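
Both templates end with `## User Instruction`, so the task text is appended directly to the prompt. A minimal sketch of building the request payload from a template, an instruction and raw image bytes is shown below. The helper name `build_messages` and its `mime` parameter are assumptions for illustration; the message layout mirrors the chat API example earlier in this README.

```python
import base64


def build_messages(prompt, instruction, image_bytes, mime="image/png"):
    """Return an OpenAI-style message list with the prompt text followed by one screenshot."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                # The instruction is appended after the template's "## User Instruction" line.
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }
    ]
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.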
|
| 307 |
+
|
| 308 |
+
### Local Deployment [Ollama]
|
| 309 |
+
<!-- Ollama can deploy the model via gguf format. Bugs exist for safetensors. -->Ollama support is coming soon. Thank you for your patience. 😊
|
| 310 |
+
<!-- #### Get the model in GGUF format
|
| 311 |
+
We provide 2B and 7B model in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:
|
| 312 |
+
|
| 313 |
+
2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf
|
| 314 |
+
|
| 315 |
+
7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf
|
| 316 |
+
|
| 317 |
+
Users can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
|
| 318 |
+
|
| 319 |
+
```bash
|
| 320 |
+
python3 convert_hf_to_gguf.py <path to your model>
|
| 321 |
+
```
|
| 322 |
+
|
| 323 |
+
The GGUF file will be generated under the path provided.
|
| 324 |
+
|
| 325 |
+
#### Deploy GGUF model
|
| 326 |
+
We deploy the model by following Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).
|
| 327 |
+
|
| 328 |
+
```bash
|
| 329 |
+
# Create Modelfile, Windows users can just create a file named Modelfile
|
| 330 |
+
echo "FROM ./path/to/model.gguf" > Modelfile
|
| 331 |
+
|
| 332 |
+
# Create model in Ollama
|
| 333 |
+
ollama create ui-tars -f Modelfile
|
| 334 |
+
|
| 335 |
+
# Run the model
|
| 336 |
+
ollama run ui-tars
|
| 337 |
+
|
| 338 |
+
```
|
| 339 |
+
|
| 340 |
+
Test script is same as vLLM except two changes:
|
| 341 |
+
|
| 342 |
+
```python
|
| 343 |
+
...
|
| 344 |
+
client = OpenAI(
|
| 345 |
+
base_url="http://127.0.0.1:11434/v1/",
|
| 346 |
+
...
|
| 347 |
+
)
|
| 348 |
+
...
|
| 349 |
+
response = client.chat.completions.create(
|
| 350 |
+
model="ui-tars" # the name we create via Ollama cli
|
| 351 |
+
...
|
| 352 |
+
)
|
| 353 |
+
|
| 354 |
+
``` -->
|
| 355 |
+
|
| 356 |
+
### Explanation of Inference Results
|
| 357 |
+
|
| 358 |
+
#### Coordinate Mapping
|
| 359 |
+
The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can be calculated by:
|
| 360 |
+
- X absolute = X relative × image width
|
| 361 |
+
- Y absolute = Y relative × image height
|
| 362 |
+
|
| 363 |
+
For example, given a screen size of 1920 × 1080 and a model output of (235, 512), the absolute X is `round(1920*235/1000)=451` and the absolute Y is `round(1080*512/1000)=553`, so the absolute coordinate is (451, 553).
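
The mapping above can be sketched as a small helper. The function name `to_absolute` and the `scale` parameter are illustrative assumptions; the factor of 1000 and the rounding follow the description in this section.

```python
def to_absolute(point, image_width, image_height, scale=1000):
    """Map a model-output point in [0, scale] to absolute pixel coordinates."""
    x_rel, y_rel = point
    x_abs = round(image_width * x_rel / scale)
    y_abs = round(image_height * y_rel / scale)
    return (x_abs, y_abs)


# Worked example from the text: a 1920 x 1080 screen and a model output of (235, 512).
print(to_absolute((235, 512), 1920, 1080))  # (451, 553)
```

Pass the dimensions of the screenshot that was actually sent to the model, since the output is relative to that image.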
|
| 364 |
+
|
| 365 |
+
## Use in desktop and web automation
|
| 366 |
+
|
| 367 |
+
To try the UI-TARS agent on desktop, see [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop). We recommend using the **7B/72B DPO model** on desktop.
|
| 368 |
+
|
| 369 |
+
[Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details about setting up the model.
|
| 370 |
+
|
| 371 |
+
## License
|
| 372 |
+
|
| 373 |
+
UI-TARS is licensed under the Apache License 2.0.
|
| 374 |
+
|
| 375 |
+
## Acknowledgements
|
| 376 |
+
This project builds upon and extends the capabilities of Qwen-2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen-2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.
|
| 377 |
+
|
| 378 |
+
Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.
|
| 379 |
|
| 380 |
## Citation
|
| 381 |
If you find our paper and model useful in your research, please consider citing us.
|