Instructions to use ByteDance-Seed/UI-TARS-1.5-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ByteDance-Seed/UI-TARS-1.5-7B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ByteDance-Seed/UI-TARS-1.5-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ByteDance-Seed/UI-TARS-1.5-7B")
model = AutoModelForImageTextToText.from_pretrained("ByteDance-Seed/UI-TARS-1.5-7B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ByteDance-Seed/UI-TARS-1.5-7B with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ByteDance-Seed/UI-TARS-1.5-7B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance-Seed/UI-TARS-1.5-7B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```
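Since the endpoint is OpenAI-compatible, you can also call it from Python with the official `openai` client. A minimal sketch, assuming `pip install openai` and the server above running on localhost:8000 (for the SGLang server below, change the port to 30000):

```python
# A minimal sketch of calling the OpenAI-compatible endpoint from Python.
# Assumes the vLLM server above is reachable on localhost:8000; for the
# SGLang server below, use base_url="http://localhost:30000/v1".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused locally

response = client.chat.completions.create(
    model="ByteDance-Seed/UI-TARS-1.5-7B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```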
- SGLang
How to use ByteDance-Seed/UI-TARS-1.5-7B with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ByteDance-Seed/UI-TARS-1.5-7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance-Seed/UI-TARS-1.5-7B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```
Use Docker images
```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "ByteDance-Seed/UI-TARS-1.5-7B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "ByteDance-Seed/UI-TARS-1.5-7B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
                ]
            }
        ]
    }'
```
- Docker Model Runner
How to use ByteDance-Seed/UI-TARS-1.5-7B with Docker Model Runner:
```bash
docker model run hf.co/ByteDance-Seed/UI-TARS-1.5-7B
```
Bbox locating error

I use Midscene.js on the web; the action is to click the search box, but it clicks the wrong location. Is there a problem with the coordinate mapping?
What's more, the result of the test case in https://github.com/bytedance/UI-TARS/blob/main/README_deploy.md is
"Thought: I can see that the system settings interface is already open, but what's shown here are just basic system parameters, such as cache size and memory usage. To set the image's color mode, I first need to find the "Color Management" option. Let me look through the settings list on the left; it should be among those options.
Action: click(start_box='(197,549)')"
which returns a wrong box too.
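For reference, an action string like the one quoted above can be parsed into coordinates with a small helper. This is an illustrative sketch only; the real UI-TARS action space has more action types than this pattern covers:

```python
import re

def parse_click(action: str) -> tuple[int, int] | None:
    """Extract (x, y) from an action like: click(start_box='(197,549)').

    Illustrative helper only, not the official UI-TARS action grammar.
    """
    m = re.search(r"click\(start_box='\((\d+)\s*,\s*(\d+)\)'\)", action)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_click("click(start_box='(197,549)')"))  # -> (197, 549)
```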
I seem to have the same problem; I have encountered inaccurate coordinates when using Midscene and UI-TARS-desktop.
Perhaps this can solve the problem here.
I tested their new code, but the issue still seems to be there. In OSWorld, the model also seems to click on the same location multiple times.
Same for me, also with Qwen2.5-VL.
For the coordinate conversion issue, please refer to this tutorial; a minimal sketch of the conversion follows below.
Regarding the stuck behavior (clicking the same location repeatedly), we have indeed observed it on the 7B model. We plan to release the full UI-TARS-1.5 model in the future, which will include significant improvements on this issue. Stay tuned!
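A minimal sketch of that coordinate conversion, assuming the model reports pixel coordinates in the smart-resized image space used by Qwen2-VL-style preprocessing. The `factor`, `min_pixels`, and `max_pixels` defaults below are assumptions; check the processor config and the linked tutorial for the authoritative values.

```python
import math

def round_by_factor(n: float, factor: int) -> int:
    return round(n / factor) * factor

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56,
                 max_pixels: int = 14 * 14 * 4 * 1280) -> tuple[int, int]:
    """Rescale (height, width) the way Qwen2-VL-style preprocessing does:
    dimensions snap to a multiple of `factor`, total area is clamped."""
    h = max(factor, round_by_factor(height, factor))
    w = max(factor, round_by_factor(width, factor))
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w

def model_to_screen(x: int, y: int, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Map a model-space click back to original screen pixels."""
    resized_h, resized_w = smart_resize(screen_h, screen_w)
    return round(x * screen_w / resized_w), round(y * screen_h / resized_h)

# E.g. the click(start_box='(197,549)') above, on a 1920x1080 screen:
print(model_to_screen(197, 549, 1920, 1080))
```

If the raw model coordinates are used directly as screen coordinates, every click lands offset toward the top-left on large screens, which matches the symptom described above.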