Introduction
Reinforcement learning (RL), for example GRPO, helps with GUI grounding because its objective is aligned with the task: it rewards successful clicks rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT, GRPO directly incentivizes actionable, grounded responses. Building on the findings in our blog, we release state-of-the-art GUI grounding models trained with GRPO.
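To make the reward signal concrete, here is a minimal sketch, not the authors' training code. It assumes a binary reward that is 1 when the predicted click lands inside the target element's bounding box, and it normalizes rewards within a group of sampled responses, which is the group-relative idea behind GRPO. The names `click_reward` and `group_advantages` and all numeric values are hypothetical.

```python
from statistics import mean, pstdev


def click_reward(point, bbox):
    """Return 1.0 if the (x, y) point lies inside bbox = (left, top, right, bottom), else 0.0."""
    x, y = point
    left, top, right, bottom = bbox
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled click predictions for the same instruction (illustrative values)
target_bbox = (100, 200, 180, 240)
samples = [(120, 220), (300, 50), (175, 239), (90, 220)]
rewards = [click_reward(p, target_bbox) for p in samples]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Under this assumed setup, samples that hit the target receive a positive advantage and misses receive a negative one, so no reward is attached to the length of any accompanying reasoning text.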
Grounding Performance
We follow the standard evaluation protocol and benchmark our models on three challenging datasets: ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G (reported together with its refined variant, OSWorld-G-Refined). Our method consistently achieves the best results among all open-source model families. The comparative results are shown below:
| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWorld-G | OSWorld-G-Refined |
|---|---|---|---|---|---|---|
| OpenAI CUA | — | ❌ | 87.9 | 23.4 | — | — |
| Claude 3.7 | — | ❌ | 87.6 | 27.7 | — | — |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 | — |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | — | — |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 | — |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 52.8* | 64.2* |
| UGround-v1-7B | 7B | ✅ | — | 31.1 | — | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 46.5 | 59.6* |
| UGround-v1-72B | 72B | ✅ | — | 34.5 | — | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.0* | 53.3 | — | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | — | — |
| OpenCUA | 7B | ✅ | 92.3 | 50.0 | 55.3 | 68.3* |
| OpenCUA | 32B | ✅ | 93.4 | 55.3 | 59.6 | 70.2* |
| GTA1-2507 (Ours) | 7B | ✅ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 55.1 (∆ +2.3) | 67.7 (∆ +3.5) |
| GTA1 (Ours) | 7B | ✅ | 93.4 (∆ +0.1) | 55.5 (∆ +5.5) | 60.1 (∆ +4.8) | 68.8 (∆ +0.5) |
| GTA1 (Ours) | 32B | ✅ | 95.2 (∆ +1.8) | 63.6 (∆ +8.3) | 65.2 (∆ +5.6) | 72.2 (∆ +2.0) |
Note:
- Model size is indicated in billions (B) of parameters.
- A dash (—) denotes results that are currently unavailable.
- An asterisk (*) denotes a result from our own evaluation.
- UI-TARS-1.5 7B, OpenCUA-7B, and OpenCUA-32B serve as the baseline models for GTA1-2507 (7B), GTA1 (7B), and GTA1 (32B), respectively.
- ∆ indicates the improvement of our model over its baseline.
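As a worked example of how ∆ is read, the snippet below recomputes the improvements for the GTA1-2507 row from the scores in the table, using UI-TARS-1.5 7B as the baseline. The variable names are illustrative; the values are copied from the table in column order.

```python
# Scores copied from the grounding table, in column order
baseline_ui_tars_15_7b = [89.7, 42.0, 52.8, 64.2]
gta1_2507_7b = [92.4, 50.1, 55.1, 67.7]

# Improvement over the baseline, rounded to one decimal like the table
deltas = [round(ours - base, 1) for ours, base in zip(gta1_2507_7b, baseline_ui_tars_15_7b)]
print(deltas)  # [2.7, 8.1, 2.3, 3.5]
```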
Agent Performance
OSWorld and OSWorld-Verified Benchmarks
We evaluate our models on the OSWorld and OSWorld-Verified benchmarks following the standard evaluation protocol. The results demonstrate strong performance across both datasets.
| Agent Model | Step | OSWorld | OSWorld-Verified |
|---|---|---|---|
| Proprietary Models | | | |
| Claude 3.7 Sonnet | 100 | 28.0 | — |
| OpenAI CUA 4o | 200 | 38.1 | — |
| UI-TARS-1.5 | 100 | 42.5 | 41.8 |
| OpenAI CUA o3 | 200 | 42.9 | — |
| Open-Source Models | | | |
| Aria-UI w/ GPT-4o | 15 | 15.2 | — |
| Aguvis-72B w/ GPT-4o | 15 | 17.0 | — |
| UI-TARS-72B-SFT | 50 | 18.8 | — |
| Agent S w/ Claude-3.5-Sonnet | 15 | 20.5 | — |
| Agent S w/ GPT-4o | 15 | 20.6 | — |
| UI-TARS-72B-DPO | 15 | 22.7 | — |
| UI-TARS-72B-DPO | 50 | 24.6 | — |
| UI-TARS-1.5-7B | 100 | 26.9 | 27.4 |
| Jedi-7B w/ o3 | 100 | — | 51.0 |
| Jedi-7B w/ GPT-4o | 100 | 27.0 | — |
| Agent S2 w/ Claude-3.7-Sonnet | 50 | 34.5 | — |
| Agent S2 w/ Gemini-2.5-Pro | 50 | 41.4 | 45.8 |
| Agent S2.5 w/ o3 | 100 | — | 56.0 |
| Agent S2.5 w/ GPT-5 | 100 | — | 58.4 |
| CoAct-1 w/ o3 & o4-mini & OpenAI CUA 4o | 150 | — | 60.8 |
| GTA1-7B-2507 w/ o3 | 100 | 45.2 | 53.1 |
| GTA1-7B-2507 w/ GPT-5 | 100 | — | 61.0 |
| GTA1-32B w/ o3 | 100 | — | 55.4 |
| GTA1-32B w/ GPT-5 | 100 | — | 63.4 |
Note: A dash (—) indicates unavailable results.
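Entries such as "GTA1-7B-2507 w/ o3" are read here as pairing a planner model with the grounding model, with the Step column as the per-task action budget. The loop below is a minimal sketch of that pairing under stated assumptions, not the evaluation harness behind these results: `run_episode`, `plan_next_action`, `ground`, and `execute_click` are hypothetical names standing in for the loop, the planner, the grounding model, and the environment.

```python
from typing import Callable, Optional, Tuple


def run_episode(
    plan_next_action: Callable[[str, list], Optional[str]],
    ground: Callable[[str], Tuple[int, int]],
    execute_click: Callable[[int, int], None],
    task: str,
    max_steps: int = 100,
) -> int:
    """Run one task and return the number of actions executed."""
    history: list = []
    for step in range(max_steps):
        # Planner proposes a natural-language action, or None when it considers the task done
        action = plan_next_action(task, history)
        if action is None:
            return step
        # Grounding model maps the description to screen coordinates
        x, y = ground(action)
        execute_click(x, y)
        history.append((action, (x, y)))
    return max_steps


# Stub components so the sketch runs without any model or desktop environment
script = iter(["click the Save button", "click the OK button", None])
clicks = []
steps = run_episode(
    plan_next_action=lambda task, history: next(script),
    ground=lambda description: (100, 200),
    execute_click=lambda x, y: clicks.append((x, y)),
    task="save the document",
)
print(steps, clicks)
```

Keeping the planner and the grounder behind separate callables mirrors how the table swaps planners (o3, GPT-5) while holding the grounding model fixed.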
WindowsAgentArena Benchmark
We also evaluate our models on the WindowsAgentArena benchmark, demonstrating strong performance in Windows-specific GUI automation tasks.
| Agent Model | Step | Success Rate |
|---|---|---|
| Kimi-VL | 15 | 10.4 |
| WAA | — | 19.5 |
| Jedi w/ GPT-4o | 100 | 33.7 |
| GTA1-7B-2507 w/ o3 | 100 | 47.9 |
| GTA1-7B-2507 w/ GPT-5 | 100 | 49.2 |
| GTA1-32B w/ o3 | 100 | 51.2 |
| GTA1-32B w/ GPT-5 | 100 | 50.6 |
Note: A dash (—) indicates unavailable results.
Inference
Below is a code snippet demonstrating how to run inference using a trained model.
from PIL import Image
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import re

SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''
SYSTEM_PROMPT = SYSTEM_PROMPT.strip()


# Extract the first (x, y) pair from the model output; returns (0, 0) if none is found
def extract_coordinates(raw_string):
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        # Convert via float first: the regex accepts decimals, which int() cannot parse directly
        return [tuple(int(float(value)) for value in match) for match in matches][0]
    except (IndexError, ValueError):
        return 0, 0


# Load model and processor
model_path = "HelloKKMe/GTA1-7B"
max_new_tokens = 32
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=3136,
    max_pixels=4096 * 2160
)

# Load and resize image
image = Image.open("file path")
instruction = "description"  # Instruction for grounding
width, height = image.width, image.height
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
    min_pixels=processor.image_processor.min_pixels,
    max_pixels=processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
# Factors that map coordinates in the resized image back to the original image
scale_x, scale_y = width / resized_width, height / resized_height

# Prepare system and user messages
system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width)
}
user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": instruction}
    ]
}

# Tokenize and prepare inputs
image_inputs, video_inputs = process_vision_info([system_message, user_message])
text = processor.apply_chat_template([system_message, user_message], tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate prediction (greedy decoding)
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, temperature=1.0, use_cache=True)
# Keep only the newly generated tokens by dropping the prompt tokens
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

# Extract and rescale coordinates to the original image resolution
pred_x, pred_y = extract_coordinates(output_text)
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)
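The final step of the snippet maps the predicted point from the resized image back to the original resolution. As a worked illustration with made-up numbers (a 3840x2160 screenshot resized to 1920x1080, and a hypothetical model output), the helper below parses the first coordinate pair and rescales it. `parse_and_rescale` is an illustrative name, not part of the release; unlike the snippet above, it returns `None` on a parsing failure instead of `(0, 0)`.

```python
import re


def parse_and_rescale(output_text, original_size, resized_size):
    """Parse the first "(x, y)" pair and map it to original-image pixels.

    Returns None when no coordinate pair is found, so callers can tell a
    parsing failure apart from a genuine prediction at the origin.
    """
    match = re.search(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", output_text)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    orig_w, orig_h = original_size
    res_w, res_h = resized_size
    return x * orig_w / res_w, y * orig_h / res_h


print(parse_and_rescale("(960, 540)", (3840, 2160), (1920, 1080)))  # (1920.0, 1080.0)
print(parse_and_rescale("no coordinates here", (3840, 2160), (1920, 1080)))  # None
```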
Ethical Considerations
This model is released for research and educational purposes. While our model demonstrates strong performance on GUI benchmarks, users should carefully evaluate its suitability for their specific use cases.
Important Considerations:
- Accuracy Limitations: Like all AI systems, this model may produce incorrect outputs or fail to accurately identify GUI elements in certain scenarios.
- Safety and Security: Exercise caution when deploying GUI automation agents, especially in production environments where incorrect actions could affect system integrity or data security.
- Human Oversight: We recommend maintaining appropriate human supervision when using this model for automated GUI interactions.
- Compliance: Users are responsible for ensuring their use of this model complies with applicable laws, regulations, and organizational policies.
Recommended Best Practices:
- Thoroughly test the model in controlled environments before production deployment
- Implement safeguards and error handling mechanisms
- Consider the potential impact of automated actions on user systems and data
- Regularly monitor and validate model performance in your specific domain
For further guidance on use cases, refer to our AUP and AI AUP.
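One way to act on the safeguard and oversight recommendations above is to validate every predicted point before it reaches the operating system. The following is a minimal sketch under assumptions: `safe_click`, `click`, and `confirm` are hypothetical names, and the confirmation callback stands in for whatever human-review mechanism a deployment uses.

```python
from typing import Callable, Tuple


def safe_click(
    point: Tuple[float, float],
    screen_size: Tuple[int, int],
    click: Callable[[int, int], None],
    confirm: Callable[[int, int], bool],
) -> bool:
    """Click only if the point is on screen and the reviewer approves it."""
    x, y = point
    width, height = screen_size
    # Reject predictions that fall outside the visible screen
    if not (0 <= x < width and 0 <= y < height):
        return False
    xi, yi = int(x), int(y)
    # Ask the reviewer (human or policy check) before acting
    if not confirm(xi, yi):
        return False
    click(xi, yi)
    return True


performed = []
record = lambda x, y: performed.append((x, y))
print(safe_click((1920.0, 1080.0), (3840, 2160), record, lambda x, y: True))
print(safe_click((5000.0, 10.0), (3840, 2160), record, lambda x, y: True))
print(safe_click((10.0, 10.0), (3840, 2160), record, lambda x, y: False))
print(performed)
```

Only the first call performs a click: the second point is off screen and the third is declined by the reviewer.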
Citation
If you're using any GTA model or find it helpful in your research, please cite it as follows:
@article{yang2025gta1guitesttimescaling,
  title={GTA1: GUI Test-time Scaling Agent},
  author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Silvio Savarese and Caiming Xiong and Junnan Li},
  year={2025},
  eprint={2507.05791},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.05791},
}