Improve model card: Add pipeline tag, abstract, and project resources
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -1,8 +1,31 @@
 ---
 library_name: transformers
-
 ---

 # Introduction

 Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment (rewarding successful clicks) rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained using GRPO.
@@ -11,27 +34,26 @@ Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inh

 We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

-| **Model**
-|---
-| OpenAI CUA
-| Claude 3.7
-| JEDI-7B
-| SE-GUI
-| UI-TARS
-| UI-TARS-1.5*
-| UGround-v1-7B
-| Qwen2.5-VL-32B-Instruct | 32B | ✅
-| UGround-v1-72B
-| Qwen2.5-VL-72B-Instruct | 72B | ✅
-| UI-TARS
-| GTA1 (Ours)
-| GTA1 (Ours)
-| GTA1 (Ours)
-
-
->
-> -
-> - A dash (-) denotes results that are currently unavailable.
 > - A superscript asterisk (*) denotes our evaluated result.
 > - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
 > - Δ indicates the performance improvement (↑) of our model compared to its baseline.
@@ -119,10 +141,36 @@ generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(in
 output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

 # Extract and rescale coordinates
-pred_x, pred_y = extract_coordinates(output_text)
 pred_x*=scale_x
-pred_y*=scale_y
 print(pred_x,pred_y)
 ```

-Refer to our [code](https://github.com/Yan98/GTA1) for more details.
 ---
 library_name: transformers
+pipeline_tag: image-text-to-text
+license: apache-2.0
+tags:
+- gui
+- agent
+- visual-grounding
+- multimodal
+- reinforcement-learning
+- qwen2.5
 ---

+# GTA1: GUI Test-time Scaling Agent
+
+This repository contains the GUI grounding model presented in the paper [GTA1: GUI Test-time Scaling Agent](https://huggingface.co/papers/2507.05791).
+
+## Abstract
+
+Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here.
+
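The test-time scaling step described in the abstract (sample several candidate action proposals, then let a judge pick one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_action`, `sample_proposal`, and `judge_score` are hypothetical names, the judge is assumed to return a numeric score, and candidates are drawn sequentially rather than concurrently.

```python
from typing import Callable, List


def select_action(
    sample_proposal: Callable[[], str],
    judge_score: Callable[[str], float],
    num_candidates: int = 4,
) -> str:
    """Sample several candidate action proposals and return the judge's favorite.

    All names are hypothetical stand-ins: `sample_proposal` would wrap a
    planner call and `judge_score` would wrap a judge-model call.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # Draw the candidates (sequentially here; the abstract mentions concurrent sampling).
    candidates: List[str] = [sample_proposal() for _ in range(num_candidates)]
    # Keep the proposal the judge scores highest; ties go to the earliest candidate.
    return max(candidates, key=judge_score)
```

Raising `num_candidates` spends more computation per step in exchange for a better chance that one of the sampled proposals is a good one, which is the trade-off the abstract describes.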
+## Project Resources
+
+* **Paper:** [GTA1: GUI Test-time Scaling Agent](https://huggingface.co/papers/2507.05791)
+* **GitHub Repository:** [Yan98/GTA1](https://github.com/Yan98/GTA1)
+* **Project Page:** [yan98.github.io/GTA1](https://yan98.github.io/GTA1/)
+* **Blog Post:** [Grounding R1](https://huggingface.co/blog/HelloKKMe/grounding-r1)
+
 # Introduction

 Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment (rewarding successful clicks) rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained using GRPO.
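To make "rewarding successful clicks" concrete, here is a minimal sketch of a click-based reward and the group-relative advantage that GRPO computes from a group of sampled responses. The binary reward, the `(left, top, right, bottom)` box format, the use of the population standard deviation, and the names `click_reward` and `group_advantages` are assumptions made for illustration; the blog post linked above describes the setup actually used.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def click_reward(pred_x: float, pred_y: float, target: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target box, else 0.0."""
    left, top, right, bottom = target
    inside = left <= pred_x <= right and top <= pred_y <= bottom
    return 1.0 if inside else 0.0


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style advantages)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the reward depends only on where the click lands, nothing in this objective favors longer textual reasoning, which matches the argument made in the paragraph above.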
 We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

+| **Model** | **Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** |
+|---|:---:|:---:|:---:|:---:|:---:|
+| OpenAI CUA | - | ❌ | 87.9 | 23.4 | - |
+| Claude 3.7 | - | ❌ | 87.6 | 27.7 | - |
+| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
+| SE-GUI | 7B | ✅ | 90.3 | 47.0 | - |
+| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
+| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
+| UGround-v1-7B | 7B | ✅ | - | 31.1 | 36.4 |
+| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
+| UGround-v1-72B | 72B | ✅ | - | 34.5 | - |
+| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | 62.2* |
+| UI-TARS | 72B | ✅ | 90.3 | 38.1 | - |
+| GTA1 (Ours) | 7B | ✅ | 92.4 <sub>*(Δ +2.7)*</sub> | 50.1<sub>*(Δ +8.1)*</sub> | 67.7 <sub>*(Δ +3.5)*</sub> |
+| GTA1 (Ours) | 32B | ✅ | 93.2 <sub>*(Δ +1.3)*</sub> | 53.6 <sub>*(Δ +5.6)*</sub> | 61.9<sub>*(Δ +2.3)*</sub> |
+| GTA1 (Ours) | 72B | ✅ | 94.8<sub>*(Δ +0.8)*</sub> | 58.4 <sub>*(Δ +5.1)*</sub> | 66.7<sub>*(Δ +4.5)*</sub> |
+
+> **Note:**
+> - Model size is indicated in billions (B) of parameters.
+> - A dash (-) denotes results that are currently unavailable.
 > - A superscript asterisk (*) denotes our evaluated result.
 > - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
 > - Δ indicates the performance improvement (↑) of our model compared to its baseline.
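The improvement values shown in parentheses in the table can be reproduced by subtracting each baseline's score from the matching GTA1 score, using the baseline pairing given in the notes. The snippet below is a small consistency check over numbers copied from the table; the variable and function names are illustrative only.

```python
# Scores per model size: (ScreenSpot-V2, ScreenSpotPro, OSWORLD-G), copied from the table.
BASELINES = {
    "7B": (89.7, 42.0, 64.2),   # UI-TARS-1.5 7B
    "32B": (91.9, 48.0, 59.6),  # Qwen2.5-VL-32B-Instruct
    "72B": (94.0, 53.3, 62.2),  # Qwen2.5-VL-72B-Instruct
}
GTA1 = {
    "7B": (92.4, 50.1, 67.7),
    "32B": (93.2, 53.6, 61.9),
    "72B": (94.8, 58.4, 66.7),
}


def improvements(size: str) -> list:
    """Return GTA1 minus baseline for each benchmark, rounded to one decimal."""
    return [round(ours - base, 1) for ours, base in zip(GTA1[size], BASELINES[size])]


for size in GTA1:
    print(size, improvements(size))
```

Each printed list lines up with the parenthesized values in the corresponding GTA1 row.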
 output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

 # Extract and rescale coordinates
+pred_x, pred_y = extract_coordinates(output_text)
 pred_x*=scale_x
+pred_y*=scale_y
 print(pred_x,pred_y)
 ```

+Refer to our [code](https://github.com/Yan98/GTA1) for more details.
+
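The snippet above calls `extract_coordinates` and uses `scale_x` and `scale_y`, whose definitions fall outside this excerpt. As a rough sketch of what such helpers might look like, assuming the model answers with a point written as `(x, y)` in the coordinates of the resized input image: `parse_click_point` and `rescale_point` below are hypothetical names, not the repository's functions.

```python
import re
from typing import Optional, Tuple

# Matches "(x, y)" with integer or decimal values; the output format is an assumption.
_POINT_RE = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_click_point(text: str) -> Optional[Tuple[float, float]]:
    """Return the first "(x, y)" pair found in the model output, or None."""
    match = _POINT_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def rescale_point(
    point: Tuple[float, float],
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[float, float]:
    """Map a point from the resized image back to the original screenshot.

    Sizes are (width, height) in pixels.
    """
    x, y = point
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h
    return x * scale_x, y * scale_y
```

Returning `None` when no point is found lets the caller decide how to handle a malformed response instead of failing on tuple unpacking.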
+## Agent Performance
+
+Refer to an inference example [here](https://github.com/xlang-ai/OSWorld/pull/246/files#diff-2b758e4fafd9a52ee08bd6072f64297e4d880193fcf3f0e480da954a6711afa7).
+
+## Contact
+
+Please contact `yan.yang@anu.edu.au` for any queries.
+
+## Acknowledgement
+
+We thank the open-source projects: [VLM-R1](https://github.com/om-ai-lab/VLM-R1), [Jedi](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/jedi_7b_agent.py), and [Agent-S2](https://github.com/simular-ai/Agent-S).
+
+## Citation
+If you use this repository or find it helpful in your research, please cite it as follows:
+```bibtex
+@misc{yang2025gta1guitesttimescaling,
+  title={GTA1: GUI Test-time Scaling Agent},
+  author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li},
+  year={2025},
+  eprint={2507.05791},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI},
+  url={https://arxiv.org/abs/2507.05791},
+}
+```