Improve model card: Add pipeline tag, license, and enhance overview
#1
by nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,13 +1,36 @@
|
|
| 1 |
---
|
| 2 |
library_name: transformers
|
| 3 |
-
tags:
|
| 4 |
---
|
| 5 |
|
| 6 |
-
#
|
| 7 |
|
| 8 |
Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained using GRPO.
|
| 9 |
|
| 10 |
-
# Performance
|
| 11 |
|
| 12 |
We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:
|
| 13 |
|
|
@@ -36,7 +59,7 @@ We follow the standard evaluation protocol and benchmark our model on three chal
|
|
| 36 |
> - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
|
| 37 |
> - ∆ indicates the performance improvement of our model over its baseline.
|
| 38 |
|
| 39 |
-
# Inference
|
| 40 |
Below is a code snippet demonstrating how to run inference using a trained model.
|
| 41 |
|
| 42 |
```python
|
|
@@ -125,4 +148,28 @@ pred_y*=scale_y
|
|
| 125 |
print(pred_x,pred_y)
|
| 126 |
```
|
| 127 |
|
| 128 |
-
|
| 1 |
---
|
| 2 |
library_name: transformers
|
| 3 |
+
tags:
|
| 4 |
+
- gui-agent
|
| 5 |
+
- vlm
|
| 6 |
+
- reinforcement-learning
|
| 7 |
+
- gui-grounding
|
| 8 |
+
- gui-automation
|
| 9 |
+
pipeline_tag: image-text-to-text
|
| 10 |
+
license: apache-2.0
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# GTA1: GUI Test-time Scaling Agent
|
| 14 |
+
|
| 15 |
+
This repository contains the model presented in [GTA1: GUI Test-time Scaling Agent](https://huggingface.co/papers/2507.05791).
|
| 16 |
+
|
| 17 |
+
<div align="center">
|
| 18 |
+
<img style="width: 100%" src="https://raw.githubusercontent.com/Yan98/GTA1/main/assets/img/model.png">
|
| 19 |
+
</div>
|
| 20 |
+
|
| 21 |
+
Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. This paper introduces **GTA1**, a GUI Test-time Scaling Agent, which addresses two main challenges: i) resolving ambiguity in task planning by introducing a test-time scaling method that samples and selects optimal action proposals; and ii) accurately grounding actions in complex, high-resolution interfaces. GTA1 proposes a model for improved grounding accuracy through reinforcement learning, leveraging inherent objective alignments that reward successful interactions.
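The "sample and select" idea in the paragraph above can be illustrated with a minimal best-of-N sketch. This is not the GTA1 implementation: `propose` and `score` are hypothetical callables standing in for a planner that samples one candidate action and a judge that rates it, and `n_samples` is an illustrative parameter.

```python
from typing import Callable, Sequence


def select_action(proposals: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the proposal with the highest score (first one wins on ties)."""
    if not proposals:
        raise ValueError("at least one proposal is required")
    return max(proposals, key=score)


def plan_step(propose: Callable[[], str],
              score: Callable[[str], float],
              n_samples: int = 4) -> str:
    """Sample n_samples candidate actions, then keep the best-scoring one.

    `propose` and `score` are hypothetical stand-ins for a planner and a judge.
    """
    proposals = [propose() for _ in range(n_samples)]
    return select_action(proposals, score)
```

Sampling more proposals spends more compute at test time in exchange for a better chance that one candidate is good, which is the trade-off that "test-time scaling" refers to.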
|
| 22 |
+
|
| 23 |
+
This model specifically focuses on the GUI grounding component of GTA1.
|
| 24 |
+
|
| 25 |
+
**[Code on GitHub](https://github.com/Yan98/GTA1)** | **[Paper](https://huggingface.co/papers/2507.05791)** | **[Blog](https://huggingface.co/blog/HelloKKMe/grounding-r1)**
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## About the Grounding Model
|
| 30 |
|
| 31 |
Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained using GRPO.
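As a rough illustration of "rewarding successful clicks", the sketch below scores a predicted click as 1 when it lands inside the target element's bounding box and 0 otherwise, then standardises rewards within a sampled group in the GRPO style. The box format, function names, and the `1e-6` stabiliser are assumptions made for this example, not details taken from the GTA1 training code.

```python
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
BBox = Tuple[float, float, float, float]


def click_reward(pred_x: float, pred_y: float, target: BBox) -> float:
    """Return 1.0 if the predicted click falls inside the target box, else 0.0."""
    left, top, right, bottom = target
    inside = left <= pred_x <= right and top <= pred_y <= bottom
    return 1.0 if inside else 0.0


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # The small constant avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

Because the reward depends only on where the click lands, a response gains nothing from longer reasoning text unless that text leads to a better click.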
|
| 32 |
|
| 33 |
+
## Performance
|
| 34 |
|
| 35 |
We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:
|
| 36 |
|
|
|
|
| 59 |
> - UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are applied as our baseline models.
|
| 60 |
> - ∆ indicates the performance improvement of our model over its baseline.
|
| 61 |
|
| 62 |
+
## Inference
|
| 63 |
Below is a code snippet demonstrating how to run inference using a trained model.
|
| 64 |
|
| 65 |
```python
|
|
|
|
| 148 |
print(pred_x,pred_y)
|
| 149 |
```
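The visible tail of the snippet multiplies the prediction by `scale_x` and `scale_y` before printing it. A minimal sketch of that step follows, assuming the model predicts coordinates in the resized image and that each scale factor is the ratio of original to resized size; the function name and the tuple arguments are hypothetical, and the elided part of the snippet may compute the factors differently.

```python
from typing import Tuple


def rescale_point(pred_x: float, pred_y: float,
                  resized_size: Tuple[int, int],
                  original_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a point from resized-image pixels back to original-image pixels.

    Both sizes are (width, height). Assumes scale = original / resized.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h
    return pred_x * scale_x, pred_y * scale_y


# Example: a click at (100, 50) on a 1000x500 input maps to a 2000x1000 screenshot.
print(rescale_point(100, 50, (1000, 500), (2000, 1000)))
```

Skipping this step would place the click at the wrong position whenever the processor resizes the screenshot.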
|
| 150 |
|
| 151 |
+
## Agent Performance
|
| 152 |
+
|
| 153 |
+
Refer to our [code](https://github.com/Yan98/GTA1) for more details on agent inference.
|
| 154 |
+
|
| 155 |
+
## Contact
|
| 156 |
+
|
| 157 |
+
Please contact `yan.yang@anu.edu.au` for any queries.
|
| 158 |
+
|
| 159 |
+
## Acknowledgement
|
| 160 |
+
|
| 161 |
+
We thank the open-source projects: [VLM-R1](https://github.com/om-ai-lab/VLM-R1), [Jedi](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/jedi_7b_agent.py), and [Agent-S2](https://github.com/simular-ai/Agent-S).
|
| 162 |
+
|
| 163 |
+
## Citation
|
| 164 |
+
If you use this repository or find it helpful in your research, please cite it as follows:
|
| 165 |
+
```bibtex
|
| 166 |
+
@misc{yang2025gta1guitesttimescaling,
|
| 167 |
+
title={GTA1: GUI Test-time Scaling Agent},
|
| 168 |
+
author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li},
|
| 169 |
+
year={2025},
|
| 170 |
+
eprint={2507.05791},
|
| 171 |
+
archivePrefix={arXiv},
|
| 172 |
+
primaryClass={cs.AI},
|
| 173 |
+
url={https://arxiv.org/abs/2507.05791},
|
| 174 |
+
}
|
| 175 |
+
```
|