---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
  - gui
  - agent
  - visual-grounding
  - multimodal
  - reinforcement-learning
  - qwen2.5
---

GTA1: GUI Test-time Scaling Agent

This repository contains the GUI grounding model presented in the paper [GTA1: GUI Test-time Scaling Agent](https://arxiv.org/abs/2507.05791).

Abstract

Graphical user interface (GUI) agents autonomously operate across platforms (e.g., Linux) to complete tasks by interacting with visual elements. Specifically, a user instruction is decomposed into a sequence of action proposals, each corresponding to an interaction with the GUI. After each action, the agent observes the updated GUI environment to plan the next step. However, two main challenges arise: i) resolving ambiguity in task planning (i.e., the action proposal sequence), where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the two aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, to select the most appropriate action proposal, we introduce a test-time scaling method. At each step, we sample multiple candidate action proposals and leverage a judge model to evaluate and select the most suitable one. It trades off computation for better decision quality by concurrent sampling, shortening task execution steps, and improving overall performance. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates visual grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, our method establishes state-of-the-art performance across diverse benchmarks. For example, GTA1-7B achieves 50.1%, 92.4%, and 67.7% accuracies on Screenspot-Pro, Screenspot-V2, and OSWorld-G, respectively. When paired with a planner applying our test-time scaling strategy, it exhibits state-of-the-art agentic performance (e.g., 45.2% task success rate on OSWorld). We open-source our code and models here.
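The test-time scaling idea above amounts to best-of-N selection: sample several candidate action proposals, score each with a judge model, and execute the highest-scoring one. A minimal sketch, where `judge` is a toy stand-in for the learned judge model and the candidate strings are hypothetical:

```python
def select_action(candidates, judge):
    """Best-of-N selection: score every candidate proposal with the judge
    and return the one with the highest score."""
    return max(candidates, key=judge)

# Toy judge: prefers the most concise actionable proposal
# (a real judge would be a model scoring each proposal in context).
judge = lambda action: -len(action)

candidates = [
    "click(12, 40)",
    "click(12, 40) and wait for the menu",
    "scroll down, then click(12, 40)",
]
best = select_action(candidates, judge)  # candidates can be sampled concurrently
```

Because the candidates are independent, sampling them concurrently trades extra compute for better per-step decisions without adding wall-clock steps to the task.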

Project Resources

  • Paper: https://arxiv.org/abs/2507.05791

Introduction

Reinforcement learning (RL) (e.g., GRPO) helps with grounding because of its inherent objective alignment—rewarding successful clicks—rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT reasoning, GRPO directly incentivizes actionable and grounded responses. Based on findings from our blog, we share state-of-the-art GUI grounding models trained using GRPO.
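The objective alignment is easy to see in miniature: the reward is simply whether a predicted click lands inside the target element, and GRPO-style training normalizes rewards within a sampled group instead of using a value network. A minimal sketch of these two pieces (illustrative only, not the training code):

```python
def click_reward(pred, bbox):
    """Binary grounding reward: 1.0 if the predicted (x, y) click lands inside
    the target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = bbox
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def group_advantages(rewards):
    """GRPO-style advantage: each sample's reward minus the group mean,
    so no separate value network is needed."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Four sampled clicks for the same instruction; target box is (100, 50, 200, 90).
clicks = [(150, 70), (10, 10), (199, 89), (300, 60)]
rewards = [click_reward(p, (100, 50, 200, 90)) for p in clicks]
advantages = group_advantages(rewards)  # successful clicks get positive advantage
```

Samples that hit the element are pushed up and misses pushed down, directly optimizing click accuracy rather than the length or style of intermediate reasoning.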

Performance

We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:

| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpot-Pro | OSWorld-G |
|---|---|---|---|---|---|
| OpenAI CUA | — | ✗ | 87.9 | 23.4 | — |
| Claude 3.7 | — | ✗ | 87.6 | 27.7 | — |
| JEDI-7B | 7B | ✓ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✓ | 90.3 | 47.0 | — |
| UI-TARS | 7B | ✓ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✓ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✓ | — | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✓ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✓ | — | 34.5 | — |
| Qwen2.5-VL-72B-Instruct | 72B | ✓ | 94.0* | 53.3 | 62.2* |
| UI-TARS | 72B | ✓ | 90.3 | 38.1 | — |
| GTA1 (Ours) | 7B | ✓ | 92.4 (∆ +2.7) | 50.1 (∆ +8.1) | 67.7 (∆ +3.5) |
| GTA1 (Ours) | 32B | ✓ | 93.2 (∆ +1.3) | 53.6 (∆ +5.6) | 61.9 (∆ +2.3) |
| GTA1 (Ours) | 72B | ✓ | 94.8 (∆ +0.8) | 58.4 (∆ +5.1) | 66.7 (∆ +4.5) |

Note:

  • Model size is indicated in billions (B) of parameters.
  • A dash (—) denotes results that are currently unavailable.
  • An asterisk (*) denotes our evaluated result.
  • UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct serve as the baselines for our 7B, 32B, and 72B models, respectively.
  • ∆ indicates the improvement of our model over its baseline.

Inference

Below is a code snippet demonstrating how to run inference using a trained model.

```python
from PIL import Image
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import re

SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.

Output the coordinate pair exactly:
(x,y)
'''
SYSTEM_PROMPT = SYSTEM_PROMPT.strip()

# Extract the first "(x,y)" pair from the model output; fall back to (0, 0)
def extract_coordinates(raw_string):
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        return tuple(int(float(v)) for v in matches[0])
    except (IndexError, ValueError):
        return 0, 0

# Load model and processor
model_path = "HelloKKMe/GTA1-72B"
max_new_tokens = 32

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=3136,
    max_pixels=4096 * 2160
)

# Load the screenshot and resize it to dimensions the vision encoder accepts
image = Image.open("file path")  # replace with the path to your screenshot
instruction = "description"      # replace with the element description to ground
width, height = image.width, image.height

resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
    min_pixels=processor.image_processor.min_pixels,
    max_pixels=processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height

# Prepare system and user messages
system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width)
}

user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": instruction}
    ]
}

# Tokenize and prepare inputs
image_inputs, video_inputs = process_vision_info([system_message, user_message])
text = processor.apply_chat_template([system_message, user_message], tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate prediction (greedy decoding)
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, use_cache=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

# Extract coordinates and rescale them back to the original image resolution
pred_x, pred_y = extract_coordinates(output_text)
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)
```
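Because the model predicts coordinates on the resized image, the final rescaling step maps them back to the original screenshot's pixel space. A quick sanity check of that arithmetic with hypothetical numbers:

```python
# Hypothetical case: a 3840x2160 screenshot resized to 1920x1080 for the model.
orig_w, orig_h = 3840, 2160
resized_w, resized_h = 1920, 1080
scale_x, scale_y = orig_w / resized_w, orig_h / resized_h  # 2.0, 2.0

# Suppose the model predicts (640, 360) in resized-image coordinates;
# multiplying by the scale factors recovers the original pixel position.
pred = (640 * scale_x, 360 * scale_y)
```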

Refer to our code for more details.

Agent Performance

Refer to an inference example here.

Contact

Please contact yan.yang@anu.edu.au for any queries.

Acknowledgement

We thank the open-source projects: VLM-R1, Jedi, and Agent-S2.

Citation

If you use this repository or find it helpful in your research, please cite it as follows:

@misc{yang2025gta1guitesttimescaling,
      title={GTA1: GUI Test-time Scaling Agent},
      author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Caiming Xiong and Junnan Li},
      year={2025},
      eprint={2507.05791},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.05791},
}