
Enhance model card for UI-AGILE: Add metadata, links, abstract, and usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +161 -3
README.md CHANGED
---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

# UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

<div align="center">

[[📖 Paper](https://huggingface.co/papers/2507.22025)] [[🚀 GitHub](https://github.com/KDEGroup/UI-AGILE)] [[🌐 Homepage](https://osatlas.github.io/)]

</div>

## Abstract
The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from dilemmas in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, which enhances GUI agents at both training and inference time. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a continuous reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a cropping-based resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present decomposed grounding with selection, which dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art grounding performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2, while also exhibiting strong general agent capabilities. For instance, using both our training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro. The code is available at https://github.com/KDEGroup/UI-AGILE.
## Overview

UI-AGILE enhances GUI agents through improved training, using a continuous reward function, a "Simple Thinking" reward, and **cropping-based resampling**, and through inference-time **decomposed grounding with selection**.

<img src="https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494" alt="UI-AGILE Overview">

Trained on only about **9k** samples for just **2 epochs**, UI-AGILE delivers superior grounding performance while also showcasing strong general agent capabilities. Furthermore, our inference method can act as a **plug-and-play enhancement** for a wide range of agents, improving the accuracy of several existing open-source models.

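To make the decomposed-grounding idea concrete, here is a minimal, hypothetical sketch (the tile sizes and helper names are our illustration, not the repository's implementation): the screenshot is covered by a grid of crop boxes, each tile can be grounded independently, and the selected tile-local prediction is mapped back to full-image coordinates.

```python
def tile_boxes(width, height, tile_w=1344, tile_h=896):
    """Compute crop boxes covering a screenshot.

    Tile sizes here are illustrative; each box would be cropped out
    (e.g. with PIL's Image.crop) and grounded separately.
    """
    boxes = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # clamp the last row/column to the image border
            boxes.append((left, top, min(left + tile_w, width), min(top + tile_h, height)))
    return boxes

def to_global(point, box):
    """Map a tile-local prediction back to full-image coordinates."""
    return (box[0] + point[0], box[1] + point[1])

boxes = tile_boxes(3840, 2160)
print(len(boxes))  # 9 tiles (3 x 3 grid) for a 4K screenshot
```

A selection step (e.g. a VLM judging which tile's candidate matches the instruction) then picks one result among the tiles.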
As a baseline, the standard grounding approach applied to UI-AGILE-7B completes the benchmark in **30 minutes**. With our method, the decomposed grounding stage takes **35 minutes**, and the subsequent VLM-based selection stage requires an additional **4 minutes**. This modest increase in overhead is a practical trade-off for the substantial gain in grounding accuracy brought by our method.
## Quick Start

This section explains how to run inference with our pre-trained grounding models. The models accept images of any size as input and output coordinates normalized to a 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.

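Since outputs are normalized to 0-1000, mapping them back to pixels is a small post-processing step. A minimal helper (our illustration, not part of the released code) that handles both points and boxes:

```python
def denormalize(coords, width, height):
    """Map model outputs in the 0-1000 range back to pixel coordinates.

    Works for a point (x, y) or a box (x1, y1, x2, y2): x-values are
    scaled by the image width, y-values by the image height.
    """
    scale = [width, height] * (len(coords) // 2)
    return [round(c * s / 1000) for c, s in zip(coords, scale)]

# a point predicted on a 1920x1080 screenshot
print(denormalize([500, 500], 1920, 1080))        # [960, 540]
# a bounding box on the same screenshot
print(denormalize([100, 200, 300, 400], 1920, 1080))  # [192, 216, 576, 432]
```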
### Using UI-AGILE

First, install the `transformers` library:

```bash
pip install transformers
```

For additional dependencies, please refer to the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

Inference code example:
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (i x j) with min_num <= i * j <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image and split it into image_size x image_size tiles
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you want to load a model using multiple GPUs, please refer to the `Multiple GPUs` section in the GitHub repo.
path = 'KDEGroup/UI-AGILE'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = ("In the screenshot of this web page, please give me the coordinates of the element "
            "I want to click on according to my instructions (with point).\n\"'Champions League' link\"")
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
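The exact format of the response string can vary between checkpoints (e.g. `(432, 198)` for a point or `[x1, y1, x2, y2]` for a box), so downstream code typically needs a tolerant parser. A heuristic sketch (this regex-based helper is our assumption, not the official parser):

```python
import re

def extract_coords(response):
    """Pull the first point or box out of a model response string.

    Grabs the leading run of numbers: four or more numbers are treated
    as a bounding box, two as a center point. This is a heuristic.
    """
    nums = [int(float(n)) for n in re.findall(r"-?\d+(?:\.\d+)?", response)]
    if len(nums) >= 4:
        return ("box", nums[:4])
    if len(nums) >= 2:
        return ("point", nums[:2])
    return (None, [])

print(extract_coords("The element is at (432, 198)."))  # ('point', [432, 198])
```

The extracted values would then still need to be denormalized from the 0-1000 range to the original image size.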
## Citation

If you find this project useful, please consider citing us:
```bibtex
@misc{lian2025uiagileadvancingguiagents,
      title={UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding},
      author={Shuquan Lian and Yuhang Wu and Jia Ma and Zihan Song and Bingqi Chen and Xiawu Zheng and Hui Li},
      year={2025},
      eprint={2507.22025},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.22025},
}
```
## Acknowledgements

We sincerely thank the projects [R1-V](https://github.com/Deep-Agent/R1-V), [Open-R1](https://github.com/huggingface/open-r1), [Open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal), and [VLM-R1](https://github.com/om-ai-lab/VLM-R1) for providing their open-source resources.