scyr committed on
Commit
d57b3b2
·
verified ·
1 Parent(s): fb151a0

Update README.md

Files changed (1)
  1. README.md +311 -1
README.md CHANGED
@@ -13,4 +13,314 @@ tags:
13
  - GUI
14
  - GUI-Grounding
15
  - Vision-language
16
- ---
+ ---
17
+
18
+ <p align="center">
19
+ <img src="images/logo.png"/>
20
+ </p>
21
+
22
+ <p align="center">
23
+ <a href="https://huggingface.co/tencent/POINTS-GUI-G">
24
+ <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
25
+ </a>
26
+ <a href="https://github.com/Tencent/POINTS-GUI">
27
+ <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
28
+ </a>
29
+ <a href="coming soon">
30
+ <img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
31
+ </a>
32
+ <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
33
+ <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
34
+ </a>
35
+ </p>
36
+
37
+ ## News
38
+
39
+ - 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned!
40
+ - 🚀 2026.02.06: We are pleased to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.
41
+
42
+ ## Introduction
43
+
44
+ 1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
45
+
46
+ 2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5, which initially lacked native grounding ability. We cover the complete technical pipeline ourselves, showing that a GUI specialist can be built from a general-purpose base model through targeted optimization.
47
+
48
+ 3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems and task formats, and they contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.
49
+
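The coordinate standardization in step (1) can be illustrated with a minimal sketch. The helper names `normalize_box` and `box_center`, the rounding to three decimals, and the clamping behavior are assumptions made for illustration; the actual data pipeline is not described in this README.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def normalize_box(box: Box, width: float, height: float) -> Box:
    """Map an (x0, y0, x1, y1) box from a width x height frame to [0, 1]."""
    if width <= 0 or height <= 0:
        raise ValueError('width and height must be positive')
    x0, y0, x1, y1 = box
    # Order the corners so (x0, y0) is the top-left one.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))

    def to_unit(value: float, extent: float) -> float:
        # Scale to [0, 1], clamp values that fall outside the frame, round to 3 places.
        return round(min(max(value / extent, 0.0), 1.0), 3)

    return (to_unit(x0, width), to_unit(y0, height),
            to_unit(x1, width), to_unit(y1, height))


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point of a normalized box."""
    x0, y0, x1, y1 = box
    return (round((x0 + x1) / 2, 3), round((y0 + y1) / 2, 3))


# A pixel-space annotation on a 1920x1080 screenshot.
pixel_box = normalize_box((480, 270, 1440, 810), 1920, 1080)
# An annotation from a (hypothetical) dataset that uses a 0-1000 grid.
grid_box = normalize_box((100, 200, 300, 400), 1000, 1000)
print(pixel_box, box_center(pixel_box))
print(grid_box)
```

Both source formats end up in the same [0, 1] frame, so a single "locate UI element" target can be derived from either.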
50
+ ## Results
51
+
52
+ We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.
53
+
54
+ ![Benchmark results](images/results.png)
55
+
56
+ ## Examples
57
+
58
+ ### Prediction on desktop screenshots
59
+
60
+ ![Desktop example 1](images/example_desktop_1.png)
61
+ ![Desktop example 2](images/example_desktop_2.png)
62
+ ![Desktop example 3](images/example_desktop_3.png)
63
+
64
+ ### Prediction on mobile screenshots
65
+
66
+ ![Mobile example](images/example_mobile.png)
67
+
68
+ ### Prediction on web screenshots
69
+
70
+ ![Web example 1](images/example_web_1.png)
71
+ ![Web example 2](images/example_web_2.png)
72
+ ![Web example 3](images/example_web_3.png)
73
+
74
+ ## Getting Started
75
+
76
+ The following code snippets have been tested with the following environment:
77
+
78
+ ```
79
+ python==3.12.11
80
+ torch==2.9.1
81
+ transformers==4.57.1
82
+ cuda==12.6
83
+ ```
84
+
85
+ ### Run with Transformers
86
+
87
+ First install [WePOINTS](https://github.com/WePOINTS/WePOINTS) with the following commands:
88
+
89
+ ```sh
90
+ git clone https://github.com/WePOINTS/WePOINTS.git
91
+ cd ./WePOINTS
92
+ pip install -e .
93
+ ```
94
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
+ import torch
+
+ # System prompt for point prediction: the model returns the element center (x, y).
+ system_prompt_point = (
+     'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
+     'Requirements for the output:\n'
+     '- Return only the point (x, y) representing the center of the target element\n'
+     '- Coordinates must be normalized to the range [0, 1]\n'
+     '- Round each coordinate to three decimal places\n'
+     '- Format the output as strictly (x, y) without any additional text\n'
+ )
+ # System prompt for bounding-box prediction: the model returns (x0, y0, x1, y1).
+ system_prompt_bbox = (
+     'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
+     'Requirements for the output:\n'
+     '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
+     '- Coordinates must be normalized to the range [0, 1]\n'
+     '- Round each coordinate to three decimal places\n'
+     '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
+ )
+ system_prompt = system_prompt_point  # or system_prompt_bbox
+ user_prompt = None  # replace with your instruction (e.g., 'close the window')
+ image_path = '/path/to/your/local/image'
+ model_path = 'tencent/POINTS-GUI-G'
+
+ # Load the model, tokenizer, and image processor from the same checkpoint.
+ model = AutoModelForCausalLM.from_pretrained(model_path,
+                                              trust_remote_code=True,
+                                              dtype=torch.bfloat16,
+                                              device_map='cuda')
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
+
+ # The user turn carries the screenshot followed by the instruction.
+ content = [
+     dict(type='image', image=image_path),
+     dict(type='text', text=user_prompt)
+ ]
+ messages = [
+     {
+         'role': 'system',
+         'content': [dict(type='text', text=system_prompt)]
+     },
+     {
+         'role': 'user',
+         'content': content
+     }
+ ]
+ # Greedy decoding keeps the predicted coordinates deterministic.
+ generation_config = {
+     'max_new_tokens': 2048,
+     'do_sample': False
+ }
+ response = model.chat(
+     messages,
+     tokenizer,
+     image_processor,
+     generation_config
+ )
+ print(response)
+ ```
151
+
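Because the system prompts ask for coordinates normalized to [0, 1], the response usually has to be mapped back to pixels before it can drive a click. The following is a minimal sketch of that step, assuming the response is exactly the `(x, y)` or `(x0, y0, x1, y1)` string requested by the prompts. The helper names `parse_normalized`, `point_to_pixels`, and `bbox_to_pixels` are hypothetical and not part of WePOINTS.

```python
import re
from typing import Tuple

# Matches plain decimal numbers such as 0.512, 1, or .25.
_NUMBER = re.compile(r'[-+]?\d*\.?\d+')


def parse_normalized(response: str, expected: int) -> Tuple[float, ...]:
    """Extract `expected` numbers in [0, 1] from a model response."""
    values = tuple(float(v) for v in _NUMBER.findall(response))
    if len(values) != expected:
        raise ValueError(f'expected {expected} numbers, got {len(values)}: {response!r}')
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError(f'coordinates outside [0, 1]: {response!r}')
    return values


def _to_pixel(value: float, extent: int) -> int:
    # Scale to the image size and keep the result inside the valid index range.
    return min(round(value * extent), extent - 1)


def point_to_pixels(response: str, width: int, height: int) -> Tuple[int, int]:
    """Convert an '(x, y)' response to pixel coordinates."""
    x, y = parse_normalized(response, 2)
    return (_to_pixel(x, width), _to_pixel(y, height))


def bbox_to_pixels(response: str, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert an '(x0, y0, x1, y1)' response to pixel coordinates."""
    x0, y0, x1, y1 = parse_normalized(response, 4)
    return (_to_pixel(x0, width), _to_pixel(y0, height),
            _to_pixel(x1, width), _to_pixel(y1, height))


print(point_to_pixels('(0.512, 0.348)', 1920, 1080))
print(bbox_to_pixels('(0.1, 0.2, 0.3, 0.4)', 1000, 500))
```

The width and height should be those of the original screenshot, since the normalized coordinates are independent of any resizing done by the image processor.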
152
+ ### Deploy with SGLang
153
+
154
+ We have opened a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) to add support to SGLang. Until that PR is merged, you can check out its branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).
155
+
156
+ #### How to Deploy
157
+
158
+ You can deploy POINTS-GUI-G with SGLang using the following command:
159
+
+ ```sh
+ python3 -m sglang.launch_server \
+     --model-path tencent/POINTS-GUI-G \
+     --tp-size 1 \
+     --dp-size 1 \
+     --chunked-prefill-size -1 \
+     --mem-fraction-static 0.7 \
+     --chat-template qwen2-vl \
+     --trust-remote-code \
+     --port 8081
+ ```
171
+
172
+ #### How to Use
173
+
174
+ You can use the following code to obtain results from SGLang:
175
+
+ ```python
+ from typing import List
+ import requests
+ import json
+
+
+ def call_wepoints(messages: List[dict],
+                   temperature: float = 0.0,
+                   max_new_tokens: int = 2048,
+                   repetition_penalty: float = 1.05,
+                   top_p: float = 0.8,
+                   top_k: int = 20,
+                   do_sample: bool = True,
+                   url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
+     """Query WePOINTS model to generate a response.
+
+     Args:
+         messages (List[dict]): A list of messages to be sent to WePOINTS. The
+             messages should be the standard OpenAI messages, like:
+             [
+                 {
+                     'role': 'user',
+                     'content': [
+                         {
+                             'type': 'text',
+                             'text': 'Please describe this image in short'
+                         },
+                         {
+                             'type': 'image_url',
+                             'image_url': {'url': '/path/to/image.jpg'}
+                         }
+                     ]
+                 }
+             ]
+         temperature (float, optional): The temperature of the model.
+             Defaults to 0.0.
+         max_new_tokens (int, optional): The maximum number of new tokens to generate.
+             Defaults to 2048.
+         repetition_penalty (float, optional): The penalty for repetition.
+             Defaults to 1.05.
+         top_p (float, optional): The top-p probability threshold.
+             Defaults to 0.8.
+         top_k (int, optional): The top-k sampling vocabulary size.
+             Defaults to 20.
+         do_sample (bool, optional): Whether to use sampling or greedy decoding.
+             Defaults to True.
+         url (str, optional): The URL of the WePOINTS model.
+             Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.
+
+     Returns:
+         str: The generated response from WePOINTS.
+     """
+     # Request payload in the OpenAI chat-completions style.
+     data = {
+         'model': 'WePoints',
+         'messages': messages,
+         'max_new_tokens': max_new_tokens,
+         'temperature': temperature,
+         'repetition_penalty': repetition_penalty,
+         'top_p': top_p,
+         'top_k': top_k,
+         'do_sample': do_sample,
+     }
+     response = requests.post(url,
+                              json=data)
+     response = json.loads(response.text)
+     # The generated text is in the first choice of the completion.
+     response = response['choices'][0]['message']['content']
+     return response
+
+
+ system_prompt_point = (
+     'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
+     'Requirements for the output:\n'
+     '- Return only the point (x, y) representing the center of the target element\n'
+     '- Coordinates must be normalized to the range [0, 1]\n'
+     '- Round each coordinate to three decimal places\n'
+     '- Format the output as strictly (x, y) without any additional text\n'
+ )
+ system_prompt_bbox = (
+     'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
+     'Requirements for the output:\n'
+     '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
+     '- Coordinates must be normalized to the range [0, 1]\n'
+     '- Round each coordinate to three decimal places\n'
+     '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
+ )
+ system_prompt = system_prompt_point  # or system_prompt_bbox
+ user_prompt = None  # replace with your instruction (e.g., 'close the window')
+
+ messages = [
+     {
+         'role': 'system',
+         'content': [
+             {
+                 'type': 'text',
+                 'text': system_prompt
+             }
+         ]
+     },
+     {
+         'role': 'user',
+         'content': [
+             {
+                 'type': 'image_url',
+                 'image_url': {'url': '/path/to/image.jpg'}
+             },
+             {
+                 'type': 'text',
+                 'text': user_prompt
+             }
+         ]
+     }
+ ]
+ response = call_wepoints(messages)
+ print(response)
+ ```
292
+
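The example above passes a file path in `image_url`, which only works if the server can read that path. When the client and server run on different machines, OpenAI-style chat APIs commonly accept a base64 data URL instead. The sketch below builds one. The helper names `image_to_data_url` and `build_image_content` are hypothetical, and whether the SGLang build from the PR accepts data URLs for this model is an assumption that should be verified.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    # Guess the MIME type from the file extension (e.g. image/png).
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith('image/'):
        raise ValueError(f'cannot determine an image MIME type for {image_path!r}')
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


def build_image_content(image_path: str) -> dict:
    """Build an OpenAI-style image_url content item from a local file."""
    return {
        'type': 'image_url',
        'image_url': {'url': image_to_data_url(image_path)}
    }
```

The returned dictionary can replace the `image_url` item in the user message, for example `build_image_content('/path/to/image.jpg')`.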
293
+ ## Citation
294
+
295
+ If you use this model in your work, please cite the following papers:
296
+
297
+ ```
298
+ @inproceedings{liu2025points,
299
+ title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
300
+ author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
301
+ booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
302
+ pages={1576--1601},
303
+ year={2025}
304
+ }
305
+
306
+ @article{liu2024points1,
307
+ title={POINTS1.5: Building a Vision-Language Model towards Real World Applications},
308
+ author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
309
+ journal={arXiv preprint arXiv:2412.08443},
310
+ year={2024}
311
+ }
312
+
313
+ @article{liu2024points,
314
+ title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
315
+ author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
316
+ journal={arXiv preprint arXiv:2409.04828},
317
+ year={2024}
318
+ }
319
+
320
+ @article{liu2024rethinking,
321
+ title={Rethinking Overlooked Aspects in Vision-Language Models},
322
+ author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
323
+ journal={arXiv preprint arXiv:2405.11850},
324
+ year={2024}
325
+ }
326
+ ```