nielsr (HF Staff) committed
Commit 626e819 · verified · 1 Parent(s): c983452

Add library_name and pipeline_tag to metadata

Hi! I'm Niels from the community science team at Hugging Face.

This PR improves the metadata of your model card by adding the `library_name: transformers` and the `image-text-to-text` pipeline tag. These additions will enable the "Use in Transformers" button on the model page and help users find the model more easily through task-based filtering.

I've also kept the existing usage examples and benchmark results from your README.

Files changed (1)
  1. README.md +38 -209
README.md CHANGED
@@ -1,14 +1,16 @@
1
  ---
2
- license: other
 
 
 
3
  language:
4
  - en
5
  - zh
 
6
  metrics:
7
  - accuracy
8
- base_model:
9
- - Qwen/Qwen3-8B-Base
10
- - tencent/POINTS-Reader
11
- - WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
12
  tags:
13
  - GUI
14
  - GUI-Grounding
@@ -42,11 +44,13 @@ tags:
42
 
43
  ## Introduction
44
 
 
 
45
  1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
46
 
47
- 2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5 (which initially lacked native grounding ability). We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.
48
 
49
- 3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems, task formats, and contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases
50
 
51
  ## Results
52
 
@@ -54,35 +58,8 @@ We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: Screen
54
 
55
  ![Example 1](images/results.png)
56
 
57
- ## Examples
58
-
59
- ### Prediction on desktop screenshots
60
-
61
- ![Example 1](images/example_desktop_1.png)
62
- ![Example 1](images/example_desktop_2.png)
63
- ![Example 1](images/example_desktop_3.png)
64
-
65
- ### Prediction on mobile screenshots
66
-
67
- ![Example 1](images/example_mobile.png)
68
-
69
- ### Prediction on web screenshots
70
-
71
- ![Example 1](images/example_web_1.png)
72
- ![Example 1](images/example_web_2.png)
73
- ![Example 1](images/example_web_3.png)
74
-
75
  ## Getting Started
76
 
77
- This following code snippet has been tested with following environment:
78
-
79
- ```
80
- python==3.12.11
81
- torch==2.9.1
82
- transformers==4.57.1
83
- cuda==12.6
84
- ```
85
-
86
  ### Run with Transformers
87
 
88
  Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:
@@ -98,23 +75,37 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProces
98
  import torch
99
 
100
  system_prompt_point = (
101
- 'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
102
- 'Requirements for the output:\n'
103
- '- Return only the point (x, y) representing the center of the target element\n'
104
- '- Coordinates must be normalized to the range [0, 1]\n'
105
- '- Round each coordinate to three decimal places\n'
106
- '- Format the output as strictly (x, y) without any additional text\n'
107
  )
108
  system_prompt_bbox = (
109
- 'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
110
- 'Requirements for the output:\n'
111
- '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
112
- '- Coordinates must be normalized to the range [0, 1]\n'
113
- '- Round each coordinate to three decimal places\n'
114
- '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
115
  )
116
  system_prompt = system_prompt_point # system_prompt_bbox
117
- user_prompt = None # replace with your instruction (e.g., 'close the window')
118
  image_path = '/path/to/your/local/image'
119
  model_path = 'tencent/POINTS-GUI-G'
120
  model = AutoModelForCausalLM.from_pretrained(model_path,
@@ -150,147 +141,6 @@ response = model.chat(
150
  print(response)
151
  ```
152
 
153
- ### Deploy with SGLang
154
-
155
- We have created a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. You can check out this branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html) prior to the merging of this PR.
156
-
157
- #### How to Deploy
158
-
159
- You can deploy POINTS-GUI-G with SGLang using the following command:
160
-
161
- ```
162
- python3 -m sglang.launch_server \
163
- --model-path tencent/POINTS-GUI-G \
164
- --tp-size 1 \
165
- --dp-size 1 \
166
- --chunked-prefill-size -1 \
167
- --mem-fraction-static 0.7 \
168
- --chat-template qwen2-vl \
169
- --trust-remote-code \
170
- --port 8081
171
- ```
172
-
173
- #### How to Use
174
-
175
- You can use the following code to obtain results from SGLang:
176
-
177
- ```python
178
-
179
- from typing import List
180
- import requests
181
- import json
182
-
183
-
184
-
185
- def call_wepoints(messages: List[dict],
186
- temperature: float = 0.0,
187
- max_new_tokens: int = 2048,
188
- repetition_penalty: float = 1.05,
189
- top_p: float = 0.8,
190
- top_k: int = 20,
191
- do_sample: bool = True,
192
- url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
193
- """Query WePOINTS model to generate a response.
194
-
195
- Args:
196
- messages (List[dict]): A list of messages to be sent to WePOINTS. The
197
- messages should be the standard OpenAI messages, like:
198
- [
199
- {
200
- 'role': 'user',
201
- 'content': [
202
- {
203
- 'type': 'text',
204
- 'text': 'Please describe this image in short'
205
- },
206
- {
207
- 'type': 'image_url',
208
- 'image_url': {'url': /path/to/image.jpg}
209
- }
210
- ]
211
- }
212
- ]
213
- temperature (float, optional): The temperature of the model.
214
- Defaults to 0.0.
215
- max_new_tokens (int, optional): The maximum number of new tokens to generate.
216
- Defaults to 2048.
217
- repetition_penalty (float, optional): The penalty for repetition.
218
- Defaults to 1.05.
219
- top_p (float, optional): The top-p probability threshold.
220
- Defaults to 0.8.
221
- top_k (int, optional): The top-k sampling vocabulary size.
222
- Defaults to 20.
223
- do_sample (bool, optional): Whether to use sampling or greedy decoding.
224
- Defaults to True.
225
- url (str, optional): The URL of the WePOINTS model.
226
- Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.
227
-
228
- Returns:
229
- str: The generated response from WePOINTS.
230
- """
231
- data = {
232
- 'model': 'WePoints',
233
- 'messages': messages,
234
- 'max_new_tokens': max_new_tokens,
235
- 'temperature': temperature,
236
- 'repetition_penalty': repetition_penalty,
237
- 'top_p': top_p,
238
- 'top_k': top_k,
239
- 'do_sample': do_sample,
240
- }
241
- response = requests.post(url,
242
- json=data)
243
- response = json.loads(response.text)
244
- response = response['choices'][0]['message']['content']
245
- return response
246
-
247
- system_prompt_point = (
248
- 'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
249
- 'Requirements for the output:\n'
250
- '- Return only the point (x, y) representing the center of the target element\n'
251
- '- Coordinates must be normalized to the range [0, 1]\n'
252
- '- Round each coordinate to three decimal places\n'
253
- '- Format the output as strictly (x, y) without any additional text\n'
254
- )
255
- system_prompt_bbox = (
256
- 'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
257
- 'Requirements for the output:\n'
258
- '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
259
- '- Coordinates must be normalized to the range [0, 1]\n'
260
- '- Round each coordinate to three decimal places\n'
261
- '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
262
- )
263
- system_prompt = system_prompt_point # system_prompt_bbox
264
- user_prompt = None # replace with your instruction (e.g., 'close the window')
265
-
266
- messages = [
267
- {
268
- 'role': 'system',
269
- 'content': [
270
- {
271
- 'type': 'text',
272
- 'text': system_prompt
273
- }
274
- ]
275
- },
276
- {
277
- 'role': 'user',
278
- 'content': [
279
- {
280
- 'type': 'image_url',
281
- 'image_url': {'url': '/path/to/image.jpg'}
282
- },
283
- {
284
- 'type': 'text',
285
- 'text': user_prompt
286
- }
287
- ]
288
- }
289
- ]
290
- response = call_wepoints(messages)
291
- print(response)
292
- ```
293
-
294
  ## Citation
295
 
296
  If you use this model in your work, please cite the following paper:
@@ -310,25 +160,4 @@ If you use this model in your work, please cite the following paper:
310
  pages={1576--1601},
311
  year={2025}
312
  }
313
-
314
- @article{liu2024points1,
315
- title={POINTS1. 5: Building a Vision-Language Model towards Real World Applications},
316
- author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
317
- journal={arXiv preprint arXiv:2412.08443},
318
- year={2024}
319
- }
320
-
321
- @article{liu2024points,
322
- title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
323
- author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
324
- journal={arXiv preprint arXiv:2409.04828},
325
- year={2024}
326
- }
327
-
328
- @article{liu2024rethinking,
329
- title={Rethinking Overlooked Aspects in Vision-Language Models},
330
- author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
331
- journal={arXiv preprint arXiv:2405.11850},
332
- year={2024}
333
- }
334
  ```
 
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen3-8B-Base
4
+ - tencent/POINTS-Reader
5
+ - WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
6
  language:
7
  - en
8
  - zh
9
+ license: other
10
  metrics:
11
  - accuracy
12
+ library_name: transformers
13
+ pipeline_tag: image-text-to-text
 
 
14
  tags:
15
  - GUI
16
  - GUI-Grounding
 
44
 
45
  ## Introduction
46
 
47
+ POINTS-GUI-G-8B is a specialized GUI Grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).
48
+
49
  1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
50
 
51
+ 2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5. We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.
52
 
53
+ 3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.
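
The coordinate standardization and noise filtering described in step (1) and step (2) can be illustrated with a minimal sketch. The README does not show the actual pipeline, so the function names (`normalize_bbox`, `is_plausible`) and the `min_side` threshold below are hypothetical and only indicate the general idea of mapping pixel-space boxes to a [0, 1] range and discarding degenerate annotations.

```python
def normalize_bbox(bbox, width, height):
    """Map a pixel-space (x0, y0, x1, y1) box to [0, 1], rounded to 3 decimals.

    Hypothetical helper: the real POINTS-GUI-G pipeline is not published here.
    """
    if width <= 0 or height <= 0:
        raise ValueError('image size must be positive')
    x0, y0, x1, y1 = bbox
    # Tolerate annotations that list the corners in the wrong order.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    scaled = (x0 / width, y0 / height, x1 / width, y1 / height)
    # Clamp boxes that spill slightly outside the screenshot.
    return tuple(round(min(max(v, 0.0), 1.0), 3) for v in scaled)


def is_plausible(box, min_side=0.002):
    """Reject degenerate boxes; `min_side` is an assumed, untuned threshold."""
    x0, y0, x1, y1 = box
    return (x1 - x0) >= min_side and (y1 - y0) >= min_side


box = normalize_bbox((192, 108, 960, 540), width=1920, height=1080)
print(box, is_plausible(box))
```

Normalizing every source dataset this way lets annotations that were recorded in different coordinate systems share one "locate UI element" target format.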
54
 
55
  ## Results
56
 
 
58
 
59
  ![Example 1](images/results.png)
60
 
61
  ## Getting Started
62
 
63
  ### Run with Transformers
64
 
65
  Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:
 
  import torch
 
  system_prompt_point = (
+ 'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
+ 'Requirements for the output:\n'
+ '- Return only the point (x, y) representing the center of the target element\n'
+ '- Coordinates must be normalized to the range [0, 1]\n'
+ '- Round each coordinate to three decimal places\n'
+ '- Format the output as strictly (x, y) without any additional text\n'
  )
  system_prompt_bbox = (
+ 'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
+ 'Requirements for the output:\n'
+ '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
+ '- Coordinates must be normalized to the range [0, 1]\n'
+ '- Round each coordinate to three decimal places\n'
+ '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
  )
  system_prompt = system_prompt_point # system_prompt_bbox
+ user_prompt = "Click the 'Login' button" # replace with your instruction
  image_path = '/path/to/your/local/image'
  model_path = 'tencent/POINTS-GUI-G'
  model = AutoModelForCausalLM.from_pretrained(model_path,
 
141
  print(response)
142
  ```
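
Because the system prompts ask for coordinates normalized to [0, 1], the printed response usually needs to be mapped back to pixels before it can drive a click. The following is a minimal sketch of that post-processing step. It assumes the reply follows the requested `(x, y)` or `(x0, y0, x1, y1)` format. `parse_coords` and `to_pixels` are hypothetical helpers, not part of WePOINTS or Transformers.

```python
import re

# Matches plain decimal numbers such as 0.412 or 1.
_NUMBER = r'[-+]?\d*\.?\d+'


def parse_coords(response: str, expected: int) -> tuple[float, ...]:
    """Extract `expected` normalized floats from a reply like '(0.412, 0.087)'."""
    values = [float(v) for v in re.findall(_NUMBER, response)]
    if len(values) != expected:
        raise ValueError(f'expected {expected} numbers, got {len(values)}: {response!r}')
    if not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError(f'coordinates outside [0, 1]: {values}')
    return tuple(values)


def to_pixels(coords, width, height):
    """Scale alternating x/y values by the screenshot width/height."""
    return tuple(
        round(v * (width if i % 2 == 0 else height))
        for i, v in enumerate(coords)
    )


# Example with a made-up reply; pass expected=4 when using system_prompt_bbox.
point = parse_coords('(0.412, 0.087)', expected=2)
print(to_pixels(point, width=1920, height=1080))
```

Validating the count and range up front makes malformed replies fail loudly instead of producing a click at an arbitrary location.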
143
 
144
  ## Citation
145
 
146
  If you use this model in your work, please cite the following paper:
 
160
  pages={1576--1601},
161
  year={2025}
162
  }
163
  ```