---
license: other
language:
- en
- zh
metrics:
- accuracy
base_model:
- Qwen/Qwen3-8B-Base
- tencent/POINTS-Reader
- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
tags:
- GUI
- GUI-Grounding
- Vision-language
- multimodal
---

<p align="center">
    <img src="images/logo.png"/>
</p>

<p align="center">
  <a href="https://huggingface.co/tencent/POINTS-GUI-G">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  <a href="https://github.com/Tencent/POINTS-GUI">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
  </a>
  <a href="https://huggingface.co/papers/2602.06391">
    <img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
  </a>
  <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
    <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
  </a>
</p>

## News

- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.

## Introduction

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up on POINTS-1.5, which initially lacked native grounding ability. We have mastered the complete technical pipeline, proving that a GUI grounding specialist can be built from a general-purpose base model through targeted optimization.

3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems and task formats, and contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases. A minimal sketch of step (1) follows below.
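
The sketch below is purely illustrative; `normalize_bbox` and the example numbers are hypothetical, not the actual pipeline code:

```python
# Hypothetical illustration of step (1): mapping pixel-space annotations
# onto the [0, 1] coordinate convention used by POINTS-GUI-G.

def normalize_bbox(raw_box, image_width, image_height):
    """Map a pixel-space (x0, y0, x1, y1) box to normalized [0, 1] coordinates."""
    x0, y0, x1, y1 = raw_box
    return (round(x0 / image_width, 3),
            round(y0 / image_height, 3),
            round(x1 / image_width, 3),
            round(y1 / image_height, 3))


# e.g. a button at pixels (96, 54, 288, 108) on a 1920x1080 screenshot
print(normalize_bbox((96, 54, 288, 108), 1920, 1080))
# -> (0.05, 0.05, 0.15, 0.1)
```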

## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

![Results](images/results.png)

## Examples

### Prediction on desktop screenshots

![Desktop example 1](images/example_desktop_1.png)
![Desktop example 2](images/example_desktop_2.png)
![Desktop example 3](images/example_desktop_3.png)

### Prediction on mobile screenshots

![Mobile example](images/example_mobile.png)

### Prediction on web screenshots

![Web example 1](images/example_web_1.png)
![Web example 2](images/example_web_2.png)
![Web example 3](images/example_web_3.png)

## Getting Started

The code snippets below have been tested in the following environment:

```
python==3.12.11
torch==2.9.1
transformers==4.57.1
cuda==12.6
```

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```
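
Then run the following snippet (replace `user_prompt` and `image_path` with your own instruction and screenshot):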

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)
system_prompt = system_prompt_point  # system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
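
With `system_prompt_point`, the model is instructed to answer with a normalized point such as `(0.497, 0.312)`. To act on the screenshot (e.g., to issue a click), you still need pixel coordinates; below is a minimal, hypothetical parser that assumes the model followed the prompt's output format:

```python
import re


def point_to_pixels(response: str, image_width: int, image_height: int):
    """Parse a normalized '(x, y)' answer and scale it to pixel coordinates.

    Returns None if the response does not match the expected format.
    """
    match = re.search(r'\(\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\)', response)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    return int(x * image_width), int(y * image_height)


# e.g. on a 1920x1080 screenshot
print(point_to_pixels('(0.497, 0.312)', 1920, 1080))  # (954, 336)
```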

### Deploy with SGLang

We have created a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. Until this PR is merged, you can check out its branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).

#### How to Deploy

You can deploy POINTS-GUI-G with SGLang using the following command:

```sh
python3 -m sglang.launch_server \
--model-path tencent/POINTS-GUI-G \
--tp-size 1 \
--dp-size 1 \
--chunked-prefill-size -1 \
--mem-fraction-static 0.7 \
--chat-template qwen2-vl \
--trust-remote-code \
--port 8081
```
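
Loading the weights can take a while, so the server is not immediately ready to serve requests. A small readiness poll such as the following can gate your client code; it assumes SGLang's `/health` endpoint is reachable at the port chosen above:

```python
import time

import requests


def wait_for_server(base_url: str = 'http://127.0.0.1:8081',
                    timeout_s: float = 300.0) -> bool:
    """Poll the server's /health endpoint until it responds with 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f'{base_url}/health', timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet; retry
        time.sleep(2)
    return False


if not wait_for_server():
    raise RuntimeError('SGLang server did not become ready in time')
```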

#### How to Use

You can use the following code to obtain results from SGLang:

```python
from typing import List
import requests
import json


def call_wepoints(messages: List[dict],
                  temperature: float = 0.0,
                  max_new_tokens: int = 2048,
                  repetition_penalty: float = 1.05,
                  top_p: float = 0.8,
                  top_k: int = 20,
                  do_sample: bool = True,
                  url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
    """Query WePOINTS model to generate a response.

    Args:
        messages (List[dict]): A list of messages to be sent to WePOINTS. The
            messages should be the standard OpenAI messages, like:
            [
                {
                    'role': 'user',
                    'content': [
                        {
                            'type': 'text',
                            'text': 'Please describe this image in short'
                        },
                        {
                            'type': 'image_url',
                            'image_url': {'url': '/path/to/image.jpg'}
                        }
                    ]
                }
            ]
        temperature (float, optional): The temperature of the model.
            Defaults to 0.0.
        max_new_tokens (int, optional): The maximum number of new tokens to generate.
            Defaults to 2048.
        repetition_penalty (float, optional): The penalty for repetition.
            Defaults to 1.05.
        top_p (float, optional): The top-p probability threshold.
            Defaults to 0.8.
        top_k (int, optional): The top-k sampling vocabulary size.
            Defaults to 20.
        do_sample (bool, optional): Whether to use sampling or greedy decoding.
            Defaults to True.
        url (str, optional): The URL of the WePOINTS model.
            Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.

    Returns:
        str: The generated response from WePOINTS.
    """
    data = {
        'model': 'WePoints',
        'messages': messages,
        'max_new_tokens': max_new_tokens,
        'temperature': temperature,
        'repetition_penalty': repetition_penalty,
        'top_p': top_p,
        'top_k': top_k,
        'do_sample': do_sample,
    }
    response = requests.post(url, json=data)
    result = json.loads(response.text)
    return result['choices'][0]['message']['content']

system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)
system_prompt = system_prompt_point  # system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')

messages = [
    {
        'role': 'system',
        'content': [
            {
                'type': 'text',
                'text': system_prompt
            }
        ]
    },
    {
        'role': 'user',
        'content': [
            {
                'type': 'image_url',
                'image_url': {'url': '/path/to/image.jpg'}
            },
            {
                'type': 'text',
                'text': user_prompt
            }
        ]
    }
]
response = call_wepoints(messages)
print(response)
```
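
Note that a plain local path in `image_url` only works when the SGLang server itself can read that path. If your client runs on a different machine, a common workaround is to inline the screenshot as a base64 data URL; a minimal sketch, assuming a JPEG file:

```python
import base64


def image_to_data_url(image_path: str) -> str:
    """Encode a local JPEG as a base64 data URL for the image_url field."""
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:image/jpeg;base64,{encoded}'


# Then pass it in place of the plain path:
# {'type': 'image_url', 'image_url': {'url': image_to_data_url('/path/to/image.jpg')}}
```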

## Citation

If you use this model in your work, please cite the following papers:

```bibtex
@article{zhao2026pointsguigguigroundingjourney,
  title   = {POINTS-GUI-G: GUI-Grounding Journey},
  author  = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
  journal = {arXiv preprint arXiv:2602.06391},
  year    = {2026}
}

@inproceedings{liu2025points,
  title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Zhou, Xiao and Yu, Yang and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={1576--1601},
  year={2025}
}

@article{liu2024points1,
  title={POINTS1.5: Building a Vision-Language Model towards Real World Applications},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
  journal={arXiv preprint arXiv:2412.08443},
  year={2024}
}

@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```