|
|
--- |
|
|
license: other |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
metrics: |
|
|
- accuracy |
|
|
base_model: |
|
|
- Qwen/Qwen3-8B-Base |
|
|
- tencent/POINTS-Reader |
|
|
- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat |
|
|
tags: |
|
|
- GUI |
|
|
- GUI-Grounding |
|
|
- Vision-language |
|
|
- multimodal |
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<img src="images/logo.png"/> |
|
|
</p>
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://huggingface.co/tencent/POINTS-GUI-G"> |
|
|
<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"> |
|
|
</a> |
|
|
<a href="https://github.com/Tencent/POINTS-GUI"> |
|
|
<img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code"> |
|
|
</a> |
|
|
<a href="https://huggingface.co/papers/2602.06391"> |
|
|
<img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper"> |
|
|
</a> |
|
|
<a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view"> |
|
|
<img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view"> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
## News |
|
|
|
|
|
- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned! |
|
|
- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>. |
|
|
|
|
|
## Introduction |
|
|
|
|
|
1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. |
|
|
|
|
|
2. **Full-Stack Mastery**: Unlike many current GUI agents that build on models that already possess strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up on POINTS-1.5, which initially lacked native grounding ability. We built the complete technical pipeline ourselves, demonstrating that a GUI grounding specialist can be created from a general-purpose base model through targeted optimization.
|
|
|
|
|
3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems and task formats, and often contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to the [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases (see the sketch below).
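
To make the coordinate convention concrete, the sketch below shows how pixel-space annotations map to the normalized [0, 1] format described above. The helper names are ours and this is an illustration only, not the actual pipeline code.

```python
# Minimal sketch (hypothetical helpers, not the actual pipeline code) of the
# coordinate standardization: pixel-space annotations are mapped to the
# [0, 1] range and rounded to three decimal places.
def normalize_bbox(bbox_px, width, height):
    """Convert a pixel-space (x0, y0, x1, y1) box to normalized [0, 1] coordinates."""
    x0, y0, x1, y1 = bbox_px
    return (round(x0 / width, 3), round(y0 / height, 3),
            round(x1 / width, 3), round(y1 / height, 3))


def bbox_center(bbox_norm):
    """Center point (x, y) of a normalized bounding box."""
    x0, y0, x1, y1 = bbox_norm
    return (round((x0 + x1) / 2, 3), round((y0 + y1) / 2, 3))


# Example: a button at (96, 54)-(288, 108) pixels on a 1920x1080 screenshot.
print(normalize_bbox((96, 54, 288, 108), 1920, 1080))  # (0.05, 0.05, 0.15, 0.1)
print(bbox_center((0.05, 0.05, 0.15, 0.1)))            # (0.1, 0.075)
```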
|
|
|
|
|
## Results |
|
|
|
|
|
We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines. |
|
|
|
|
|
 |
|
|
|
|
|
## Examples |
|
|
|
|
|
### Prediction on desktop screenshots |
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
### Prediction on mobile screenshots |
|
|
|
|
|
 |
|
|
|
|
|
### Prediction on web screenshots |
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
## Getting Started |
|
|
|
|
|
The following code snippet has been tested with the environment below:
|
|
|
|
|
``` |
|
|
python==3.12.11 |
|
|
torch==2.9.1 |
|
|
transformers==4.57.1 |
|
|
cuda==12.6 |
|
|
``` |
|
|
|
|
|
### Run with Transformers |
|
|
|
|
|
Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command: |
|
|
|
|
|
```sh |
|
|
git clone https://github.com/WePOINTS/WePOINTS.git |
|
|
cd ./WePOINTS |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor |
|
|
import torch |
|
|
|
|
|
system_prompt_point = ( |
|
|
'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n' |
|
|
'Requirements for the output:\n' |
|
|
'- Return only the point (x, y) representing the center of the target element\n' |
|
|
'- Coordinates must be normalized to the range [0, 1]\n' |
|
|
'- Round each coordinate to three decimal places\n' |
|
|
'- Format the output as strictly (x, y) without any additional text\n' |
|
|
) |
|
|
system_prompt_bbox = ( |
|
|
'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n' |
|
|
'Requirements for the output:\n' |
|
|
'- Return only the bounding box coordinates (x0, y0, x1, y1)\n' |
|
|
'- Coordinates must be normalized to the range [0, 1]\n' |
|
|
'- Round each coordinate to three decimal places\n' |
|
|
'- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n' |
|
|
) |
|
|
system_prompt = system_prompt_point  # switch to system_prompt_bbox for bounding-box output
|
|
user_prompt = 'close the window'  # replace with your own instruction
|
|
image_path = '/path/to/your/local/image' |
|
|
model_path = 'tencent/POINTS-GUI-G' |
|
|
model = AutoModelForCausalLM.from_pretrained(model_path, |
|
|
trust_remote_code=True, |
|
|
dtype=torch.bfloat16, |
|
|
device_map='cuda') |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
|
|
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path) |
|
|
content = [ |
|
|
dict(type='image', image=image_path), |
|
|
dict(type='text', text=user_prompt) |
|
|
] |
|
|
messages = [ |
|
|
{ |
|
|
'role': 'system', |
|
|
'content': [dict(type='text', text=system_prompt)] |
|
|
}, |
|
|
{ |
|
|
'role': 'user', |
|
|
'content': content |
|
|
} |
|
|
] |
|
|
generation_config = { |
|
|
'max_new_tokens': 2048, |
|
|
'do_sample': False |
|
|
} |
|
|
response = model.chat( |
|
|
messages, |
|
|
tokenizer, |
|
|
image_processor, |
|
|
generation_config |
|
|
) |
|
|
print(response) |
|
|
``` |
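
With the point prompt the model replies with plain text such as `(0.512, 0.348)`, and with the bbox prompt it returns four values `(x0, y0, x1, y1)`. Assuming the reply follows the format requested in the system prompt, a small helper along these lines (a sketch; the function names are ours) can convert it back to pixel coordinates for clicking:

```python
import re


def parse_normalized_coords(response: str):
    """Extract the normalized values from a reply like '(0.512, 0.348)'."""
    return tuple(float(v) for v in re.findall(r'\d*\.\d+|\d+', response))


def to_pixels(coords, width: int, height: int):
    """Map normalized [0, 1] coordinates back to pixels (x -> width, y -> height)."""
    return tuple(int(round(v * (width if i % 2 == 0 else height)))
                 for i, v in enumerate(coords))


# e.g. a point reply '(0.512, 0.348)' on a 1920x1080 screenshot -> (983, 376)
print(to_pixels(parse_normalized_coords('(0.512, 0.348)'), 1920, 1080))
```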
|
|
|
|
|
### Deploy with SGLang |
|
|
|
|
|
We have opened a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. Until this PR is merged, you can check out that branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).
|
|
|
|
|
#### How to Deploy |
|
|
|
|
|
You can deploy POINTS-GUI-G with SGLang using the following command: |
|
|
|
|
|
``` |
|
|
python3 -m sglang.launch_server \ |
|
|
--model-path tencent/POINTS-GUI-G \ |
|
|
--tp-size 1 \ |
|
|
--dp-size 1 \ |
|
|
--chunked-prefill-size -1 \ |
|
|
--mem-fraction-static 0.7 \ |
|
|
--chat-template qwen2-vl \ |
|
|
--trust-remote-code \ |
|
|
--port 8081 |
|
|
``` |
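
Once the server is up, it exposes an OpenAI-compatible API on the port above. As a quick sanity check (assuming the default host and that your SGLang build serves the standard `/v1/models` route), you can list the loaded model:

```python
import requests

# Quick readiness check against the OpenAI-compatible endpoint started above;
# adjust host/port if you changed them in the launch command.
print(requests.get('http://127.0.0.1:8081/v1/models', timeout=10).json())
```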
|
|
|
|
|
#### How to Use |
|
|
|
|
|
You can use the following code to obtain results from SGLang: |
|
|
|
|
|
```python |
|
|
|
|
|
from typing import List |
|
|
import requests |
|
|
import json |
|
|
|
|
|
|
|
|
|
|
|
def call_wepoints(messages: List[dict], |
|
|
temperature: float = 0.0, |
|
|
max_new_tokens: int = 2048, |
|
|
repetition_penalty: float = 1.05, |
|
|
top_p: float = 0.8, |
|
|
top_k: int = 20, |
|
|
do_sample: bool = True, |
|
|
url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str: |
|
|
"""Query WePOINTS model to generate a response. |
|
|
|
|
|
Args: |
|
|
messages (List[dict]): A list of messages to be sent to WePOINTS. The |
|
|
messages should be the standard OpenAI messages, like: |
|
|
[ |
|
|
{ |
|
|
'role': 'user', |
|
|
'content': [ |
|
|
{ |
|
|
'type': 'text', |
|
|
'text': 'Please describe this image in short' |
|
|
}, |
|
|
{ |
|
|
'type': 'image_url', |
|
|
                            'image_url': {'url': '/path/to/image.jpg'}
|
|
} |
|
|
] |
|
|
} |
|
|
] |
|
|
temperature (float, optional): The temperature of the model. |
|
|
Defaults to 0.0. |
|
|
max_new_tokens (int, optional): The maximum number of new tokens to generate. |
|
|
Defaults to 2048. |
|
|
repetition_penalty (float, optional): The penalty for repetition. |
|
|
Defaults to 1.05. |
|
|
top_p (float, optional): The top-p probability threshold. |
|
|
Defaults to 0.8. |
|
|
top_k (int, optional): The top-k sampling vocabulary size. |
|
|
Defaults to 20. |
|
|
do_sample (bool, optional): Whether to use sampling or greedy decoding. |
|
|
Defaults to True. |
|
|
url (str, optional): The URL of the WePOINTS model. |
|
|
Defaults to 'http://127.0.0.1:8081/v1/chat/completions'. |
|
|
|
|
|
Returns: |
|
|
str: The generated response from WePOINTS. |
|
|
""" |
|
|
data = { |
|
|
'model': 'WePoints', |
|
|
'messages': messages, |
|
|
'max_new_tokens': max_new_tokens, |
|
|
'temperature': temperature, |
|
|
'repetition_penalty': repetition_penalty, |
|
|
'top_p': top_p, |
|
|
'top_k': top_k, |
|
|
'do_sample': do_sample, |
|
|
} |
|
|
response = requests.post(url, |
|
|
json=data) |
|
|
response = json.loads(response.text) |
|
|
response = response['choices'][0]['message']['content'] |
|
|
return response |
|
|
|
|
|
system_prompt_point = ( |
|
|
'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n' |
|
|
'Requirements for the output:\n' |
|
|
'- Return only the point (x, y) representing the center of the target element\n' |
|
|
'- Coordinates must be normalized to the range [0, 1]\n' |
|
|
'- Round each coordinate to three decimal places\n' |
|
|
'- Format the output as strictly (x, y) without any additional text\n' |
|
|
) |
|
|
system_prompt_bbox = ( |
|
|
'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n' |
|
|
'Requirements for the output:\n' |
|
|
'- Return only the bounding box coordinates (x0, y0, x1, y1)\n' |
|
|
'- Coordinates must be normalized to the range [0, 1]\n' |
|
|
'- Round each coordinate to three decimal places\n' |
|
|
'- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n' |
|
|
) |
|
|
system_prompt = system_prompt_point  # switch to system_prompt_bbox for bounding-box output
|
|
user_prompt = 'close the window'  # replace with your own instruction
|
|
|
|
|
messages = [ |
|
|
{ |
|
|
'role': 'system', |
|
|
'content': [ |
|
|
{ |
|
|
'type': 'text', |
|
|
'text': system_prompt |
|
|
} |
|
|
] |
|
|
}, |
|
|
{ |
|
|
'role': 'user', |
|
|
'content': [ |
|
|
{ |
|
|
'type': 'image_url', |
|
|
'image_url': {'url': '/path/to/image.jpg'} |
|
|
}, |
|
|
{ |
|
|
'type': 'text', |
|
|
'text': user_prompt |
|
|
} |
|
|
] |
|
|
} |
|
|
] |
|
|
response = call_wepoints(messages) |
|
|
print(response) |
|
|
``` |
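
The example above passes a local file path in `image_url`. If your deployment does not resolve local paths, a common alternative with OpenAI-compatible endpoints is to send the screenshot as a base64 data URL; the helper below is a minimal sketch and its name is ours:

```python
import base64
import mimetypes


def image_to_data_url(image_path: str) -> str:
    """Encode a local image as a base64 data URL for OpenAI-style 'image_url' content."""
    mime = mimetypes.guess_type(image_path)[0] or 'image/png'
    with open(image_path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{encoded}'


# Usage: swap the plain path for the encoded URL in the user message, e.g.
# {'type': 'image_url', 'image_url': {'url': image_to_data_url('/path/to/image.jpg')}}
```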
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your work, please cite the following papers:
|
|
|
|
|
``` |
|
|
@article{zhao2026pointsguigguigroundingjourney, |
|
|
title = {POINTS-GUI-G: GUI-Grounding Journey}, |
|
|
author = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie}, |
|
|
journal = {arXiv preprint arXiv:2602.06391}, |
|
|
year = {2026} |
|
|
} |
|
|
|
|
|
@inproceedings{liu2025points, |
|
|
title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion}, |
|
|
author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others}, |
|
|
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing}, |
|
|
pages={1576--1601}, |
|
|
year={2025} |
|
|
} |
|
|
|
|
|
@article{liu2024points1, |
|
|
  title={POINTS1.5: Building a Vision-Language Model towards Real World Applications},
|
|
author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie}, |
|
|
journal={arXiv preprint arXiv:2412.08443}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{liu2024points, |
|
|
title={POINTS: Improving Your Vision-language Model with Affordable Strategies}, |
|
|
author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie}, |
|
|
journal={arXiv preprint arXiv:2409.04828}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{liu2024rethinking, |
|
|
title={Rethinking Overlooked Aspects in Vision-Language Models}, |
|
|
author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie}, |
|
|
journal={arXiv preprint arXiv:2405.11850}, |
|
|
year={2024} |
|
|
} |
|
|
``` |