---
base_model:
- Qwen/Qwen3-8B-Base
- tencent/POINTS-Reader
- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
language:
- en
- zh
license: other
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- GUI
- GUI-Grounding
- Vision-language
- multimodal
---
<p align="center">
<img src="images/logo.png"/>
</p>
<p align="center">
<a href="https://huggingface.co/tencent/POINTS-GUI-G">
<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
</a>
<a href="https://github.com/Tencent/POINTS-GUI">
<img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
</a>
<a href="https://huggingface.co/papers/2602.06391">
<img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
</a>
  <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views">
    <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
  </a>
</p>
## News
- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.
## Introduction
POINTS-GUI-G-8B is a specialized GUI Grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).
1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5. We have mastered the complete technical pipeline, proving that a strong GUI specialist can be built from a general-purpose base model through targeted optimization.
3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.
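As an illustration of the coordinate-standardization step above, the following is a minimal sketch (the function name and screenshot dimensions are our own, not part of the released pipeline) of mapping pixel coordinates into the unified [0, 1] range with three-decimal rounding:

```python
def normalize_point(x_px, y_px, width, height):
    """Map a pixel coordinate to the [0, 1] range, rounded to three decimals."""
    return round(x_px / width, 3), round(y_px / height, 3)

# A 1920x1080 screenshot whose target element is centered at pixel (960, 540)
print(normalize_point(960, 540, 1920, 1080))  # (0.5, 0.5)
```

Standardizing every annotation this way lets heterogeneous datasets with different resolutions share a single "locate UI element" output format.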
## Results
We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

## Getting Started
### Run with Transformers
Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:
```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch
system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)
system_prompt = system_prompt_point  # or system_prompt_bbox for bounding-box output
user_prompt = "Click the 'Login' button"  # replace with your instruction
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
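Per the point prompt above, the model replies with a normalized `(x, y)` pair. To act on that reply (e.g., to issue a click), the coordinates must be scaled back to the screenshot's pixel size. A minimal sketch, assuming the response follows the strict `(x, y)` format (the parser below is our own, not part of the released code):

```python
import re

def parse_point(response, width, height):
    """Parse a '(x, y)' response with [0, 1] coordinates into pixel coordinates."""
    match = re.search(r'\(\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\)', response)
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    # Scale normalized coordinates back to the original screenshot resolution
    return round(x * width), round(y * height)

print(parse_point('(0.493, 0.276)', 1920, 1080))  # (947, 298)
```

Using a regular expression rather than `eval` keeps the parser robust if the model ever emits surrounding whitespace or extra text.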
## Citation
If you use this model in your work, please cite the following paper:
```
@article{zhao2026pointsguigguigroundingjourney,
title = {POINTS-GUI-G: GUI-Grounding Journey},
author = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
journal = {arXiv preprint arXiv:2602.06391},
year = {2026}
}
@inproceedings{liu2025points,
title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={1576--1601},
year={2025}
}
``` |