- GUI
- GUI-Grounding
- Vision-language
---

<p align="center">
  <img src="images/logo.png"/>
</p>

<p align="center">
  <a href="https://huggingface.co/tencent/POINTS-GUI-G">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
  </a>
  <a href="https://github.com/Tencent/POINTS-GUI">
    <img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
  </a>
  <a href="coming soon">
    <img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
  </a>
  <a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
    <img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
  </a>
</p>

## News

- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are pleased to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.

## Introduction

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5, which initially lacked native grounding ability. We have mastered the complete technical pipeline, proving that a GUI specialist can be built from a general-purpose base model through targeted optimization.

3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems and task formats, and contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to the [0, 1] range and reformats heterogeneous tasks into a single "locate UI element" formulation (a minimal sketch of this normalization step is shown below), (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.

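To make the coordinate-standardization step concrete, here is a minimal sketch of converting a pixel-space annotation into the unified normalized point format. The sample schema, field names, and helper function are hypothetical illustrations, not the actual pipeline code.

```python
# Hypothetical illustration of the coordinate-standardization step described above.
# Field names and the sample schema are assumptions, not the real pipeline format.

def to_normalized_center(bbox_px, image_width, image_height):
    """Convert a pixel-space box (x0, y0, x1, y1) to a normalized (x, y) center."""
    x0, y0, x1, y1 = bbox_px
    cx = round((x0 + x1) / 2 / image_width, 3)
    cy = round((y0 + y1) / 2 / image_height, 3)
    return cx, cy


raw_sample = {                     # a raw annotation in pixel coordinates
    'instruction': 'close the window',
    'bbox': (1820, 12, 1890, 48),  # (x0, y0, x1, y1) in pixels
    'width': 1920,
    'height': 1080,
}

unified_sample = {                 # the unified "locate UI element" formulation
    'instruction': raw_sample['instruction'],
    'point': to_normalized_center(raw_sample['bbox'],
                                  raw_sample['width'],
                                  raw_sample['height']),
}
print(unified_sample)  # {'instruction': 'close the window', 'point': (0.966, 0.028)}
```
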
## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.



## Examples

### Prediction on desktop screenshots




### Prediction on mobile screenshots



### Prediction on web screenshots




## Getting Started

The code snippets below have been tested in the following environment:

```
python==3.12.11
torch==2.9.1
transformers==4.57.1
cuda==12.6
```

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following commands:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor

system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)
system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}
response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```

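The system prompts above instruct the model to return normalized coordinates in a strict `(x, y)` format. The snippet below is a minimal, hypothetical sketch of turning such a response back into pixel coordinates for an actual click; the regular expression and helper name are our own and assume the model follows the requested output format exactly.

```python
# Hypothetical post-processing sketch: map a normalized '(x, y)' response back to pixels.
import re


def parse_point(response: str, image_width: int, image_height: int):
    """Parse a '(x, y)' response with [0, 1] coordinates into pixel coordinates."""
    match = re.search(r'\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)', response)
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    return round(x * image_width), round(y * image_height)


# Example: a 1920x1080 screenshot and a response like '(0.966, 0.028)'
print(parse_point('(0.966, 0.028)', 1920, 1080))  # (1855, 30)
```
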
### Deploy with SGLang

We have opened a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. Until this PR is merged, you can check out its branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html).

#### How to Deploy

You can deploy POINTS-GUI-G with SGLang using the following command:

```sh
python3 -m sglang.launch_server \
    --model-path tencent/POINTS-GUI-G \
    --tp-size 1 \
    --dp-size 1 \
    --chunked-prefill-size -1 \
    --mem-fraction-static 0.7 \
    --chat-template qwen2-vl \
    --trust-remote-code \
    --port 8081
```

#### How to Use

You can use the following code to obtain results from SGLang:

```python
import json
from typing import List

import requests


def call_wepoints(messages: List[dict],
                  temperature: float = 0.0,
                  max_new_tokens: int = 2048,
                  repetition_penalty: float = 1.05,
                  top_p: float = 0.8,
                  top_k: int = 20,
                  do_sample: bool = True,
                  url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
    """Query the WePOINTS model to generate a response.

    Args:
        messages (List[dict]): A list of messages to be sent to WePOINTS. The
            messages should follow the standard OpenAI format, like:
            [
                {
                    'role': 'user',
                    'content': [
                        {
                            'type': 'text',
                            'text': 'Please describe this image in short'
                        },
                        {
                            'type': 'image_url',
                            'image_url': {'url': '/path/to/image.jpg'}
                        }
                    ]
                }
            ]
        temperature (float, optional): The temperature of the model.
            Defaults to 0.0.
        max_new_tokens (int, optional): The maximum number of new tokens to
            generate. Defaults to 2048.
        repetition_penalty (float, optional): The penalty for repetition.
            Defaults to 1.05.
        top_p (float, optional): The top-p probability threshold.
            Defaults to 0.8.
        top_k (int, optional): The top-k sampling vocabulary size.
            Defaults to 20.
        do_sample (bool, optional): Whether to use sampling or greedy decoding.
            Defaults to True.
        url (str, optional): The URL of the WePOINTS model.
            Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.

    Returns:
        str: The generated response from WePOINTS.
    """
    data = {
        'model': 'WePoints',
        'messages': messages,
        'max_new_tokens': max_new_tokens,
        'temperature': temperature,
        'repetition_penalty': repetition_penalty,
        'top_p': top_p,
        'top_k': top_k,
        'do_sample': do_sample,
    }
    response = requests.post(url, json=data)
    response = json.loads(response.text)
    return response['choices'][0]['message']['content']


system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)
system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = None  # replace with your instruction (e.g., 'close the window')

messages = [
    {
        'role': 'system',
        'content': [
            {
                'type': 'text',
                'text': system_prompt
            }
        ]
    },
    {
        'role': 'user',
        'content': [
            {
                'type': 'image_url',
                'image_url': {'url': '/path/to/image.jpg'}
            },
            {
                'type': 'text',
                'text': user_prompt
            }
        ]
    }
]
response = call_wepoints(messages)
print(response)
```

## Citation

If you use this model in your work, please cite the following papers:

```
@inproceedings{liu2025points,
  title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={1576--1601},
  year={2025}
}

@article{liu2024points1,
  title={POINTS1.5: Building a Vision-Language Model towards Real World Applications},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
  journal={arXiv preprint arXiv:2412.08443},
  year={2024}
}

@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```