---
base_model:
  - Qwen/Qwen3-8B-Base
  - tencent/POINTS-Reader
  - WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
language:
  - en
  - zh
license: other
metrics:
  - accuracy
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - GUI
  - GUI-Grounding
  - Vision-language
  - multimodal
---


News

  • 🔜 Upcoming: The End-to-End GUI Agent Model is currently under active development and will be released in a subsequent update. Stay tuned!
  • 🚀 2026.02.06: We are happy to present POINTS-GUI-G, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our GitHub Repository.

Introduction

POINTS-GUI-G-8B is a specialized GUI Grounding model introduced in the paper POINTS-GUI-G: GUI-Grounding Journey.

  1. State-of-the-Art Performance: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

  2. Full-Stack Mastery: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5. We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.

  3. Refined Data Engineering: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.
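As a minimal illustration of step (1), coordinate standardization, here is a sketch of mapping pixel-space bounding boxes into the [0, 1] range with three-decimal rounding. `normalize_bbox` is a hypothetical helper for illustration only; the actual pipeline code is not part of this release.

```python
def normalize_bbox(x0, y0, x1, y1, width, height):
    """Map a pixel-space bounding box to the [0, 1] range, rounded to
    three decimal places (the convention used by the data pipeline).

    Hypothetical helper, not part of the released code.
    """
    return (round(x0 / width, 3), round(y0 / height, 3),
            round(x1 / width, 3), round(y1 / height, 3))


# A box covering the center-left quadrant of a 1280x720 screenshot:
print(normalize_bbox(320, 180, 640, 360, 1280, 720))  # -> (0.25, 0.25, 0.5, 0.5)
```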

Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

(Figure: benchmark comparison of POINTS-GUI-G-8B against open-source and proprietary baselines)

Getting Started

Run with Transformers

Please first install WePOINTS using the following commands:

```shell
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

# System prompt for point-style output: a single normalized (x, y) center point.
system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the '
    'exact position of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the point (x, y) representing the center of the target element\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x, y) without any additional text\n'
)
# System prompt for bounding-box output: normalized (x0, y0, x1, y1) corners.
system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the '
    'bounding box of the element that matches the instruction given by the user.\n\n'
    'Requirements for the output:\n'
    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
    '- Coordinates must be normalized to the range [0, 1]\n'
    '- Round each coordinate to three decimal places\n'
    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
)

system_prompt = system_prompt_point  # or system_prompt_bbox for bounding-box output
user_prompt = "Click the 'Login' button"  # replace with your instruction
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             trust_remote_code=True,
                                             dtype=torch.bfloat16,
                                             device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False  # greedy decoding for deterministic grounding output
}

response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```
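Since the system prompt constrains the model to emit a normalized `(x, y)` point, a small amount of post-processing maps the response back to pixel coordinates for an actual click. The `parse_point` helper below is a hypothetical sketch, not part of WePOINTS:

```python
import re


def parse_point(response, width, height):
    """Parse a '(x, y)' response with [0, 1] coordinates and map it back
    to pixel coordinates on a screenshot of the given size.

    Hypothetical helper for illustration; not part of WePOINTS.
    """
    match = re.search(r'\(\s*([01]?\.\d+|[01])\s*,\s*([01]?\.\d+|[01])\s*\)', response)
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    return int(x * width), int(y * height)


# e.g. for a response on a 1920x1080 screenshot:
print(parse_point('(0.153, 0.842)', 1920, 1080))  # -> (293, 909)
```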

Citation

If you use this model in your work, please cite the following paper:

@article{zhao2026pointsguigguigroundingjourney,
  title   = {POINTS-GUI-G: GUI-Grounding Journey},
  author  = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
  journal = {arXiv preprint arXiv:2602.06391},
  year    = {2026}
}

@inproceedings{liu2025points,
  title     = {POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author    = {Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages     = {1576--1601},
  year      = {2025}
}