---
base_model:
- Qwen/Qwen3-8B-Base
- tencent/POINTS-Reader
- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
language:
- en
- zh
license: other
metrics:
- accuracy
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- GUI
- GUI-Grounding
- Vision-language
- multimodal
---


## News

- 🔜 Upcoming: The end-to-end GUI agent model is currently under active development and will be released in a subsequent update. Stay tuned!
- 🚀 2026.02.06: We are happy to present POINTS-GUI-G, our specialized GUI grounding model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our GitHub repository.

## Introduction

POINTS-GUI-G-8B is a specialized GUI grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).

1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, scoring 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
2. **Full-Stack Mastery**: Unlike many current GUI agents that build on models that already possess strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up on POINTS-1.5. We have mastered the complete technical pipeline, demonstrating that a specialized GUI model can be built from a general-purpose base model through targeted optimization.
3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to the [0, 1] range and reformats heterogeneous tasks into a single "locate the UI element" formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.

## Results

We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

![Results](images/results.png)

## Getting Started

### Run with Transformers

Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following commands:

```sh
git clone https://github.com/WePOINTS/WePOINTS.git
cd ./WePOINTS
pip install -e .
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
import torch

system_prompt_point = (
    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user. '
    'Requirements for the output: '
    '- Return only the point (x, y) representing the center of the target element '
    '- Coordinates must be normalized to the range [0, 1] '
    '- Round each coordinate to three decimal places '
    '- Format the output strictly as (x, y) without any additional text'
)

system_prompt_bbox = (
    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user. '
    'Requirements for the output: '
    '- Return only the bounding box coordinates (x0, y0, x1, y1) '
    '- Coordinates must be normalized to the range [0, 1] '
    '- Round each coordinate to three decimal places '
    '- Format the output strictly as (x0, y0, x1, y1) without any additional text'
)

system_prompt = system_prompt_point  # or system_prompt_bbox
user_prompt = "Click the 'Login' button"  # replace with your instruction
image_path = '/path/to/your/local/image'
model_path = 'tencent/POINTS-GUI-G'

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map='cuda'
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)

content = [
    dict(type='image', image=image_path),
    dict(type='text', text=user_prompt)
]
messages = [
    {
        'role': 'system',
        'content': [dict(type='text', text=system_prompt)]
    },
    {
        'role': 'user',
        'content': content
    }
]
generation_config = {
    'max_new_tokens': 2048,
    'do_sample': False
}

response = model.chat(
    messages,
    tokenizer,
    image_processor,
    generation_config
)
print(response)
```

## Citation

If you use this model in your work, please cite the following papers:

```
@article{zhao2026pointsguigguigroundingjourney,
  title   = {POINTS-GUI-G: GUI-Grounding Journey},
  author  = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
  journal = {arXiv preprint arXiv:2602.06391},
  year    = {2026}
}

@inproceedings{liu2025points,
  title     = {POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
  author    = {Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages     = {1576--1601},
  year      = {2025}
}
```
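
With the point-style system prompt, the model returns a normalized point such as `(0.483, 0.207)` in the [0, 1] range described above. To act on it (e.g., to issue a click) you need pixel coordinates for the actual screenshot. The sketch below parses such a response and rescales it by the image size; note that `point_to_pixels` and the example response are our own illustration, not part of the WePOINTS API.

```python
import re

def point_to_pixels(response: str, width: int, height: int) -> tuple[int, int]:
    """Parse a '(x, y)' response with [0, 1] coordinates and map it to pixels."""
    # Accept values like 0, 1, 0.483 inside optional whitespace
    match = re.search(r'\(\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\)', response)
    if match is None:
        raise ValueError(f'Unexpected response format: {response!r}')
    x, y = float(match.group(1)), float(match.group(2))
    # Rescale normalized coordinates to the screenshot resolution
    return round(x * width), round(y * height)

# Example: a 1920x1080 screenshot and a hypothetical model response
print(point_to_pixels('(0.483, 0.207)', 1920, 1080))  # → (927, 224)
```

The same rescaling applies to the bounding-box prompt; each of the four `(x0, y0, x1, y1)` values is multiplied by the width or height accordingly.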