POINTS-GUI-G / README.md

Add library_name and pipeline_tag to metadata

626e819 verified 3 days ago

6.35 kB

	---
	base_model:
	- Qwen/Qwen3-8B-Base
	- tencent/POINTS-Reader
	- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
	language:
	- en
	- zh
	license: other
	metrics:
	- accuracy
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- GUI
	- GUI-Grounding
	- Vision-language
	- multimodal
	---

	<p align="center">
	<img src="images/logo.png"/>
	<p>

	<p align="center">
	<a href="https://huggingface.co/tencent/POINTS-GUI-G">
	<img src="https://img.shields.io/badge/%F0%9F%A4%97_HuggingFace-Model-ffbd45.svg" alt="HuggingFace">
	</a>
	<a href="https://github.com/Tencent/POINTS-GUI">
	<img src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&" alt="GitHub Code">
	</a>
	<a href="https://huggingface.co/papers/2602.06391">
	<img src="https://img.shields.io/badge/Paper-POINTS--GUI--G-d4333f?logo=arxiv&logoColor=white&colorA=cccccc&colorB=d4333f&style=flat" alt="Paper">
	</a>
	<a href="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
	<img src="https://komarev.com/ghpvc/?username=tencent&repo=POINTS-GUI&color=brightgreen&label=Views" alt="view">
	</a>
	</p>

	## News

	- 🔜 <b>Upcoming:</b> The <b>End-to-End GUI Agent Model</b> is currently under active development and will be released in a subsequent update. Stay tuned!
	- 🚀 2026.02.06: We are happy to present <b>POINTS-GUI-G</b>, our specialized GUI Grounding Model. To facilitate reproducible evaluation, we provide comprehensive scripts and guidelines in our <a href="https://github.com/Tencent/POINTS-GUI/tree/main/evaluation">GitHub Repository</a>.

	## Introduction

	POINTS-GUI-G-8B is a specialized GUI Grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).

	1. State-of-the-Art Performance: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

	2. Full-Stack Mastery: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5. We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.

	3. Refined Data Engineering: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.

	## Results

	We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, and UI-Vision. The figure below summarizes our results compared with existing open-source and proprietary baselines.

	![Example 1](images/results.png)

	## Getting Started

	### Run with Transformers

	Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:

	```sh
	git clone https://github.com/WePOINTS/WePOINTS.git
	cd ./WePOINTS
	pip install -e .
	```

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor
	import torch

	system_prompt_point = (
	'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.

	'
	'Requirements for the output:
	'
	'- Return only the point (x, y) representing the center of the target element
	'
	'- Coordinates must be normalized to the range [0, 1]
	'
	'- Round each coordinate to three decimal places
	'
	'- Format the output as strictly (x, y) without any additional text
	'
	)
	system_prompt_bbox = (
	'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.

	'
	'Requirements for the output:
	'
	'- Return only the bounding box coordinates (x0, y0, x1, y1)
	'
	'- Coordinates must be normalized to the range [0, 1]
	'
	'- Round each coordinate to three decimal places
	'
	'- Format the output as strictly (x0, y0, x1, y1) without any additional text.
	'
	)
	system_prompt = system_prompt_point # system_prompt_bbox
	user_prompt = "Click the 'Login' button" # replace with your instruction
	image_path = '/path/to/your/local/image'
	model_path = 'tencent/POINTS-GUI-G'
	model = AutoModelForCausalLM.from_pretrained(model_path,
	trust_remote_code=True,
	dtype=torch.bfloat16,
	device_map='cuda')
	tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
	image_processor = Qwen2VLImageProcessor.from_pretrained(model_path)
	content = [
	dict(type='image', image=image_path),
	dict(type='text', text=user_prompt)
	]
	messages = [
	{
	'role': 'system',
	'content': [dict(type='text', text=system_prompt)]
	},
	{
	'role': 'user',
	'content': content
	}
	]
	generation_config = {
	'max_new_tokens': 2048,
	'do_sample': False
	}
	response = model.chat(
	messages,
	tokenizer,
	image_processor,
	generation_config
	)
	print(response)
	```

	## Citation

	If you use this model in your work, please cite the following paper:

	```
	@article{zhao2026pointsguigguigroundingjourney,
	title = {POINTS-GUI-G: GUI-Grounding Journey},
	author = {Zhao, Zhongyin and Liu, Yuan and Liu, Yikun and Wang, Haicheng and Tian, Le and Zhou, Xiao and You, Yangxiu and Yu, Zilin and Yu, Yang and Zhou, Jie},
	journal = {arXiv preprint arXiv:2602.06391},
	year = {2026}
	}

	@inproceedings{liu2025points,
	title={POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion},
	author={Liu, Yuan and Zhao, Zhongyin and Tian, Le and Wang, Haicheng and Ye, Xubing and You, Yangxiu and Yu, Zilin and Wu, Chuhan and Xiao, Zhou and Yu, Yang and others},
	booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
	pages={1576--1601},
	year={2025}
	}
	```