Add library_name and pipeline_tag to metadata

Hi! I'm Niels from the community science team at Hugging Face.

This PR improves your model card's metadata by adding `library_name: transformers` and `pipeline_tag: image-text-to-text`. These additions enable the "Use in Transformers" button on the model page and help users find the model through task-based filtering.

I've also kept the existing usage examples and benchmark results from your README.
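
For reviewers who want to confirm the two new fields locally, here is a minimal sketch that reads the YAML front matter of a README using only the standard library. The helper name `read_front_matter` and the `SAMPLE_README` string are hypothetical, and the parser handles only flat `key: value` pairs and simple `- item` lists, which is all this model card's header uses. It is not a general YAML parser.

```python
import re


def read_front_matter(readme_text: str) -> dict:
    """Parse flat 'key: value' pairs and '- item' lists from a README's YAML header."""
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", readme_text, re.DOTALL)
    if not match:
        return {}
    metadata = {}
    current_key = None
    for line in match.group(1).splitlines():
        if not line.strip():
            continue
        # A list item belongs to the most recent key that opened a list.
        if line.startswith("- ") and isinstance(metadata.get(current_key), list):
            metadata[current_key].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value means the key introduces a list on the next lines.
            metadata[current_key] = value if value else []
    return metadata


# Hypothetical excerpt modeled on the header this PR produces.
SAMPLE_README = """---
base_model:
- Qwen/Qwen3-8B-Base
language:
- en
- zh
license: other
library_name: transformers
pipeline_tag: image-text-to-text
---
## Introduction
"""

meta = read_front_matter(SAMPLE_README)
print(meta["library_name"], meta["pipeline_tag"])
```

In practice a full YAML parser is the safer choice; the hand-rolled version above only keeps the example dependency-free.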

Files changed (1)

README.md +38 -209

README.md CHANGED

@@ -1,14 +1,16 @@
 ---
-license: other
 language:
 - en
 - zh
 metrics:
 - accuracy
-base_model:
-- Qwen/Qwen3-8B-Base
-- tencent/POINTS-Reader
-- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
 tags:
 - GUI
 - GUI-Grounding
@@ -42,11 +44,13 @@ tags:
 ## Introduction
 1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
-2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5 (which initially lacked native grounding ability). We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.
-3. **Refined Data Engineering**: Existing GUI datasets differ in coordinate systems, task formats, and contain substantial noise. We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases
 ## Results
@@ -54,35 +58,8 @@ We evaluate POINTS-GUI-G-8B on four widely used GUI grounding benchmarks: Screen
 ![Example 1](images/results.png)
-## Examples
-### Prediction on desktop screenshots
-![Example 1](images/example_desktop_1.png)
-![Example 1](images/example_desktop_2.png)
-![Example 1](images/example_desktop_3.png)
-### Prediction on mobile screenshots
-![Example 1](images/example_mobile.png)
-### Prediction on web screenshots
-![Example 1](images/example_web_1.png)
-![Example 1](images/example_web_2.png)
-![Example 1](images/example_web_3.png)
 ## Getting Started
-This following code snippet has been tested with following environment:
-```
-python==3.12.11
-torch==2.9.1
-transformers==4.57.1
-cuda==12.6
-```
 ### Run with Transformers
 Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:
@@ -98,23 +75,37 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProces
 import torch
 system_prompt_point = (
-    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
-    'Requirements for the output:\n'
-    '- Return only the point (x, y) representing the center of the target element\n'
-    '- Coordinates must be normalized to the range [0, 1]\n'
-    '- Round each coordinate to three decimal places\n'
-    '- Format the output as strictly (x, y) without any additional text\n'
 )
 system_prompt_bbox = (
-    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
-    'Requirements for the output:\n'
-    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
-    '- Coordinates must be normalized to the range [0, 1]\n'
-    '- Round each coordinate to three decimal places\n'
-    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
 )
 system_prompt = system_prompt_point  # system_prompt_bbox
-user_prompt = None  # replace with your instruction (e.g., 'close the window')
 image_path = '/path/to/your/local/image'
 model_path = 'tencent/POINTS-GUI-G'
 model = AutoModelForCausalLM.from_pretrained(model_path,
@@ -150,147 +141,6 @@ response = model.chat(
 print(response)
 ```
-### Deploy with SGLang
-We have created a [Pull Request](https://github.com/sgl-project/sglang/pull/17989) for SGLang. You can check out this branch and install SGLang in editable mode by following the [official guide](https://docs.sglang.ai/get_started/install.html) prior to the merging of this PR.
-#### How to Deploy
-You can deploy POINTS-GUI-G with SGLang using the following command:
-```
-python3 -m sglang.launch_server \
---model-path tencent/POINTS-GUI-G \
---tp-size 1 \
---dp-size 1 \
---chunked-prefill-size -1 \
---mem-fraction-static 0.7 \
---chat-template qwen2-vl \
---trust-remote-code \
---port 8081
-```
-#### How to Use
-You can use the following code to obtain results from SGLang:
-```python
-from typing import List
-import requests
-import json
-def call_wepoints(messages: List[dict],
-                 temperature: float = 0.0,
-                 max_new_tokens: int = 2048,
-                 repetition_penalty: float = 1.05,
-                 top_p: float = 0.8,
-                 top_k: int = 20,
-                 do_sample: bool = True,
-                 url: str = 'http://127.0.0.1:8081/v1/chat/completions') -> str:
-    """Query WePOINTS model to generate a response.
-    Args:
-        messages (List[dict]): A list of messages to be sent to WePOINTS. The
-            messages should be the standard OpenAI messages, like:
-            [
-                {
-                    'role': 'user',
-                    'content': [
-                        {
-                            'type': 'text',
-                            'text': 'Please describe this image in short'
-                        },
-                        {
-                            'type': 'image_url',
-                            'image_url': {'url': /path/to/image.jpg}
-                        }
-                    ]
-                }
-            ]
-        temperature (float, optional): The temperature of the model.
-            Defaults to 0.0.
-        max_new_tokens (int, optional): The maximum number of new tokens to generate.
-            Defaults to 2048.
-        repetition_penalty (float, optional): The penalty for repetition.
-            Defaults to 1.05.
-        top_p (float, optional): The top-p probability threshold.
-            Defaults to 0.8.
-        top_k (int, optional): The top-k sampling vocabulary size.
-            Defaults to 20.
-        do_sample (bool, optional): Whether to use sampling or greedy decoding.
-            Defaults to True.
-        url (str, optional): The URL of the WePOINTS model.
-            Defaults to 'http://127.0.0.1:8081/v1/chat/completions'.
-    Returns:
-        str: The generated response from WePOINTS.
-    """
-    data = {
-        'model': 'WePoints',
-        'messages': messages,
-        'max_new_tokens': max_new_tokens,
-        'temperature': temperature,
-        'repetition_penalty': repetition_penalty,
-        'top_p': top_p,
-        'top_k': top_k,
-        'do_sample': do_sample,
-    }
-    response = requests.post(url,
-                             json=data)
-    response = json.loads(response.text)
-    response = response['choices'][0]['message']['content']
-    return response
-system_prompt_point = (
-    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
-    'Requirements for the output:\n'
-    '- Return only the point (x, y) representing the center of the target element\n'
-    '- Coordinates must be normalized to the range [0, 1]\n'
-    '- Round each coordinate to three decimal places\n'
-    '- Format the output as strictly (x, y) without any additional text\n'
-)
-system_prompt_bbox = (
-    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
-    'Requirements for the output:\n'
-    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
-    '- Coordinates must be normalized to the range [0, 1]\n'
-    '- Round each coordinate to three decimal places\n'
-    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
-)
-system_prompt = system_prompt_point  # system_prompt_bbox
-user_prompt = None  # replace with your instruction (e.g., 'close the window')
-messages = [
-            {
-              'role': 'system',
-              'content': [
-                  {
-                      'type': 'text',
-                      'text': system_prompt
-                  }
-              ]
-            },
-            {
-              'role': 'user',
-              'content': [
-                  {
-                      'type': 'image_url',
-                      'image_url': {'url': '/path/to/image.jpg'}
-                  },
-                  {
-                      'type': 'text',
-                      'text': user_prompt
-                  }
-              ]
-            }
-           ]
-response = call_wepoints(messages)
-print(response)
-```
 ## Citation
 If you use this model in your work, please cite the following paper:
@@ -310,25 +160,4 @@ If you use this model in your work, please cite the following paper:
   pages={1576--1601},
   year={2025}
 }
-@article{liu2024points1,
-  title={POINTS1. 5: Building a Vision-Language Model towards Real World Applications},
-  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Gao, Xinyu and Yu, Kavio and Yu, Yang and Zhou, Jie},
-  journal={arXiv preprint arXiv:2412.08443},
-  year={2024}
-}
-@article{liu2024points,
-  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
-  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
-  journal={arXiv preprint arXiv:2409.04828},
-  year={2024}
-}
-@article{liu2024rethinking,
-  title={Rethinking Overlooked Aspects in Vision-Language Models},
-  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
-  journal={arXiv preprint arXiv:2405.11850},
-  year={2024}
-}
 ```

 ---
+base_model:
+- Qwen/Qwen3-8B-Base
+- tencent/POINTS-Reader
+- WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat
 language:
 - en
 - zh
+license: other
 metrics:
 - accuracy
+library_name: transformers
+pipeline_tag: image-text-to-text
 tags:
 - GUI
 - GUI-Grounding
 ## Introduction
+POINTS-GUI-G-8B is a specialized GUI Grounding model introduced in the paper [POINTS-GUI-G: GUI-Grounding Journey](https://huggingface.co/papers/2602.06391).
 1. **State-of-the-Art Performance**: POINTS-GUI-G-8B achieves leading results on multiple GUI grounding benchmarks, with 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.
+2. **Full-Stack Mastery**: Unlike many current GUI agents that build upon models already possessing strong grounding capabilities (e.g., Qwen3-VL), POINTS-GUI-G-8B is developed from the ground up using POINTS-1.5. We have mastered the complete technical pipeline, proving that a specialized GUI specialist can be built from a general-purpose base model through targeted optimization.
+3. **Refined Data Engineering**: We build a unified data pipeline that (1) standardizes all coordinates to a [0, 1] range and reformats heterogeneous tasks into a single “locate UI element” formulation, (2) automatically filters noisy or incorrect annotations, and (3) explicitly increases difficulty via layout-based filtering and synthetic hard cases.
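
As an illustration of step (1) above, the following sketch converts a pixel-space bounding box to the [0, 1] range and rounds it to three decimal places, matching the output format requested in the prompts. The function name `normalize_bbox` and the sample numbers are hypothetical; this is not the authors' data pipeline, only one plausible way to express the coordinate standardization.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def normalize_bbox(bbox: BBox, width: int, height: int) -> BBox:
    """Map a pixel-space (x0, y0, x1, y1) box to [0, 1] coordinates."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    x0, y0, x1, y1 = bbox
    # Divide x values by the image width and y values by the image height.
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    # Clamp to [0, 1] so slightly out-of-frame annotations stay valid.
    return tuple(round(min(max(v, 0.0), 1.0), 3) for v in normalized)


print(normalize_bbox((480, 270, 960, 540), width=1920, height=1080))
```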
 ## Results
 ![Example 1](images/results.png)
 ## Getting Started
 ### Run with Transformers
 Please first install [WePOINTS](https://github.com/WePOINTS/WePOINTS) using the following command:
 import torch
 system_prompt_point = (
+    'You are a GUI agent. Based on the UI screenshot provided, please locate the exact position of the element that matches the instruction given by the user.\n\n'
+    'Requirements for the output:\n'
+    '- Return only the point (x, y) representing the center of the target element\n'
+    '- Coordinates must be normalized to the range [0, 1]\n'
+    '- Round each coordinate to three decimal places\n'
+    '- Format the output as strictly (x, y) without any additional text\n'
 )
 system_prompt_bbox = (
+    'You are a GUI agent. Based on the UI screenshot provided, please output the bounding box of the element that matches the instruction given by the user.\n\n'
+    'Requirements for the output:\n'
+    '- Return only the bounding box coordinates (x0, y0, x1, y1)\n'
+    '- Coordinates must be normalized to the range [0, 1]\n'
+    '- Round each coordinate to three decimal places\n'
+    '- Format the output as strictly (x0, y0, x1, y1) without any additional text.\n'
 )
 system_prompt = system_prompt_point  # system_prompt_bbox
+user_prompt = "Click the 'Login' button"  # replace with your instruction
 image_path = '/path/to/your/local/image'
 model_path = 'tencent/POINTS-GUI-G'
 model = AutoModelForCausalLM.from_pretrained(model_path,
 print(response)
 ```
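
Because the point prompt asks for a normalized `(x, y)` string, callers usually need to turn the response into pixel coordinates before acting on it. The sketch below assumes the response follows that format exactly. The helpers `parse_point` and `to_pixels` are hypothetical names and not part of WePOINTS or Transformers.

```python
import re
from typing import Tuple


def parse_point(response: str) -> Tuple[float, float]:
    """Extract a normalized (x, y) point from a response such as '(0.250, 0.500)'."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if len(numbers) != 2:
        raise ValueError(f"expected two coordinates, got: {response!r}")
    x, y = (float(n) for n in numbers)
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"coordinates outside [0, 1]: {response!r}")
    return x, y


def to_pixels(point: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Scale a normalized point to pixel indices, clamped to the image bounds."""
    x, y = point
    return min(int(x * width), width - 1), min(int(y * height), height - 1)


# Hypothetical response string in the format the point prompt requests.
point = parse_point("(0.250, 0.500)")
print(to_pixels(point, width=1920, height=1080))
```

Raising on malformed output rather than guessing keeps a downstream click from landing on the wrong element.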
 ## Citation
 If you use this model in your work, please cite the following paper:
   pages={1576--1601},
   year={2025}
 }
 ```