---
license: apache-2.0
tags:
- computer-vision
- graphical-user-interface
- ui-automation
- gui-grounding
datasets:
- ScreenSpot-v2
- ScreenSpot-Pro
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

## Model Card for V2P: Valley-to-Peak GUI Grounding Model

### Model Details

* **Model Name:** V2P (Valley-to-Peak)
* **Version:** 1.0
* **Model Type:** GUI Grounding / UI Element Localization
* **Developers:** Jikai Chen, Long Chen, Dong Wang, Zhixuan Chu, Qinglin Su, Leilei Gan, Chenyi Zhuang, Jinjie Gu
* **Paper:** [V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task](https://arxiv.org/abs/2508.13634)
* **Repository:** [GitHub](https://github.com/inclusionAI/AgenticLearning/tree/main/V2P)

### Model Description

**V2P (Valley-to-Peak)** is a model for robust, precise localization (grounding) of Graphical User Interface (GUI) elements. For GUI automation agents, accurately identifying interactive elements on screen is critical, yet traditional methods such as bounding-box regression or center-point prediction often overlook the spatial uncertainty of interaction and the hierarchical visual-semantic relationships of an interface, leading to insufficient localization accuracy.

V2P addresses two major pain points of existing methods:
1. **Attention drift due to background interference:** the model's attention mistakenly disperses to irrelevant background areas.
2. **Imprecise click locations:** the model fails to distinguish the center of a target element from its edges, leading to interaction failures.

Inspired by how humans visually process and interact with GUIs, V2P introduces two core mechanisms (a minimal sketch of the heatmap construction follows this list):

* **"Valley" Background Suppression:** V2P employs a suppressive attention mechanism that actively minimizes the model's focus on irrelevant background regions. Pushing down the weight of the background (forming a valley) in turn highlights the target area (forming a peak), addressing attention drift at its source.

* **"Peak" Center Focusing:** Drawing on the classic Fitts' Law, V2P models the GUI interaction process as a 2D Gaussian heatmap. The model is trained to predict a weight distribution that peaks at the center of the target element and decays gradually toward its edges. This teaches the model to focus on the most suitable central area for clicking, rather than the entire ambiguous region, substantially improving click precision.
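To make the valley-to-peak idea concrete, here is a minimal, hypothetical sketch of how such a supervision target could be built: the target is near zero over the background (the valley) and rises to 1 at the element's center (the peak), decaying toward the edges. The grid size, the sigma heuristic tying the Gaussian width to the element size, and the function name are illustrative assumptions, not the authors' released training code.

```python
import torch

def gaussian_target(h, w, box, sigma_scale=0.25):
    """Build a 'valley-to-peak' target: ~0 over the background (the valley)
    and a 2D Gaussian peaking at the element center (the peak).

    box = (x0, y0, x1, y1) in pixel coordinates. sigma_scale is an
    illustrative heuristic tying the Gaussian width to the element size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # element center
    sx = max((x1 - x0) * sigma_scale, 1.0)      # sigma along x
    sy = max((y1 - y0) * sigma_scale, 1.0)      # sigma along y

    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1)  # (h, 1)
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(0)  # (1, w)
    # 2D Gaussian: 1.0 at (cx, cy), decaying toward the element edges,
    # effectively zero over the background.
    return torch.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                       + ((ys - cy) ** 2) / (2 * sy ** 2)))

# Example: a 40x28-pixel button with its top-left corner at (100, 60)
heat = gaussian_target(224, 224, (100, 60, 140, 88))
click_y, click_x = divmod(int(heat.argmax()), heat.shape[1])
print(click_x, click_y)  # -> 120 74, the button center
```

Training against such a target (for example with an MSE or KL loss over the predicted heatmap) simultaneously penalizes mass on the background and rewards mass at the center; the paper's exact loss and heatmap construction may differ.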

### Intended Use

* **Primary Use Case:** This model is primarily intended for GUI automation agents, providing them with precise visual localization. Given a screenshot and an instruction (e.g., "click the 'login' button"), it outputs the most suitable interaction coordinates for the target element.
* **Target Applications:**
  * Automated software testing
  * Robotic Process Automation (RPA)
  * Assistive technology (e.g., UI interaction tools for people with disabilities)
  * UI/UX design analysis

### How to Use

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the model and processor from the Hugging Face Hub.
# V2P is built on Qwen2.5-VL, so it uses the Qwen2.5-VL model class.
model_id = "inclusionAI/V2P-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use torch.bfloat16 if your hardware supports it
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare the inputs: an image and a text prompt.
# The sample image below is a placeholder from the transformers docs;
# replace it with a screenshot of the UI you want to ground.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt_text = "Find the location of the 'Settings' button on this screen."

# Format the prompt with the model's chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # the image itself is passed to the processor below
            {"type": "text", "text": prompt_text},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate the response (greedy decoding, for deterministic coordinates).
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
)

# Decode only the newly generated tokens, excluding the prompt.
input_token_len = inputs["input_ids"].shape[1]
output_ids = generated_ids[0][input_token_len:]
output_text = processor.decode(output_ids, skip_special_tokens=True)

print(output_text)
# For visualization code, see the V2P GitHub repository.
```
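
As a hypothetical post-processing step, the snippet below extracts a click point from the decoded text. It assumes the model answers with pixel coordinates in a form like `(x, y)`; the authoritative output format is defined in the V2P GitHub repository, so adjust the pattern accordingly.

```python
import re

def parse_click_point(text: str):
    """Extract the first '(x, y)' coordinate pair from the model output.

    Assumes the model emits pixel coordinates such as '(512, 384)';
    check the V2P repository for the authoritative output format.
    """
    match = re.search(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)", text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Continuing from the snippet above:
print(parse_click_point(output_text))
# e.g. (512.0, 384.0), ready to pass to a click/automation backend
```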