---
license: apache-2.0
tags:
- computer-vision
- graphical-user-interface
- ui-automation
- gui-grounding
datasets:
- ScreenSpot-v2
- ScreenSpot-Pro
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

## Model Card for V2P: Valley-to-Peak GUI Grounding Model

### Model Details

* **Model Name:** V2P (Valley-to-Peak)
* **Version:** 1.0
* **Model Type:** GUI Grounding / UI Element Localization
* **Developers:** Jikai Chen, Long Chen, Dong Wang, Zhixuan Chu, Qinglin Su, Leilei Gan, Chenyi Zhuang, Jinjie Gu
* **Paper:** [V2P: From Background Suppression to Center Peaking for Robust GUI Grounding Task](https://arxiv.org/abs/2508.13634)
* **Repository:** [GitHub](https://github.com/inclusionAI/AgenticLearning/tree/main/V2P)

### Model Description

**V2P (Valley-to-Peak)** is a model for robust, precise localization (grounding) of Graphical User Interface (GUI) elements. For GUI automation agents, accurately identifying interactive elements on screen is critical, yet traditional methods such as bounding-box regression or center-point prediction often overlook the spatial uncertainty of interaction and the hierarchical visual-semantic relationships of an interface, leading to insufficient localization accuracy.

V2P addresses two major pain points of existing methods:
1. **Attention drift due to background interference:** the model's attention mistakenly disperses to irrelevant background areas.
2. **Imprecise click locations:** the model fails to distinguish the center of a target element from its edges, leading to interaction failures.

Inspired by how humans visually process and interact with GUIs, V2P introduces two core mechanisms (a minimal sketch of the heatmap construction follows this list):

* **"Valley" Background Suppression:** V2P employs a suppressive attention mechanism that actively minimizes the model's focus on irrelevant background regions. Pushing down the weight of the background (forming a valley) in turn highlights the target area (forming a peak), addressing attention drift at its source.

* **"Peak" Center Focusing:** Drawing on the classic Fitts' Law, V2P models the GUI interaction process as a 2D Gaussian heatmap. The model is trained to predict a weight distribution that peaks at the center of the target element and decays gradually toward its edges. This teaches the model to focus on the most suitable central area for clicking, rather than the entire ambiguous region, substantially improving click precision.
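To make the valley-to-peak idea concrete, here is a minimal, hypothetical sketch of how such a supervision target could be built: the target is near zero over the background (the valley) and rises to 1 at the element's center (the peak), decaying toward the edges. The grid size, the sigma heuristic tying the Gaussian width to the element size, and the function name are illustrative assumptions, not the authors' released training code.

```python
import torch

def gaussian_target(h, w, box, sigma_scale=0.25):
    """Build a 'valley-to-peak' target: ~0 over the background (the valley)
    and a 2D Gaussian peaking at the element center (the peak).

    box = (x0, y0, x1, y1) in pixel coordinates. sigma_scale is an
    illustrative heuristic tying the Gaussian width to the element size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0   # element center
    sx = max((x1 - x0) * sigma_scale, 1.0)      # sigma along x
    sy = max((y1 - y0) * sigma_scale, 1.0)      # sigma along y

    ys = torch.arange(h, dtype=torch.float32).unsqueeze(1)  # (h, 1)
    xs = torch.arange(w, dtype=torch.float32).unsqueeze(0)  # (1, w)
    # 2D Gaussian: 1.0 at (cx, cy), decaying toward the element edges,
    # effectively zero over the background.
    return torch.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                       + ((ys - cy) ** 2) / (2 * sy ** 2)))

# Example: a 40x28-pixel button with its top-left corner at (100, 60)
heat = gaussian_target(224, 224, (100, 60, 140, 88))
click_y, click_x = divmod(int(heat.argmax()), heat.shape[1])
print(click_x, click_y)  # -> 120 74, the button center
```

Training against such a target (for example with an MSE or KL loss over the predicted heatmap) simultaneously penalizes mass on the background and rewards mass at the center; the paper's exact loss and heatmap construction may differ.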

### Intended Use

* **Primary Use Case:** This model is primarily intended for GUI automation agents, providing them with precise visual localization. Given a screenshot and an instruction (e.g., "click the 'login' button"), it outputs the most suitable interaction coordinates for the target element.
* **Target Applications:**
  * Automated software testing
  * Robotic Process Automation (RPA)
  * Assistive technology (e.g., UI interaction tools for people with disabilities)
  * UI/UX design analysis

### How to Use

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the model and processor from the Hugging Face Hub.
# V2P is built on Qwen2.5-VL, so it uses the Qwen2.5-VL model class.
model_id = "inclusionAI/V2P-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # use torch.bfloat16 if your hardware supports it
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare the inputs: an image and a text prompt.
# The sample image below is a placeholder from the transformers docs;
# replace it with a screenshot of the UI you want to ground.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bee.JPG"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt_text = "Find the location of the 'Settings' button on this screen."

# Format the prompt with the model's chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},  # the image itself is passed to the processor below
            {"type": "text", "text": prompt_text},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate the response (greedy decoding, for deterministic coordinates).
generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,
)

# Decode only the newly generated tokens, excluding the prompt.
input_token_len = inputs["input_ids"].shape[1]
output_ids = generated_ids[0][input_token_len:]
output_text = processor.decode(output_ids, skip_special_tokens=True)

print(output_text)
# For visualization code, see the V2P GitHub repository.
```
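
As a hypothetical post-processing step, the snippet below extracts a click point from the decoded text. It assumes the model answers with pixel coordinates in a form like `(x, y)`; the authoritative output format is defined in the V2P GitHub repository, so adjust the pattern accordingly.

```python
import re

def parse_click_point(text: str):
    """Extract the first '(x, y)' coordinate pair from the model output.

    Assumes the model emits pixel coordinates such as '(512, 384)';
    check the V2P repository for the authoritative output format.
    """
    match = re.search(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)", text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Continuing from the snippet above:
print(parse_click_point(output_text))
# e.g. (512.0, 384.0), ready to pass to a click/automation backend
```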