feiziaarash committed on
Commit
c49e246
·
1 Parent(s): 1b032ad

fix readme

  1. README.md +213 -54
README.md CHANGED
@@ -1,34 +1,108 @@
1
  ---
2
  base_model:
3
  - Qwen/Qwen2.5-VL-7B-Instruct
 
 
4
  pipeline_tag: image-text-to-text
5
- metrics:
6
- - accuracy
7
  tags:
8
  - agent
 
 
 
 
 
9
  ---
10
 
11
- 🚀**Inference**
12
 
13
- Inference follows the same procedure as Qwen2.5-VL.
14
- At runtime, you must:
 
15
 
16
- 1. Prepend the system prompt above to your conversation.
17
 
18
- 2. Replace {width} and {height} with the true screenshot dimensions.
19
 
20
- 3. Parse <tool_call> tags in the model’s output to extract JSON tool calls.
 
 
 
 
21
 
22
- ```python
23
- import torch
24
 
25
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
26
- from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize
27
 
28
  from PIL import Image
29
 
30
- TEMP = 0.0
31
- GroundNext_GROUNDER_SYS_PROMPT = """You are a helpful assistant.
32
 
33
  # Tools
34
 
@@ -46,78 +120,163 @@ For each function call, return a json object with function name and arguments wi
46
 
47
  model_name = "ServiceNow/GroundNext-7B-V0"
48
 
49
- model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
50
- model_name,
51
- torch_dtype=torch.bfloat16,
52
- attn_implementation="flash_attention_2",
53
- device_map="auto",
54
- trust_remote_code=True
55
- ).eval()
 
56
 
57
  processor = AutoProcessor.from_pretrained(model_name)
58
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
59
 
60
-
61
- model.generation_config.temperature = TEMP
62
- model.generation_config.do_sample = False if TEMP == 0.0 else True
63
  model.generation_config.use_cache = True
64
 
 
65
  image_path = "./screenshot.png"
66
- instruction = "Click on the 'Save' icon"
67
-
68
-
69
- # inference
70
  image = Image.open(image_path).convert('RGB')
71
  width, height = image.size
 
 
72
  resized_height, resized_width = smart_resize(
73
  height,
74
  width,
75
  min_pixels=78_400,
76
  max_pixels=6_000_000,
77
  )
78
-
79
  image = image.resize((resized_width, resized_height))
80
 
81
- img_width, img_height = resized_width, resized_height
82
-
83
- full_prompt = f'{instruction}'
84
-
85
  messages = [
86
  {
87
- "role": "system",
88
- "content": GroundNext_GROUNDER_SYS_PROMPT.format(img_width=img_width, img_height=img_height)
89
  },
90
  {
91
  "role": "user",
92
  "content": [
93
- {
94
- "type": "image",
95
- "image": image,
96
- },
97
- {"type": "text", "text": full_prompt},
98
  ],
99
  }
100
  ]
101
 
102
- input_text = tokenizer.apply_chat_template(messages,
103
- add_generation_prompt=True,
104
- tokenize=False)
105
- inputs = processor(
106
- text=[input_text],
107
- images=[image],
108
- videos=None,
109
- padding=True,
110
- return_tensors="pt",
111
- ).to(model.device)
112
 
113
- generated_ids = model.generate(**inputs, max_new_tokens=64)
 
 
 
 
 
 
114
 
 
 
115
  generated_ids_trimmed = [
116
- out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
117
  ]
 
118
  response = processor.batch_decode(
119
- generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 
 
120
  )[0]
121
 
122
  print(response)
123
- ```
1
  ---
2
  base_model:
3
  - Qwen/Qwen2.5-VL-7B-Instruct
4
+ library_name: transformers
5
+ license: apache-2.0
6
  pipeline_tag: image-text-to-text
 
 
7
  tags:
8
  - agent
9
+ - computer-use
10
+ - gui-grounding
11
+ - vision-language
12
+ metrics:
13
+ - accuracy
14
  ---
15
 
16
+ # GroundNext-7B-V0
17
 
18
+ <p align="center">
19
+ &nbsp;&nbsp;🌐 <a href="https://groundcua.github.io">Website</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>&nbsp;&nbsp;
20
+ </p>
21
 
22
+ ## Highlights
23
 
24
+ **GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:
25
 
26
+ - **Superior grounding accuracy** achieving 48.9% on ScreenSpot-Pro, 55.6% on OSWorld-G, and 31.3% on UI-Vision benchmarks
27
+ - **Exceptional cross-platform generalization** with 83.7% accuracy on MMBench-GUI and 92.8% on ScreenSpot-v2 despite desktop-only training
28
+ - **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
29
+ - **Strong agentic capabilities**: the GroundNext-3B variant paired with OpenAI o3 reaches a 50.6% overall success rate on OSWorld
30
+ - **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions
31
 
32
+ ![Performance Comparison](https://via.placeholder.com/800x400?text=GroundNext+Performance+Visualization)
33
+
34
+ ## Model Overview
35
+
36
+ **GroundNext-7B-V0** has the following characteristics:
37
+ - **Type**: Vision-Language Model for GUI Grounding
38
+ - **Base Model**: Qwen2.5-VL-7B-Instruct
39
+ - **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
40
+ - **Number of Parameters**: 7.0B
41
+ - **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset
42
+ - **Context Length**: 128,000 tokens (inherited from base model)
43
+ - **Specialization**: Desktop GUI element grounding with cross-platform generalization
44
+
45
+ For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).
46
+
47
+ ## Performance
48
+
49
+ ### Desktop Grounding Benchmarks
50
+
51
+ | | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
52
+ |--- | --- | --- | --- |
53
+ | **ScreenSpot-Pro** | 27.6 | 38.1 | **48.9** |
54
+ | **OSWorld-G** | 31.4 | 57.1 | **55.6** |
55
+ | **UI-Vision** | 0.85 | 25.5 | **31.3** |
56
+ | **Avg (Desktop)** | 19.9 | 40.2 | **45.3** |
57
+
58
+ ### Cross-Platform Generalization (Mobile & Web)
59
+
60
+ | | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
61
+ |--- | --- | --- | --- |
62
+ | **MMBench-GUI** | 72.3 | 78.5 | **83.7** |
63
+ | **ScreenSpot-v2** | 88.8 | 90.3 | **92.8** |
64
+ | **Avg (Mobile/Web)** | 80.6 | 84.4 | **88.3** |
65
+
66
+ ### Agentic Performance on OSWorld
67
+
68
+ When combined with OpenAI o3 for reasoning, GroundNext demonstrates strong end-to-end computer use capabilities. The table below reports the GroundNext-3B variant:
69
 
70
+ | Model | OS | Office | Daily | Pro | Workflow | Overall |
71
+ |--- | --- | --- | --- | --- | --- | --- |
72
+ | OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
73
+ | CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
74
+ | OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
75
+ | UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
76
+ | JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
77
+ | **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |
78
 
79
+ *Note: GroundNext-7B-V0 results with o3 integration forthcoming.*
80
+
81
+ ## Quickstart
82
+
83
+ The code of GroundNext-7B-V0 is compatible with the latest Hugging Face `transformers` library and follows the Qwen2.5-VL implementation.
84
+
85
+ Qwen2.5-VL support was added in `transformers` 4.49.0, so earlier versions cannot load this model. We recommend `transformers>=4.49.0`.
86
+
87
+ ### Installation
88
+
89
+ ```bash
90
+ pip install "transformers>=4.49.0" torch torchvision accelerate
91
+ pip install qwen-vl-utils # For image processing utilities
92
+ ```
93
+
94
+ ### Basic Inference
95
+
96
+ The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:
97
+
98
+ ```python
99
+ import torch
100
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
101
+ from qwen_vl_utils.vision_process import smart_resize
102
  from PIL import Image
103
 
104
+ # System prompt for computer use grounding
105
+ GROUNDNEXT_SYSTEM_PROMPT = """You are a helpful assistant.
106
 
107
  # Tools
108
 
 
120
 
121
  model_name = "ServiceNow/GroundNext-7B-V0"
122
 
123
+ # Load model and processor (Qwen2.5-VL architecture, matching the base model)
124
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
125
+ model_name,
126
+ torch_dtype=torch.bfloat16,
127
+ attn_implementation="flash_attention_2",
128
+ device_map="auto",
129
+ trust_remote_code=True
130
+ ).eval()
131
 
132
  processor = AutoProcessor.from_pretrained(model_name)
133
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
134
 
135
+ # Configure generation
136
+ model.generation_config.temperature = 0.0
137
+ model.generation_config.do_sample = False
138
  model.generation_config.use_cache = True
139
 
140
+ # Load and prepare image
141
  image_path = "./screenshot.png"
 
 
 
 
142
  image = Image.open(image_path).convert('RGB')
143
  width, height = image.size
144
+
145
+ # Resize image using smart_resize
146
  resized_height, resized_width = smart_resize(
147
  height,
148
  width,
149
  min_pixels=78_400,
150
  max_pixels=6_000_000,
151
  )
 
152
  image = image.resize((resized_width, resized_height))
153
 
154
+ # Create messages
155
+ instruction = "Click on the 'Save' icon"
 
 
156
  messages = [
157
  {
158
+ "role": "system",
159
+ "content": GROUNDNEXT_SYSTEM_PROMPT.format(width=resized_width, height=resized_height)
160
  },
161
  {
162
  "role": "user",
163
  "content": [
164
+ {"type": "image", "image": image},
165
+ {"type": "text", "text": instruction},
 
 
 
166
  ],
167
  }
168
  ]
169
 
170
+ # Prepare inputs
171
+ input_text = tokenizer.apply_chat_template(
172
+ messages,
173
+ add_generation_prompt=True,
174
+ tokenize=False
175
+ )
 
 
 
 
176
 
177
+ inputs = processor(
178
+ text=[input_text],
179
+ images=[image],
180
+ videos=None,
181
+ padding=True,
182
+ return_tensors="pt",
183
+ ).to(model.device)
184
 
185
+ # Generate response
186
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
187
  generated_ids_trimmed = [
188
+ out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
189
  ]
190
+
191
  response = processor.batch_decode(
192
+ generated_ids_trimmed,
193
+ skip_special_tokens=True,
194
+ clean_up_tokenization_spaces=False
195
  )[0]
196
 
197
  print(response)
198
+ # Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
199
+ ```
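The snippet above formats the system prompt with the resized dimensions, so the predicted coordinates presumably live in the resized image's pixel space. The sketch below maps such a point back onto the original screenshot. The helper name `to_original_coords` is hypothetical and not part of the model's API, and the assumption that outputs are absolute pixels in the resized image should be checked against your own results.

```python
def to_original_coords(x, y, resized_size, original_size):
    """Scale a point from the resized image back to the original screenshot.

    Assumption: (x, y) are absolute pixel coordinates in the resized image.
    Both sizes are (width, height) tuples.
    """
    resized_w, resized_h = resized_size
    orig_w, orig_h = original_size
    # Scale each axis independently; smart_resize may change the aspect ratio slightly.
    return (x * orig_w / resized_w, y * orig_h / resized_h)


# Example: a click at (100, 50) on a 1000x500 resized image,
# mapped onto the 2000x1000 original screenshot.
print(to_original_coords(100, 50, (1000, 500), (2000, 1000)))
```

With the variables from the inference snippet, the call would be `to_original_coords(x, y, (resized_width, resized_height), (width, height))`.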
200
+
201
+ ### Deployment with vLLM
202
+
203
+ For production deployment, you can use vLLM to create OpenAI-compatible API endpoints:
204
+
205
+ **vLLM**:
206
+ ```bash
207
+ vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
208
+ ```
209
+
210
+ **Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
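A minimal sketch of a request body for the OpenAI-compatible chat endpoint that `vllm serve` exposes. The helper name `build_chat_payload` is hypothetical. It assumes the screenshot is already resized and PNG-encoded, and that `system_prompt` has already been formatted with the matching width and height.

```python
import base64


def build_chat_payload(image_bytes, instruction, system_prompt,
                       model="ServiceNow/GroundNext-7B-V0"):
    """Build a chat-completions request body with an inline PNG screenshot."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,   # deterministic grounding, as recommended in this README
        "max_tokens": 128,    # enough for a single tool call
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    # The image is passed as a base64 data URL.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                    {"type": "text", "text": instruction},
                ],
            },
        ],
    }


payload = build_chat_payload(b"abc", "Click on the 'Save' icon", "You are a helpful assistant.")
print(payload["messages"][1]["content"][1]["text"])
```

The payload can be sent as JSON via POST to `/v1/chat/completions` on the server (port 8000 by default). Server-side image preprocessing may differ from the `smart_resize` call in the local snippet, so it is worth verifying that returned coordinates line up with the image you sent.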
211
+
212
+ ## Best Practices
213
+
214
+ To achieve optimal grounding performance, we recommend:
215
+
216
+ 1. **Image Preprocessing**:
217
+ - Use high-resolution screenshots (minimum 800x600)
218
+ - Ensure UI elements are clearly visible
219
+ - Maintain original aspect ratios when resizing
220
+
221
+ 2. **Prompt Engineering**:
222
+ - Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner")
223
+ - Include element attributes when available (color, position, text)
224
+ - Use consistent terminology matching the UI
225
+
226
+ 3. **Generation Parameters**:
227
+ - Use `temperature=0.0` for deterministic grounding
228
+ - Set `max_new_tokens=128` (sufficient for tool calls)
229
+ - Enable `use_cache=True` for faster inference
230
+
231
+ 4. **System Prompt**:
232
+ - Always include the system prompt with actual screen dimensions
233
+ - Replace `{width}` and `{height}` with true screenshot dimensions
234
+ - Maintain the tool signature format for proper JSON parsing
235
+
236
+ 5. **Post-processing**:
237
+ - Parse `<tool_call>` tags to extract JSON
238
+ - Validate coordinates are within screen bounds
239
+ - Handle cases where the model describes the element instead of providing coordinates
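The post-processing steps above can be sketched as follows. The function names are hypothetical, and the JSON layout is assumed to match the expected output shown in the inference example (`{"name": ..., "arguments": {"action": ..., "coordinate": [x, y]}}`).

```python
import json
import re

# Matches everything between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(response):
    """Return every well-formed JSON object found inside <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            call = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than fail the whole response
        if isinstance(call, dict):
            calls.append(call)
    return calls


def extract_click(response, width, height):
    """Return the first in-bounds (x, y) coordinate, or None if there is none."""
    for call in parse_tool_calls(response):
        args = call.get("arguments")
        coord = args.get("coordinate") if isinstance(args, dict) else None
        if (isinstance(coord, list) and len(coord) == 2
                and all(isinstance(v, (int, float)) for v in coord)):
            x, y = coord
            if 0 <= x < width and 0 <= y < height:
                return (x, y)
    # No tool call, or the model described the element in prose instead.
    return None


resp = ('<tool_call>{"name": "computer_use", "arguments": '
        '{"action": "left_click", "coordinate": [10, 20]}}</tool_call>')
print(extract_click(resp, 100, 100))
```

Returning `None` lets the caller decide whether to retry with a more specific instruction or fall back to another strategy.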
240
+
241
+ ## Training
242
+
243
+ GroundNext-7B-V0 was trained using a two-stage approach:
244
+
245
+ 1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
246
+ 2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards
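For readers unfamiliar with RLOO (REINFORCE Leave-One-Out), the sketch below shows the generic leave-one-out baseline: each sampled response is scored against the mean reward of the other samples for the same prompt. This illustrates the general technique only. The reward function, group size, and other training details used for GroundNext are not given here, and the example rewards are made up.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled responses to one prompt.

    advantage_i = reward_i - mean(rewards of the other k - 1 samples)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards: one of three sampled clicks hit the target element.
print(rloo_advantages([1.0, 0.0, 0.0]))
```

Because the baseline for each sample excludes that sample's own reward, the advantage estimate stays unbiased without training a separate value model.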
247
+
248
+ For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA).
249
+
250
+ ## Limitations and Future Work
251
+
252
+ - **Desktop-focused**: Primarily trained on desktop environments (though shows strong cross-platform generalization)
253
+ - **Action space**: Currently supports mouse and keyboard actions; additional modalities under exploration
254
+ - **Languages**: Optimized for English UI elements; multilingual support in development
255
+ - **Resolution**: Performance may vary with extremely high or low resolution images
256
+
257
+ ## Citation
258
+
259
+ If you use GroundNext-7B-V0 in your research, please cite:
260
+
261
+ ```bibtex
262
+ @misc{feizi2025groundingcomputeruseagents,
263
+ title={Grounding Computer Use Agents on Human Demonstrations},
264
+ author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
265
+ year={2025},
266
+ eprint={2511.07332},
267
+ archivePrefix={arXiv},
268
+ primaryClass={cs.LG},
269
+ url={https://arxiv.org/abs/2511.07332},
270
+ }
271
+ ```
272
+
273
+ ## License
274
+
275
+ This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/blob/main/LICENSE) for details.
276
+
277
+ ## Acknowledgements
278
+
279
+ We thank:
280
+ - The Qwen team for the excellent Qwen2.5-VL foundation models
281
+ - The open-source community for tools and frameworks that made this work possible
282
+ - Human annotators who contributed to the GroundCUA dataset