Commit c49e246
Parent(s): 1b032ad

fix readme

README.md CHANGED
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- agent
- computer-use
- gui-grounding
- vision-language
metrics:
- accuracy
---

# GroundNext-7B-V0

<p align="center">
  🌐 <a href="https://groundcua.github.io">Website</a>   |   📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>   |   🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>   |   🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>  
</p>

## Highlights

**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:

- **Superior grounding accuracy** achieving 48.9% on ScreenSpot-Pro, 55.6% on OSWorld-G, and 31.3% on UI-Vision benchmarks
- **Exceptional cross-platform generalization** with 83.7% accuracy on MMBench-GUI and 92.8% on ScreenSpot-v2 despite desktop-only training
- **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
- **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models
- **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions



## Model Overview

**GroundNext-7B-V0** has the following characteristics:
- **Type**: Vision-Language Model for GUI Grounding
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- **Number of Parameters**: 7.0B
- **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset
- **Context Length**: 262,144 tokens (inherited from base model)
- **Specialization**: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).

## Performance

### Desktop Grounding Benchmarks

| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
|--- | --- | --- | --- |
| **ScreenSpot-Pro** | 27.6 | 38.1 | **48.9** |
| **OSWorld-G** | 31.4 | 57.1 | **55.6** |
| **UI-Vision** | 0.85 | 25.5 | **31.3** |
| **Avg (Desktop)** | 19.9 | 40.2 | **45.3** |

### Cross-Platform Generalization (Mobile & Web)

| | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
|--- | --- | --- | --- |
| **MMBench-GUI** | 72.3 | 78.5 | **83.7** |
| **ScreenSpot-v2** | 88.8 | 90.3 | **92.8** |
| **Avg (Mobile/Web)** | 80.6 | 84.4 | **88.3** |

### Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, **GroundNext-7B-V0** demonstrates strong end-to-end computer use capabilities:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|--- | --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |

*Note: GroundNext-7B-V0 results with o3 integration are forthcoming.*

## Quickstart

The code for GroundNext-7B-V0 is compatible with the latest Hugging Face `transformers` library and follows the Qwen2.5-VL implementation.

With `transformers<4.37.0` you may encounter compatibility issues; we recommend `transformers>=4.37.0`.

### Installation

```bash
pip install "transformers>=4.37.0" torch torchvision accelerate
pip install qwen-vl-utils  # For image processing utilities
```

### Basic Inference

The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils.vision_process import smart_resize
from PIL import Image

# System prompt for computer use grounding
GROUNDNEXT_SYSTEM_PROMPT = """You are a helpful assistant.

# Tools

... (tool definitions collapsed in this diff) ...
"""

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = 0.0
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
image_path = "./screenshot.png"
image = Image.open(image_path).convert('RGB')
width, height = image.size

# Resize image using smart_resize
resized_height, resized_width = smart_resize(
    height,
    width,
    min_pixels=78_400,
    max_pixels=6_000_000,
)
image = image.resize((resized_width, resized_height))

# Create messages
instruction = "Click on the 'Save' icon"
messages = [
    {
        "role": "system",
        "content": GROUNDNEXT_SYSTEM_PROMPT.format(width=resized_width, height=resized_height),
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ],
    },
]

# Prepare inputs
input_text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)

inputs = processor(
    text=[input_text],
    images=[image],
    videos=None,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
```
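
Because the system prompt is formatted with the resized dimensions, the predicted coordinates are most naturally interpreted in the resized image's coordinate space. If you need to act on the original screenshot, you can map them back. The following is a minimal sketch of our own (reusing `response`, `width`, `height`, `resized_width`, and `resized_height` from the snippet above, and assuming the tool-call format shown in the expected output):

```python
import json
import re

# Pull the JSON payload out of the <tool_call> ... </tool_call> tags
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
if match:
    tool_call = json.loads(match.group(1))
    x_resized, y_resized = tool_call["arguments"]["coordinate"]

    # Scale from resized-image coordinates back to the original screenshot
    x = round(x_resized * width / resized_width)
    y = round(y_resized * height / resized_height)
    print(f"Click target on the original screenshot: ({x}, {y})")
```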

### Deployment with vLLM

For production deployment, you can use vLLM to create an OpenAI-compatible API endpoint:

```bash
vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192
```

**Note**: Adjust `--max-model-len` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
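
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch using the `openai` Python package; the local URL, the `api_key="EMPTY"` placeholder, and the base64 image upload are assumptions about a default vLLM setup, and for best results the messages should mirror the system prompt and format from the Basic Inference example:

```python
import base64
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                {"type": "text", "text": "Click on the 'Save' icon"},
            ],
        },
    ],
    temperature=0.0,
    max_tokens=128,
)

print(completion.choices[0].message.content)
```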

## Best Practices

To achieve optimal grounding performance, we recommend:

1. **Image Preprocessing**:
   - Use high-resolution screenshots (minimum 800x600)
   - Ensure UI elements are clearly visible
   - Maintain the original aspect ratio when resizing

2. **Prompt Engineering**:
   - Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner")
   - Include element attributes when available (color, position, text)
   - Use consistent terminology matching the UI

3. **Generation Parameters**:
   - Use `temperature=0.0` for deterministic grounding
   - Set `max_new_tokens=128` (sufficient for tool calls)
   - Enable `use_cache=True` for faster inference

4. **System Prompt**:
   - Always include the system prompt with the actual screen dimensions
   - Replace `{width}` and `{height}` with the true screenshot dimensions
   - Maintain the tool signature format for proper JSON parsing

5. **Post-processing** (see the sketch after this list):
   - Parse the `<tool_call>` tags to extract the JSON payload
   - Validate that coordinates fall within screen bounds
   - Handle cases where the model describes the element instead of returning coordinates
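
A minimal post-processing helper along these lines might look as follows; `parse_tool_call` is a hypothetical function of our own, not part of the released code, and assumes the `<tool_call>` output format shown in the Quickstart:

```python
import json
import re
from typing import Optional


def parse_tool_call(response: str, screen_width: int, screen_height: int) -> Optional[dict]:
    """Extract a computer_use tool call and clamp its coordinates to the screen.

    Returns None when no parsable <tool_call> block is present, e.g. when the
    model describes the element in prose instead of returning coordinates.
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

    coordinate = call.get("arguments", {}).get("coordinate")
    if coordinate is not None:
        x, y = coordinate
        # Clamp the predicted click to valid screen bounds
        call["arguments"]["coordinate"] = [
            min(max(int(x), 0), screen_width - 1),
            min(max(int(y), 0), screen_height - 1),
        ]
    return call
```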

## Training

GroundNext-7B-V0 was trained using a two-stage approach:

1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards (a toy illustration of such a reward follows below)

For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA).
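
As a purely illustrative sketch (our own simplification, not the reward actually used in training; see the paper for the real design), a GUI grounding reward can be as simple as checking whether the predicted click lands inside the target element's bounding box:

```python
def grounding_reward(predicted_xy: tuple[int, int], target_bbox: tuple[int, int, int, int]) -> float:
    """Toy binary reward: 1.0 if the predicted click falls inside the target box.

    predicted_xy -- (x, y) predicted by the model
    target_bbox  -- (x_min, y_min, x_max, y_max) of the ground-truth element
    """
    x, y = predicted_xy
    x_min, y_min, x_max, y_max = target_bbox
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


# Example: a click at (105, 42) inside a button spanning (90, 30)-(140, 60) earns reward 1.0
print(grounding_reward((105, 42), (90, 30, 140, 60)))
```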

## Limitations and Future Work

- **Desktop-focused**: Primarily trained on desktop environments (though it shows strong cross-platform generalization)
- **Action space**: Currently supports mouse and keyboard actions; additional modalities are under exploration
- **Languages**: Optimized for English UI elements; multilingual support is in development
- **Resolution**: Performance may vary with extremely high- or low-resolution images

## Citation

If you use GroundNext-7B-V0 in your research, please cite:

```bibtex
@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations},
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332},
}
```

## License

This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/blob/main/LICENSE) for details.

## Acknowledgements

We thank:
- The Qwen team for the excellent Qwen2.5-VL foundation models
- The open-source community for the tools and frameworks that made this work possible
- The human annotators who contributed to the GroundCUA dataset