| | --- |
| | base_model: |
| | - Qwen/Qwen2.5-VL-7B-Instruct |
| | library_name: transformers |
| | license: apache-2.0 |
| | pipeline_tag: image-text-to-text |
| | tags: |
| | - agent |
| | - computer-use |
| | - gui-grounding |
| | - vision-language |
| | metrics: |
| | - accuracy |
| | --- |
| | |
| | # GroundNext-7B-V0 |
| |
|
| | <p align="center"> |
| |   🌐 <a href="https://groundcua.github.io">Website</a>   |   📑 <a href="https://arxiv.org/abs/2511.07332">Paper</a>   |   🤗 <a href="https://huggingface.co/datasets/ServiceNow/GroundCUA">Dataset</a>   |   🤖 <a href="https://huggingface.co/ServiceNow/GroundNext-7B-V0">Model</a>   |
| | </p> |
| |
|
| | ## Highlights |
| |
|
| | **GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features: |
| |
|
| | - **Superior grounding accuracy** achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks |
| | - **Exceptional cross-platform generalization** with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training |
| | - **Data-efficient training** achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work |
| | - **Strong agentic capabilities** reaching 50.6% overall success rate on OSWorld when paired with reasoning models |
| | - **Native tool-calling support** with built-in computer use action space for mouse, keyboard, and screen interactions |
| |
|
| | ## Model Overview |
| |
|
| | **GroundNext-7B-V0** has the following characteristics: |
| | - **Type**: Vision-Language Model for GUI Grounding |
| | - **Base Model**: Qwen2.5-VL-7B-Instruct |
| | - **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO) |
| | - **Number of Parameters**: 7.0B |
| | - **Training Data**: 700K human-annotated desktop demonstrations from GroundCUA dataset |
| | - **Context Length**: 262,144 tokens (inherited from base model) |
| | - **Specialization**: Desktop GUI element grounding with cross-platform generalization |
| |
|
| | For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io). |
| |
|
| | ## Performance |
| |
|
| | ### Desktop Grounding Benchmarks |
| |
|
| | | | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** | |
| | | ------------------ | ------------- | ----------- | ----------------- | |
| | | **ScreenSpot-Pro** | 29.7 | 38.1 | **52.9** | |
| | | **OSWorld-G** | 42.7 | 57.1 | **67.7** | |
| | | **UI-Vision** | 16.5 | 25.5 | **60.3** | |
| | | **Avg (Desktop)** | 29.6 | 40.2 | **60.3** | |
| |
|
| | ### Cross-Platform Generalization (Desktop, Mobile & Web) |
| |
|
| | | | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** | |
| | | -------------------- | ------------- | ----------- | ----------------- | |
| | | **MMBench-GUI** | 33.9 | 74.3 | **81.1** | |
| | | **ScreenSpot-v2** | 88.8 | 90.3 | **90.4** | |
| | | **Avg (Mobile/Web)** | 61.4 | 82.3 | **85.8** | |
| |
|
| |
|
| | ### Agentic Performance on OSWorld |
| |
|
| | When combined with OpenAI o3 for reasoning, **GroundNext-7B-V0** demonstrates strong end-to-end computer use capabilities: |
| |
|
| | | Model | OS | Office | Daily | Pro | Workflow | Overall | |
| | |--- | --- | --- | --- | --- | --- | --- | |
| | | OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 | |
| | | CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 | |
| | | OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 | |
| | | UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 | |
| | | JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** | |
| | | **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 | |
| |
|
| | *Note: GroundNext-7B-V0 results with o3 integration forthcoming.* |
| |
|
| | ## Quickstart |
| |
|
| | The code of GroundNext-7B-V0 is compatible with the latest Hugging Face `transformers` library and follows the Qwen2.5-VL implementation. |
| |
|
| | With `transformers<4.37.0`, you may encounter compatibility issues. We recommend using `transformers>=4.37.0`. |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | pip install transformers>=4.37.0 torch torchvision accelerate |
| | pip install qwen-vl-utils # For image processing utilities |
| | ``` |
| |
|
| | ### Basic Inference |
| |
|
| | The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding: |
| |
|
| | ```python |
| | import torch |
| | from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor |
| | from PIL import Image |
| | import groundcua_utils as groundcua |
| | import io |
| | from urllib.request import urlopen |
| | |
| | model_name = "ServiceNow/GroundNext-7B-V0" |
| | |
| | # Load model and processor |
| | model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
| | model_name, |
| | torch_dtype=torch.bfloat16, |
| | attn_implementation="flash_attention_2", |
| | device_map="auto", |
| | trust_remote_code=True |
| | ).eval() |
| | |
| | processor = AutoProcessor.from_pretrained(model_name) |
| | tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
| | |
| | # Configure generation |
| | model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE |
| | model.generation_config.do_sample = False |
| | model.generation_config.use_cache = True |
| | |
| | # Load and prepare image |
| | url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png" |
| | image = Image.open(io.BytesIO(urlopen(url).read())) |
| | image, (width, height) = groundcua.prepare_image(image) |
| | |
| | # Create messages and generate |
| | instruction = "Click on the 'File' button" |
| | messages = groundcua.create_messages(instruction, image, width, height) |
| | |
| | input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) |
| | inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device) |
| | |
| | generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS) |
| | generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)] |
| | |
| | response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] |
| | print(response) |
| | # Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call> |
| | ``` |
| |
|
| | ### Deployment with vLLM |
| |
|
| | For production deployment, you can use vLLM to create OpenAI-compatible API endpoints: |
| |
|
| | **vLLM**: |
| | ```bash |
| | vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192 |
| | ``` |
| |
|
| | **Note**: Adjust `max-model-len` or `context-length` based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient. |
| |
|
| | ## Best Practices |
| |
|
| | To achieve optimal grounding performance, we recommend: |
| |
|
| | 1. **Image Preprocessing**: |
| | - Use high-resolution screenshots (minimum 800x600) |
| | - Ensure UI elements are clearly visible |
| | - Maintain original aspect ratios when resizing |
| |
|
| | 2. **Prompt Engineering**: |
| | - Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save") |
| | - Include element attributes when available (color, position, text) |
| |
|
| | 3. **Generation Parameters**: |
| | - Use `temperature=0.0` for deterministic grounding |
| | - Set `max_new_tokens=128` (sufficient for tool calls) |
| | - Enable `use_cache=True` for faster inference |
| |
|
| | 4. **System Prompt**: |
| | - Always include the system prompt with actual screen dimensions |
| | - Replace `{width}` and `{height}` with true screenshot dimensions |
| | - Maintain the tool signature format for proper JSON parsing |
| |
|
| | 5. **Post-processing**: |
| | - Parse `<tool_call>` tags to extract JSON |
| | - Validate coordinates are within screen bounds |
| |
|
| | ## Training |
| |
|
| | GroundNext-7B-V0 was trained using a two-stage approach: |
| |
|
| | 1. **Supervised Fine-tuning (SFT)**: Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset |
| | 2. **Reinforcement Learning (RLOO)**: Further optimized using reward-based learning with custom GUI grounding rewards |
| |
|
| | For detailed training instructions, dataset preparation, and reproduction steps, please visit our [GitHub repository](https://github.com/ServiceNow/GroundCUA). |
| |
|
| | ## Limitations and Future Work |
| |
|
| | - **Desktop-focused**: Primarily trained on desktop environments (though shows strong cross-platform generalization) |
| | - **Action space**: Currently supports mouse click action only |
| | - **Languages**: Optimized for English UI elements |
| | - **Resolution**: Performance may vary with extremely high or low resolution images |
| |
|
| | ## Citation |
| |
|
| | If you use GroundNext-7B-V0 in your research, please cite: |
| |
|
| | ```bibtex |
| | @misc{feizi2025groundingcomputeruseagents, |
| | title={Grounding Computer Use Agents on Human Demonstrations}, |
| | author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar}, |
| | year={2025}, |
| | eprint={2511.07332}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.LG}, |
| | url={https://arxiv.org/abs/2511.07332}, |
| | } |
| | ``` |
| |
|
| | ## License |
| |
|
| | This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the [LICENSE](https://choosealicense.com/licenses/apache-2.0/) for details. |
| |
|
| | ## Acknowledgements |
| |
|
| | We thank: |
| | - The Qwen team for the excellent Qwen2.5-VL foundation models |
| | - The open-source community for tools and frameworks that made this work possible |
| | - Human annotators who contributed to the GroundCUA dataset |