---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- agent
- computer-use
- gui-grounding
- vision-language
metrics:
- accuracy
---

# GroundNext-7B-V0

## Highlights

**GroundNext-7B-V0** is a state-of-the-art vision-language model for GUI element grounding, developed as part of the **GroundCUA** project. This model features:

- **Superior grounding accuracy**: 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision
- **Exceptional cross-platform generalization**: 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2, despite desktop-only training
- **Data-efficient training**: state-of-the-art results with only 700K training examples, versus 9M+ in prior work
- **Strong agentic capabilities**: the GroundNext family reaches a 50.6% overall success rate on OSWorld when paired with the o3 reasoning model (reported for GroundNext-3B; 7B results with o3 are forthcoming)
- **Native tool-calling support**: built-in computer use action space for mouse, keyboard, and screen interactions

## Model Overview

**GroundNext-7B-V0** has the following characteristics:

- **Type**: Vision-Language Model for GUI Grounding
- **Base Model**: Qwen2.5-VL-7B-Instruct
- **Training Approach**: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
- **Number of Parameters**: 7.0B
- **Training Data**: 700K human-annotated desktop demonstrations from the GroundCUA dataset
- **Context Length**: 262,144 tokens (inherited from base model)
- **Specialization**: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our [paper](https://arxiv.org/abs/2511.07332), [GitHub repository](https://github.com/ServiceNow/GroundCUA), and [project website](https://groundcua.github.io).

## Performance

### Desktop Grounding Benchmarks

|                    | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| ------------------ | ------------- | ----------- | -------------------- |
| **ScreenSpot-Pro** | 29.7          | 38.1        | **52.9**             |
| **OSWorld-G**      | 42.7          | 57.1        | **67.7**             |
| **UI-Vision**      | 16.5          | 25.5        | **60.3**             |
| **Avg (Desktop)**  | 29.6          | 40.2        | **60.3**             |

### Cross-Platform Generalization (Desktop, Mobile & Web)

|                      | Qwen2.5-VL-7B | UI-TARS-72B | **GroundNext-7B-V0** |
| -------------------- | ------------- | ----------- | -------------------- |
| **MMBench-GUI**      | 33.9          | 74.3        | **81.1**             |
| **ScreenSpot-v2**    | 88.8          | 90.3        | **90.4**             |
| **Avg (Mobile/Web)** | 61.4          | 82.3        | **85.8**             |

### Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, the GroundNext models demonstrate strong end-to-end computer use capabilities:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | **61.9** | **75.5** | 35.3 | **51.0** |
| **GroundNext-3B w/ o3** | **62.5** | **47.0** | 55.0 | 73.5 | **36.5** | 50.6 |

*Note: GroundNext-7B-V0 results with o3 integration forthcoming.*

## Quickstart

GroundNext-7B-V0 follows the Qwen2.5-VL implementation in the Hugging Face `transformers` library. The `Qwen2_5_VLForConditionalGeneration` class used below was added in `transformers` 4.49.0, so older releases cannot load the model. We recommend using `transformers>=4.49.0`.

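Before loading the model, it can help to confirm that the installed `transformers` release is recent enough. The snippet below is a minimal sketch using only the standard library. The helper names (`parse_version`, `meets_minimum`, `check_transformers`) are illustrative and not part of `transformers` or `groundcua`. The `4.49.0` floor is an assumption based on the release in which the Qwen2.5-VL classes were added to `transformers`; substitute a different floor if your setup requires one.

```python
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string.

    Pre-release suffixes such as 'rc1' or '.dev0' are ignored, so
    '4.50.0.dev0' becomes (4, 50, 0).
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.49' compares equal to '4.49.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_transformers(minimum: str) -> str:
    """Describe whether the installed transformers package meets `minimum`."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed, minimum):
        return f"transformers {installed} satisfies >= {minimum}"
    return f"transformers {installed} is older than {minimum}; upgrade first"


if __name__ == "__main__":
    # Assumed floor: the release that introduced the Qwen2.5-VL classes.
    print(check_transformers("4.49.0"))
```

Comparing numeric tuples avoids the pitfalls of string comparison (for example, `"4.9.0" > "4.49.0"` as strings) without requiring any extra dependency.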
### Installation

```bash
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "transformers>=4.49.0" torch torchvision accelerate
pip install qwen-vl-utils  # For image processing utilities
```

### Basic Inference

The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:

```python
import io
from urllib.request import urlopen

import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

import groundcua

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation (greedy decoding, so the temperature is not used for sampling)
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(
    text=[input_text],
    images=[image],
    videos=None,
    padding=True,
    return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(response)
# Expected output: