sujitvasanth committed on
Commit
2db370b
·
verified ·
1 Parent(s): 3e9b5d8

Create README.md


---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- xlangai/AgentNet
- xlangai/aguvis-stage1
- smolagents/aguvis-stage-2
- osunlp/UGround-V1-Data
language:
- en
license: mit
metrics:
- accuracy
- code_eval
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VLM
- Computer-Use-Agent
- OS-Agent
- GUI
- Grounding
---

<h1 style="
font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;
font-size:48px;
font-weight:700;
line-height:1.25;
text-align:center;
margin:0 0 24px;">
OpenCUA: Open Foundations for Computer-Use Agents
</h1>

<div style="
display:flex;
justify-content:center;
gap:12px;
flex-wrap:wrap;
margin-bottom:28px;">

<a href="https://opencua.xlang.ai/" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
🌐 Website
</a>

<a href="https://arxiv.org/abs/2508.09123" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
📝 Paper
</a>

<a href="https://github.com/xlang-ai/OpenCUA" style="
display:inline-block;
padding:8px 24px;
background:#2b2b2b;
color:#ffffff;
border-radius:36px;
text-decoration:none;
font-weight:600;
font-size:16px;">
💻 Code
</a>
</div>

<div style="max-width:900px;margin:0 auto;">

# Introduction
<div style="
max-width: 880px; /* adjust the overall width as needed */
margin: 0 auto; /* center the container */
text-align: justify; /* key: justify the text */
text-justify: inter-word; /* improves justification of English text */
line-height: 1.6;">

OpenCUA models (OpenCUA-7B and OpenCUA-32B) are end-to-end computer-use foundation models that can produce executable actions in computer environments. They are based on the weights of Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-32B-Instruct.
They demonstrate superior performance across CUA benchmarks. In particular, <b>OpenCUA-32B</b> achieves an average success rate of **34.8%** on [OSWorld-Verified](https://os-world.github.io/),
establishing a new state of the art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Both models also show strong grounding performance: OpenCUA-32B achieves 59.6% on [OSWorld-G](https://osworld-grounding.github.io/) and 55.3% on [ScreenSpot-Pro](https://arxiv.org/abs/2504.07981).
</div>

### Key Features

- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
- **Multi-OS Support**: Trained on demonstrations across Ubuntu, Windows, and macOS
- **Visual Grounding**: Strong GUI element recognition and spatial reasoning capabilities
- **Multi-Image Context**: Processes a history of up to 3 screenshots for better context understanding
- **Reflective Reasoning**: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning
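
The multi-image context feature can be pictured as a rolling window over recent screenshots. The following is a minimal sketch, not the agent implementation from the OpenCUA repository: the `ScreenshotHistory` class, its method names, and the content-entry layout (a list of `image` and `text` dictionaries) are assumptions made for illustration. Only the limit of 3 screenshots comes from the feature list above.

```python
from collections import deque

MAX_HISTORY = 3  # limit taken from the "Multi-Image Context" feature


class ScreenshotHistory:
    """Hypothetical helper that keeps only the most recent screenshots."""

    def __init__(self, max_images: int = MAX_HISTORY):
        # A deque with maxlen silently drops the oldest entry when full.
        self._images = deque(maxlen=max_images)

    def add(self, image_b64: str) -> None:
        """Record a new base64-encoded PNG screenshot."""
        self._images.append(image_b64)

    def as_content(self, instruction: str) -> list:
        """Build a user-message content list: oldest image first, text last."""
        content = [
            {"type": "image", "image": f"data:image/png;base64,{b64}"}
            for b64 in self._images
        ]
        content.append({"type": "text", "text": instruction})
        return content


history = ScreenshotHistory()
for shot in ["a", "b", "c", "d", "e"]:  # placeholder strings, not real images
    history.add(shot)

content = history.as_content("Click on the submit button")
print(len(content))  # 3 images + 1 text entry
```

Capping the window keeps the prompt length bounded across long episodes. The exact prompt layout the models were trained on may differ, so check the repository before reusing this structure.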


# Performance

### Online Agent Evaluation
OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
OpenCUA-32B achieves the best performance among all open-source models, with an average success rate of 34.8% at 100 steps, outperforming prior open-source baselines by large margins.
It also narrows the gap to the proprietary Claude models.
<div align="center">

| **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
|-------------------------------|:--------:|:--------:|:---------:|
| **Proprietary** | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| **Open-Source** | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
| **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |
</div>

*OpenCUA scores are the mean of 3 independent runs.*

### GUI Grounding Performance
<div align="center">

| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
|-------|-----------|---------------|----------------|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 |
| **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
| **OpenCUA-Qwen2-7B** | 45.7 | 88.5 | 23.7 |
| **OpenCUA-7B** | 55.3 | 92.3 | 50.0 |
| **OpenCUA-32B** | **59.6** | **93.4** | **55.3** |
</div>


### AgentNetBench (Offline Evaluation)
<div align="center">

| **Model** | **Coordinate Actions** | **Content Actions** | **Function Actions** | **Average** |
|-------|-------------------|-----------------|------------------|---------|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA | 71.7 | 57.3 | **80.0** | 73.1 |
| **OpenCUA-7B** | 79.0 | 62.0 | 44.3 | 75.2 |
| **OpenCUA-32B** | **81.9** | 66.1 | 55.7 | **79.1** |
</div>

# 🚀 Quick Start
<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B):</strong>

To align with our training infrastructure, we have modified the model in two places:
<ul style="margin-top: 8px;">
<li>1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with <strong>1D RoPE</strong>.</li>
<li>2. The models use the same tokenizer and chat template as Kimi-VL.</li>
<li>Do not load the models with the default transformers or vLLM classes. If you train the models further, keep the tokenizer and chat template aligned with these changes.</li>
</ul>
</div>
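
To make the first modification more concrete, the sketch below contrasts the position indices each scheme would assign to a short sequence containing one image. It is a simplified illustration, not the model's implementation: the function names are hypothetical, the M-RoPE layout follows the general Qwen2-VL-style scheme (text tokens share one index across all three components, image patches get temporal, height, and width indices), and the actual rotary computation is omitted.

```python
def rope_1d_positions(n_text_before, grid_h, grid_w, n_text_after):
    """1D RoPE: one sequential index per token, image patches included."""
    total = n_text_before + grid_h * grid_w + n_text_after
    return list(range(total))


def mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    """M-RoPE-style (temporal, height, width) triple per token (simplified)."""
    # Text tokens use the same index for all three components.
    positions = [(i, i, i) for i in range(n_text_before)]
    start = n_text_before
    # Patches of a single image share a temporal index and vary by row/column.
    for row in range(grid_h):
        for col in range(grid_w):
            positions.append((start, start + row, start + col))
    # Following text resumes after the largest index used by the image.
    next_pos = start + max(grid_h, grid_w)
    positions.extend((next_pos + i,) * 3 for i in range(n_text_after))
    return positions


# 2 text tokens, a 2x3 patch grid, then 1 text token
print(rope_1d_positions(2, 2, 3, 1))
print(mrope_positions(2, 2, 3, 1))
```

Because the two schemes number image tokens differently, a loader that assumes M-RoPE would feed the modified weights position indices they were not trained with, which is one reason the model's own loading code is needed.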


## Installation & Download

First, install the required transformers dependencies:

```bash
conda create -n opencua python=3.10
conda activate opencua
pip install -r requirement.txt
```

Download the model weights from Hugging Face:
```python
from huggingface_hub import snapshot_download

# Download the full model repository into a local directory.
snapshot_download(
    repo_id="xlangai/OpenCUA-7B",
    local_dir="OpenCUA-7B",
    local_dir_use_symlinks=False
)
```

## 🎯 GUI Grounding

The following code demonstrates how to use OpenCUA models for GUI grounding tasks:

```python
import base64
import torch
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import json

def encode_image(image_path: str) -> str:
    """Encode image to base64 string for model input."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def load_opencua_model(model_path: str):
    """Load OpenCUA model, tokenizer, and image processor."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True
    )
    image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)

    return model, tokenizer, image_processor

def create_grounding_messages(image_path: str, instruction: str):
    """Create chat messages for GUI grounding task."""
    system_prompt = (
        "You are a GUI agent. You are given a task and a screenshot of the screen. "
        "You need to perform a series of pyautogui actions to complete the task."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
                {"type": "text", "text": instruction},
            ],
        },
    ]
    return messages

def run_inference(model, tokenizer, image_processor, messages, image_path):
    """Run inference on the model."""
    # Prepare text input
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    input_ids = torch.tensor([input_ids]).to(model.device)

    # Prepare image input
    image = Image.open(image_path).convert('RGB')
    image_info = image_processor.preprocess(images=[image])
    pixel_values = torch.tensor(image_info['pixel_values']).to(
        dtype=torch.bfloat16, device=model.device
    )
    grid_thws = torch.tensor(image_info['image_grid_thw'])

    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            pixel_values=pixel_values,
            grid_thws=grid_thws,
            max_new_tokens=512,
            temperature=0
        )

    # Decode output, dropping the prompt tokens
    prompt_len = input_ids.shape[1]
    generated_ids = generated_ids[:, prompt_len:]
    output_text = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text

# Example usage
model_path = "xlangai/OpenCUA-7B"  # repo id from the download step, or the local "OpenCUA-7B" directory
image_path = "screenshot.png"
instruction = "Click on the submit button"

# Load model
model, tokenizer, image_processor = load_opencua_model(model_path)

# Create messages and run inference
messages = create_grounding_messages(image_path, instruction)
result = run_inference(model, tokenizer, image_processor, messages, image_path)

print("Model output:", result)
```

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<em>Expected result:</em>

```python
pyautogui.click(x=1443, y=343)
```
</div>
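
If the output needs to be consumed programmatically, the click coordinates can be extracted with a regular expression. This is a minimal sketch, assuming the model emits a single `pyautogui.click(x=..., y=...)` call in the form shown in the expected result. `parse_click` is a hypothetical helper, not part of the OpenCUA codebase, and it does not handle other action types.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "pyautogui.click(x=1443, y=343)"; integer or decimal values.
_CLICK_RE = re.compile(
    r"pyautogui\.click\(\s*x\s*=\s*(-?\d+(?:\.\d+)?)\s*,\s*y\s*=\s*(-?\d+(?:\.\d+)?)"
)


def parse_click(output_text: str) -> Optional[Tuple[float, float]]:
    """Return (x, y) from the first pyautogui.click call, or None if absent."""
    match = _CLICK_RE.search(output_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_click("pyautogui.click(x=1443, y=343)"))  # (1443.0, 343.0)
print(parse_click("no action here"))                  # None
```

Parsing the text rather than passing it to `exec` avoids running arbitrary model output, which is the safer default when the prediction is only needed for evaluation.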

You can also run the five grounding examples in [OpenCUA/model/inference/hug
