Instructions to use xlangai/OpenCUA-72B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use xlangai/OpenCUA-72B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="xlangai/OpenCUA-72B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("xlangai/OpenCUA-72B", trust_remote_code=True, dtype="auto")


vLLM

How to use xlangai/OpenCUA-72B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "xlangai/OpenCUA-72B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "xlangai/OpenCUA-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/xlangai/OpenCUA-72B

SGLang

How to use xlangai/OpenCUA-72B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "xlangai/OpenCUA-72B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "xlangai/OpenCUA-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "xlangai/OpenCUA-72B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "xlangai/OpenCUA-72B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use xlangai/OpenCUA-72B with Docker Model Runner:
```
docker model run hf.co/xlangai/OpenCUA-72B
```

OpenCUA-72B

🌐 Website

📝 Paper

💻 Code

🚀 vLLM Serve (Recommended)

We recommend vLLM for production deployment. This requires vllm>=0.12.0 and the --trust-remote-code flag.

# OpenCUA-72B (8 GPUs, tensor parallel)
vllm serve xlangai/OpenCUA-72B \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --served-model-name opencua-72b \
  --host 0.0.0.0 \
  --port 8000

# OpenCUA-72B with data parallelism (tp=2, dp=4 for 4 instances on 8 GPUs)
vllm serve xlangai/OpenCUA-72B \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000

Adjust --tensor-parallel-size, --data-parallel-size, and --gpu-memory-utilization based on your hardware configuration.
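As a quick sanity check before changing these flags, the GPU count a launch needs can be computed from the two parallel sizes. The sketch below assumes pipeline parallelism stays at its default of 1, so each data-parallel replica spans exactly `--tensor-parallel-size` GPUs. The helper name `gpus_required` is hypothetical and not part of vLLM.

```python
def gpus_required(tensor_parallel_size: int, data_parallel_size: int = 1) -> int:
    """Return the GPU count for a launch, assuming pipeline parallelism of 1.

    Each data-parallel replica is sharded across `tensor_parallel_size` GPUs,
    so the total is the product of the two sizes.
    """
    if tensor_parallel_size < 1 or data_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * data_parallel_size


# The two commands above: tp=8 alone, and tp=2 with dp=4.
print(gpus_required(8))     # 8
print(gpus_required(2, 4))  # 8
```

Both configurations occupy the same 8 GPUs. The first serves one replica sharded eight ways, the second serves four replicas sharded two ways each.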

Introduction

OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that produce executable actions in computer environments, with strong planning and grounding capabilities. They are based on the Qwen2.5-VL model family.

With the OpenCUA framework, our end-to-end agent models perform strongly across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models. OpenCUA-72B also has strong grounding ability, achieving 37.3% (SOTA) on UI-Vision and 60.8% on ScreenSpot-Pro.

📢 Updates

2026-01-17: 🎉 vLLM now fully supports OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B! Thanks to the Meituan EvoCUA Team for their contributions to vLLM integration.

Key Features

Superior Computer-Use Capability: Able to execute multi-step computer-use actions with effective planning and reasoning
Multi-OS Support: Trained on demonstrations across Ubuntu, Windows, and macOS
Visual Grounding: Strong GUI element recognition and spatial reasoning capabilities
Multi-Image Context: Processes a history of up to 3 screenshots for better context understanding
Reflective Reasoning: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning

Performance

Online Agent Evaluation

OpenCUA models achieve strong performance on OSWorld-Verified. OpenCUA-72B achieves the best performance among all open-source models, with an average success rate of 45.0% at 100 steps, establishing a new state of the art (SOTA).

Model	15 Steps	50 Steps	100 Steps
Proprietary
OpenAI CUA	26.0	31.3	31.4
Seed 1.5-VL	27.9	—	34.1
Claude 3.7 Sonnet	27.1	35.8	35.9
Claude 4 Sonnet	31.2	43.9	41.5
Open-Source
Qwen 2.5-VL-32B-Instruct	3.0	—	3.9
Qwen 2.5-VL-72B-Instruct	4.4	—	5.0
Kimi-VL-A3B	9.7	—	10.3
UI-TARS-72B-DPO	24.0	25.8	27.1
UI-TARS-1.5-7B	24.5	27.3	27.4
OpenCUA-7B (Ours)	24.3	27.9	26.6
OpenCUA-32B (Ours)	29.7	34.1	34.8
OpenCUA-72B (Ours)	39.0	44.9	45.0

OpenCUA scores are the mean of 3 independent runs.

GUI Grounding Performance

Model	OSWorld-G	ScreenSpot-V2	ScreenSpot-Pro	UI-Vision
Qwen2.5-VL-7B	31.4	88.8	27.6	0.85
Qwen2.5-VL-32B	46.5	87.0	39.4	-
UI-TARS-72B	57.1	90.3	38.1	25.5
OpenCUA-7B	55.3	92.3	50.0	29.7
OpenCUA-32B	59.6	93.4	55.3	33.3
OpenCUA-72B	59.2	92.9	60.8	37.3

AgentNetBench (Offline Evaluation)

Model	Coordinate Actions	Content Actions	Function Actions	Average
Qwen2.5-VL-7B	50.7	40.8	3.1	48.0
Qwen2.5-VL-32B	66.6	47.2	41.5	64.8
Qwen2.5-VL-72B	67.2	52.6	50.5	67.0
OpenAI CUA	71.7	57.3	80.0	73.1
OpenCUA-7B	79.0	62.0	44.3	75.2
OpenCUA-32B	81.9	66.1	55.7	79.1

🚀 Quick Start

⚠️ Important for Qwen-based Models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):

To align with our training infrastructure, we have modified the model in two places:

1. Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.
2. The tokenizer and chat template are the same as those used by Kimi-VL.

vLLM is supported via the --trust-remote-code flag. If you train the models, make sure the tokenizer and chat template stay aligned with these modifications.

Installation & Download

First, install the required dependencies:

conda create -n opencua python=3.12
conda activate opencua
# Quote the requirement so the shell does not treat ">" as output redirection
pip install "openai>=1.0.0"

Download the model weights from Hugging Face (optional, since vLLM can download them automatically):

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="xlangai/OpenCUA-72B",
    local_dir="OpenCUA-72B",
    local_dir_use_symlinks=False
)

🎯 GUI Grounding

First, start the vLLM server:

vllm serve xlangai/OpenCUA-72B \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --served-model-name opencua-72b \
  --host 0.0.0.0 \
  --port 8000

Then run the following code to test GUI grounding:

import base64
from openai import OpenAI

# vLLM server configuration
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "opencua-72b"  # Should match --served-model-name in vllm serve

def encode_image(image_path: str) -> str:
    """Encode image to base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def run_grounding(image_path: str, instruction: str) -> str:
    """Run GUI grounding inference via vLLM."""
    client = OpenAI(base_url=VLLM_BASE_URL, api_key="EMPTY")

    system_prompt = (
        "You are a GUI agent. You are given a task and a screenshot of the screen. "
        "You need to perform a series of pyautogui actions to complete the task."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}
                },
                {"type": "text", "text": instruction},
            ],
        },
    ]

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        max_tokens=512,
        temperature=0,
    )

    return response.choices[0].message.content

# Example usage
image_path = "screenshot.png"
instruction = "Click on the submit button"

result = run_grounding(image_path, instruction)
print("Model output:", result)

Expected result: ```python\npyautogui.click(x=1443, y=343)\n```
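To act on the response, the coordinates have to be pulled out of the returned text. The following is a minimal sketch that assumes the output contains a call of the form `pyautogui.click(x=<int>, y=<int>)`, as in the expected result above. The helper name `parse_click` and the regular expression are illustrative and not part of the OpenCUA codebase; other action types or float coordinates would need additional patterns.

```python
import re

# Matches pyautogui.click(x=<int>, y=<int>) with optional whitespace.
CLICK_RE = re.compile(
    r"pyautogui\.click\(\s*x\s*=\s*(-?\d+)\s*,\s*y\s*=\s*(-?\d+)\s*\)"
)


def parse_click(model_output: str):
    """Return (x, y) from the first click call in the output, or None."""
    match = CLICK_RE.search(model_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Build a sample response wrapped in a Markdown code fence.
fence = "`" * 3
sample = f"{fence}python\npyautogui.click(x=1443, y=343)\n{fence}"
print(parse_click(sample))  # (1443, 343)
```

The returned values are coordinates on the smart-resized image, so they still need to be mapped back to the original screenshot before being executed.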

You can also run the grounding examples in OpenCUA/model/inference/:

cd ./model/inference/

# vLLM (requires running vLLM server first)
python vllm_inference.py

# HuggingFace Transformers
python huggingface_inference.py

🖥️ Computer Use Agent

OpenCUAAgent is developed in the OSWorld environment based on OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as an inner monologue, and predicts the next action to execute. By default, OpenCUAAgent uses 3 images in total and the L2 CoT format.
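The 3-image limit can be illustrated with a sliding window over screenshots. This is a rough sketch under stated assumptions, not the agent's actual implementation: the class name `ScreenshotHistory` is hypothetical, the message layout is modeled on the OpenAI-style image_url format, and the real agent also carries previous reasoning and actions between steps, which is omitted here.

```python
from collections import deque

MAX_IMAGES = 3  # the text states that OpenCUAAgent uses 3 images in total


class ScreenshotHistory:
    """Keep only the most recent screenshots (base64-encoded PNG strings)."""

    def __init__(self, max_images: int = MAX_IMAGES):
        # A bounded deque silently drops the oldest frame when full.
        self._frames = deque(maxlen=max_images)

    def add(self, image_b64: str) -> None:
        self._frames.append(image_b64)

    def as_content(self, instruction: str) -> list:
        """Build a user-message content list: images oldest first, then text."""
        content = [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{frame}"},
            }
            for frame in self._frames
        ]
        content.append({"type": "text", "text": instruction})
        return content


history = ScreenshotHistory()
for step in range(5):
    history.add(f"frame{step}")  # placeholder strings instead of real images

content = history.as_content("Click on the submit button")
print(len(content))  # 4
```

After five steps only the last three frames remain, so the content list holds three image entries followed by the instruction.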

Command for running OpenCUA-72B in OSWorld:

    python run_multienv_opencua.py \
        --headless \
        --observation_type screenshot \
        --model OpenCUA-72B \
        --result_dir ./results \
        --test_all_meta_path evaluation_examples/test_all_no_gdrive.json \
        --max_steps 100 \
        --num_envs 30 \
        --coordinate_type qwen25

Important Notes on Coordinate Systems

OpenCUA/OpenCUA-7B – Absolute coordinates
OpenCUA/OpenCUA-32B – Absolute coordinates
OpenCUA/OpenCUA-72B – Absolute coordinates

OpenCUA models output absolute coordinates after smart resize:

# Example output: pyautogui.click(x=960, y=324)
# These are coordinates on the smart-resized image, not the original image

# Convert to original image coordinates.
# smart_resize is the function defined at: https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
    # First, calculate the smart-resized dimensions
    resized_height, resized_width = smart_resize(original_height, original_width, factor=28, min_pixels=3136, max_pixels=12845056)

    # Convert model output to relative coordinates on original image
    rel_x = model_x / resized_width
    rel_y = model_y / resized_height

    # Then convert to absolute coordinates on original image
    abs_x = int(rel_x * original_width)
    abs_y = int(rel_y * original_height)
    return abs_x, abs_y

Understanding Smart Resize for Qwen2.5-based Models:

The Qwen2.5-VL models use a "smart resize" preprocessing step that approximately preserves the aspect ratio while rounding each side to a multiple of 28 and keeping the total pixel count within the configured limits. For coordinate conversion, you need the smart resize function from the official Qwen2.5-VL implementation.
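For readers who want to see the arithmetic without installing Transformers, the sketch below re-implements the resize logic in plain Python. It is an approximation modeled on the referenced `smart_resize` function, not the official code, and may differ from it in edge cases such as input validation. The names `smart_resize_sketch` and `to_original_coords` are hypothetical, and the defaults are taken from the conversion snippet above.

```python
import math


def smart_resize_sketch(height, width, factor=28,
                        min_pixels=3136, max_pixels=12845056):
    """Approximate the resized (height, width) used by Qwen2-VL preprocessing.

    Both sides are rounded to multiples of `factor`; if the pixel count then
    falls outside [min_pixels, max_pixels], the image is rescaled to fit.
    """
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: scale down, rounding toward zero to stay under the cap.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: scale up, rounding away from zero to reach the floor.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def to_original_coords(model_x, model_y, original_width, original_height):
    """Map a point on the resized image back to the original screenshot."""
    resized_h, resized_w = smart_resize_sketch(original_height, original_width)
    abs_x = int(model_x / resized_w * original_width)
    abs_y = int(model_y / resized_h * original_height)
    return abs_x, abs_y


print(smart_resize_sketch(1080, 1920))           # (1092, 1932)
print(to_original_coords(966, 546, 1920, 1080))  # (960, 540)
```

A 1920x1080 screenshot is resized to 1932x1092, so a model output at the center of the resized image, (966, 546), maps back to (960, 540) on the original.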

Acknowledgements

We thank Yu Su, Caiming Xiong, and the anonymous reviewers for their insightful discussions and valuable feedback. We are grateful to Moonshot AI for providing training infrastructure and annotated data. We also sincerely appreciate Hao Yang, Zhengtao Wang, and Yanxu Chen from the Kimi Team for their strong infrastructure support and helpful guidance. We thank Chong Peng, Taofeng Xue, and Qiumian Huang from the Meituan EvoCUA Team for their contributions to vLLM integration. Our tool builds on the open-source projects DuckTrack and OpenAdapt, and we are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.

License

This project is licensed under the MIT License for Research and Commercial Use - see the LICENSE file in the root folder for details.

Citation

If you use OpenCUA models in your research, please cite our work:

@misc{wang2025opencuaopenfoundationscomputeruse,
      title={OpenCUA: Open Foundations for Computer-Use Agents},
      author={Xinyuan Wang and Bowen Wang and Dunjie Lu and Junlin Yang and Tianbao Xie and Junli Wang and Jiaqi Deng and Xiaole Guo and Yiheng Xu and Chen Henry Wu and Zhennan Shen and Zhuokai Li and Ryan Li and Xiaochuan Li and Junda Chen and Boyuan Zheng and Peihang Li and Fangyu Lei and Ruisheng Cao and Yeqiao Fu and Dongchan Shin and Martin Shin and Jiarui Hu and Yuyan Wang and Jixuan Chen and Yuxiao Ye and Danyang Zhang and Dikang Du and Hao Hu and Huarong Chen and Zaida Zhou and Haotian Yao and Ziwei Chen and Qizheng Gu and Yipu Wang and Heng Wang and Diyi Yang and Victor Zhong and Flood Sung and Y. Charles and Zhilin Yang and Tao Yu},
      year={2025},
      eprint={2508.09123},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.09123},
}