Enhance model card with abstract and usage example for SEAgent
This PR improves the SEAgent model card by:
- Adding the full abstract of the paper "SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience" to provide a comprehensive overview of the model's capabilities and methodology.
- Including both the Hugging Face paper page link and the arXiv link for improved discoverability and reference.
- Providing a Python usage example built on the `transformers` library, making it easier for users to get started with the model.
- Formatting existing links for better readability.
@@ -1,17 +1,71 @@
 ---
-
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
+library_name: transformers
+license: apache-2.0
 pipeline_tag: image-text-to-text
 tags:
 - multimodal
-library_name: transformers
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
-
+# SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
+
+This repository hosts the `SEAgent` model, an advanced Computer Use Agent designed for autonomous learning and evolution in novel software environments.
+
+## Paper and Resources
+
+* **Paper (Hugging Face):** [SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience](https://huggingface.co/papers/2508.04700)
+* **Paper (arXiv):** [https://arxiv.org/abs/2508.04700](https://arxiv.org/abs/2508.04700)
+* **Code:** [GitHub Repository](https://github.com/SunzeY/SEAgent)
+
+## Abstract
+
+Repurposing large vision-language models (LVLMs) as computer use agents (CUAs) has led to substantial breakthroughs, primarily driven by human-labeled data. However, these models often struggle with novel and specialized software, particularly in scenarios lacking human annotations. To address this challenge, we propose SEAgent, an agentic self-evolving framework enabling CUAs to autonomously evolve through interactions with unfamiliar software. Specifically, SEAgent empowers computer-use agents to autonomously master novel software environments via experiential learning, where agents explore new software, learn through iterative trial-and-error, and progressively tackle auto-generated tasks organized from simple to complex. To achieve this goal, we design a World State Model for step-wise trajectory assessment, along with a Curriculum Generator that generates increasingly diverse and challenging tasks. The agent's policy is updated through experiential learning, comprised of adversarial imitation of failure actions and Group Relative Policy Optimization (GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist training strategy that integrates individual experiential insights from specialist agents, facilitating the development of a stronger generalist CUA capable of continuous autonomous evolution. This unified agent ultimately achieves performance surpassing ensembles of individual specialist agents on their specialized software. We validate the effectiveness of SEAgent across five novel software environments within OS-World. Our approach achieves a significant improvement of 23.2% in success rate, from 11.3% to 34.5%, over a competitive open-source CUA, i.e., UI-TARS.
+
+## Usage
+
+Here's how to use the SEAgent model for image-to-text generation (e.g., visual question answering or computer interaction tasks) using the `transformers` library:
+
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+
+# Load processor and model
+processor = AutoProcessor.from_pretrained("SunzeY/SEAgent")
+model = AutoModelForCausalLM.from_pretrained("SunzeY/SEAgent", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
+
+# Example image input (replace with your image path or PIL Image object)
+# You would typically load an image relevant to your computer use task.
+image_path = "path/to/your/image.jpg" # Replace with actual image path
+image = Image.open(image_path).convert("RGB")
+
+# Prepare conversation turns (example: simple visual question answering for demonstration)
+messages = [
+    {"role": "user", "content": "<img></img>What is in the image?"},
+]
+
+# Process inputs
+text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+model_inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
+
+# Generate response
+generated_ids = model.generate(**model_inputs, max_new_tokens=512)
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+print(generated_text)
+
+# Example: More complex interaction potentially involving object detection (if supported by the model's fine-tuning)
+messages_with_object_detection = [
+    {"role": "user", "content": "<img></img>Locate the cat and tell me its color."},
+]
+text_with_obj = processor.apply_chat_template(messages_with_object_detection, add_generation_prompt=True, tokenize=False)
+model_inputs_with_obj = processor(text=text_with_obj, images=image, return_tensors="pt").to(model.device)
 
-
+generated_ids_with_obj = model.generate(**model_inputs_with_obj, max_new_tokens=512)
+generated_text_with_obj = processor.batch_decode(generated_ids_with_obj, skip_special_tokens=True)[0]
 
-
+print(generated_text_with_obj)
+```
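The usage snippet in the diff loads the checkpoint through `AutoModelForCausalLM`, marks the image with an `<img></img>` placeholder, and decodes the full output sequence, which would include the prompt tokens. The following is a minimal sketch of an alternative, not the authors' reference code. It assumes the checkpoint loads through `AutoModelForImageTextToText` (consistent with the `image-text-to-text` pipeline tag in the front matter), that the processor's chat template accepts structured `image`/`text` content parts, and that the repository id is `Zery/CUA_World_State_Model`. The helper names `build_messages`, `strip_prompt`, and `describe` are hypothetical and introduced only for illustration.

```python
def build_messages(question, image_url):
    """Build a single-turn chat holding one image part and one text part.

    Hypothetical helper: the structured content format is an assumption
    about what the processor's chat template expects.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


def strip_prompt(generated_ids, prompt_length):
    """Drop the echoed prompt so only newly generated tokens are decoded."""
    return [sequence[prompt_length:] for sequence in generated_ids]


def describe(image_url, question, repo_id="Zery/CUA_World_State_Model"):
    """Run one image + question turn and return the decoded reply.

    The repo id is an assumption; substitute the checkpoint you intend to use.
    """
    # Heavy imports stay local so the helpers above work without them.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(repo_id)
    model = AutoModelForImageTextToText.from_pretrained(repo_id)

    # Let the chat template insert the image placeholder and tokenize in one step.
    inputs = processor.apply_chat_template(
        build_messages(question, image_url),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=512)
    new_tokens = strip_prompt(output_ids, inputs["input_ids"].shape[-1])
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Calling `describe("https://example.com/screenshot.png", "What is in the image?")` (a placeholder URL) would download the weights and run generation. The two helpers can be checked on their own without loading the model.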