---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-32B-Thinking
library_name: transformers
tags:
- vlm
- web-agent
- opagent
---

# OpAgent-32B

**OpAgent-32B** is a powerful, open-source Vision-Language Model (VLM) fine-tuned specifically for autonomous web navigation. It serves as the core single-model engine of the broader **[OpAgent project](https://github.com/codefuse-ai/OpAgent)**.

## Model Details

### Model Description

- **Base Model:** `Qwen3-VL-32B-Thinking`
- **Fine-tuning Strategy:** Hierarchical multi-task SFT followed by online agentic RL with a hybrid reward mechanism.
- **Primary Task:** Autonomous web navigation and task execution.
- **Input:** A natural language task description combined with a webpage screenshot.
- **Output:** A JSON-formatted action (e.g., `click`, `type`, `scroll`) or a final answer.
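
For illustration, a single model turn typically interleaves reasoning with one action. The sample below is only a sketch — the exact action schema and coordinate arguments are defined in the OpAgent repository:

```
<think> The task asks for wireless headphones under $50, so I start by clicking the search box. </think><tool_call>{"name": "click", "arguments": {"x": 512, "y": 88}}</tool_call>
```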

### Model Sources

- **Repository:** [https://github.com/codefuse-ai/OpAgent](https://github.com/codefuse-ai/OpAgent)

## Uses

This model is designed to be used as a web agent. The primary way to run it is through a high-performance inference engine like **vLLM**, as demonstrated in our [single-model usage guide](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model).
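
If you prefer to serve the model behind an OpenAI-compatible HTTP API rather than embedding it in your own process, vLLM's CLI can host it directly. The flags below are a sketch — adjust them to your hardware:

```shell
# Launch an OpenAI-compatible server for OpAgent-32B.
# --tensor-parallel-size and --gpu-memory-utilization depend on your GPU setup.
vllm serve codefuse-ai/OpAgent-32B \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9
```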

Below is a Python code snippet demonstrating how to use `OpAgent-32B` with vLLM for a single-step inference.

```python
import base64
from io import BytesIO

from PIL import Image
from vllm import LLM, SamplingParams

# --- 1. Helper function to encode an image ---
# (Useful when calling the model through the OpenAI-compatible server API.)
def encode_image_to_base64(image_path):
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

# --- 2. Initialize the vLLM engine ---
# Ensure you have enough GPU memory for a 32B model.
model_id = "codefuse-ai/OpAgent-32B"
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    gpu_memory_utilization=0.9,
)

# --- 3. Prepare the prompt ---
# The prompt must include the system message, task description, and the screenshot.
task_description = "Search for wireless headphones under $50"
screenshot_path = "path/to/your/screenshot.png"  # Replace with your screenshot path

# This prompt format is crucial for the agent's performance.
# Note the doubled braces: inside an f-string, literal `{` and `}` must be escaped.
# Qwen-style chat markers and vision placeholder tokens are used here; adjust them
# if the model's chat template differs.
prompt = f"""<|im_start|>system
You are a helpful web agent. Your goal is to perform tasks on a web page based on a screenshot and a user's instruction.
Output the thinking process in <think> </think> tags, and for each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<think> ... </think><tool_call>{{"name": <function-name>, "arguments": <args-json-object>}}</tool_call><|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>
Task: {task_description}<|im_end|>
<|im_start|>assistant
"""

# --- 4. Generate the action ---
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

# vLLM takes the image via `multi_modal_data`; the vision placeholder tokens in the
# prompt mark where the image embedding is inserted.
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(screenshot_path)},
    }],
    sampling_params=sampling_params,
)

# --- 5. Print the result ---
for output in outputs:
    generated_text = output.outputs[0].text
    print("--- Generated Action ---")
    print(generated_text)
```

For a complete, interactive agent implementation, please see the code in the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) directory of our repository.
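
To act on the model's reply, you need to extract the JSON payload from the `<tool_call>` tags. A minimal parser sketch, assuming the output format shown above (the function name here is illustrative, not part of the OpAgent API):

```python
import json
import re

def parse_action(generated_text):
    """Extract the JSON action from a <tool_call>...</tool_call> block.

    Returns the parsed dict, or None if no tool call is present
    (e.g. when the model emits a final answer instead of an action).
    """
    match = re.search(r"<tool_call>(.*?)</tool_call>", generated_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

reply = ('<think> Click the search box first. </think>'
         '<tool_call>{"name": "click", "arguments": {"x": 512, "y": 88}}</tool_call>')
action = parse_action(reply)
print(action["name"])  # click
```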

## Citation

If you use `OpAgent-32B` or the `OpAgent` framework in your research, please cite our work:

```bibtex
@misc{opagent2026,
  author       = {CodeFuse-AI Team},
  title        = {OpAgent: Operator Agent for Web Navigation},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
  url          = {https://github.com/codefuse-ai/OpAgent}
}
```