---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-32B-Thinking
library_name: transformers
tags:
- vlm
- web-agent
- opagent
---
# OpAgent-32B

**OpAgent-32B** is a powerful, open-source Vision-Language Model (VLM) specifically fine-tuned for autonomous web navigation. It serves as the core single-model engine within the broader **[OpAgent project](https://github.com/codefuse-ai/OpAgent)**.

## Model Details

### Model Description

- **Base Model:** `Qwen3-VL-32B-Thinking`
- **Fine-tuning Strategy:** Hierarchical Multi-Task SFT followed by Online Agentic RL with a Hybrid Reward mechanism.
- **Primary Task:** Autonomous web navigation and task execution.
- **Input:** A combination of a natural language task description and a webpage screenshot.
- **Output:** A JSON-formatted action (e.g., `click`, `type`, `scroll`) or a final answer.

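The action output follows the tool-call convention used in the prompt format: reasoning inside `<think>` tags, then a JSON action inside `<tool_call>` tags. A minimal parsing sketch (the sample response string below is illustrative, not real model output):

```python
import json
import re

def parse_action(response: str) -> dict:
    """Extract the JSON action from a <tool_call>...</tool_call> block."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("no <tool_call> block in model output")
    return json.loads(match.group(1))

# Illustrative response string (not actual OpAgent-32B output):
raw = ('<think>The search box is at the top of the page.</think>'
       '<tool_call>{"name": "click", "arguments": {"x": 412, "y": 88}}</tool_call>')
action = parse_action(raw)
print(action)  # -> {'name': 'click', 'arguments': {'x': 412, 'y': 88}}
```
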
### Model Sources

- **Repository:** [https://github.com/codefuse-ai/OpAgent](https://github.com/codefuse-ai/OpAgent)

## Uses

This model is designed to be used as a web agent. The primary way to run it is through a high-performance inference engine like **vLLM**, as demonstrated in our [single-model usage guide](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model).

Below is a Python code snippet demonstrating how to use `OpAgent-32B` with `vLLM` for a single-step inference.

```python
import base64
from io import BytesIO

from PIL import Image
from vllm import LLM, SamplingParams

# --- 1. Helper function to encode an image as base64 ---
def encode_image_to_base64(image_path):
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

# --- 2. Initialize the vLLM engine ---
# Ensure you have enough GPU memory for a 32B model.
model_id = "codefuse-ai/OpAgent-32B"
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    gpu_memory_utilization=0.9,
)

# --- 3. Prepare the prompt ---
# The prompt combines the system message, the task description, and the screenshot.
task_description = "Search for wireless headphones under $50"
screenshot_path = "path/to/your/screenshot.png"  # Replace with your screenshot path
base64_image = encode_image_to_base64(screenshot_path)

# This system prompt format is crucial for the agent's performance.
system_prompt = (
    "You are a helpful web agent. Your goal is to perform tasks on a web page "
    "based on a screenshot and a user's instruction.\n"
    "Output the thinking process in <think> </think> tags, and for each function "
    "call, return a json object with function name and arguments within "
    "<tool_call></tool_call> XML tags as follows:\n"
    '<think> ... </think><tool_call>{"name": <function-name>, '
    '"arguments": <args-json-object>}</tool_call>.'
)

# OpenAI-style messages: `llm.chat` applies the model's chat template and
# inserts the image placeholder tokens automatically.
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
            {"type": "text", "text": f"Task: {task_description}"},
        ],
    },
]

# --- 4. Generate the action ---
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.chat(messages, sampling_params=sampling_params)

# --- 5. Print the result ---
for output in outputs:
    generated_text = output.outputs[0].text
    print("--- Generated Action ---")
    print(generated_text)
```

For a complete, interactive agent implementation, please see the code in the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) directory of our repository.

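The single-step snippet extends naturally to a full observe-act loop: screenshot the page, query the model, parse the tool call, execute it in the browser, and repeat until the model returns an answer. A schematic sketch of that control flow, using scripted stand-ins for the model and the browser (the `answer` terminal action name here is a hypothetical example, not necessarily the model's actual action vocabulary):

```python
import json
import re

def parse_tool_call(response):
    """Return the JSON action from a <tool_call> block, or None if absent."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    return json.loads(match.group(1)) if match else None

def run_agent(step_fn, execute_fn, max_steps=10):
    """Generic observe->act loop.

    step_fn(observation) returns the raw model output for one step;
    execute_fn(action) applies an action and returns the next observation
    (called with None to get the initial page state).
    """
    observation = execute_fn(None)
    for _ in range(max_steps):
        action = parse_tool_call(step_fn(observation))
        if action is None or action["name"] == "answer":
            return action  # terminal step (or unparseable output)
        observation = execute_fn(action)
    return None  # step budget exhausted

# Scripted model + environment, just to show the control flow:
scripted = iter([
    '<think>Type the query.</think>'
    '<tool_call>{"name": "type", "arguments": {"text": "wireless headphones"}}</tool_call>',
    '<think>Results look good.</think>'
    '<tool_call>{"name": "answer", "arguments": {"text": "Found results under $50"}}</tool_call>',
])
final = run_agent(lambda obs: next(scripted), lambda act: "screenshot-bytes")
print(final["arguments"]["text"])  # -> Found results under $50
```
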
## Citation

If you use `OpAgent-32B` or the `OpAgent` framework in your research, please cite our work:

```bibtex
@misc{opagent2026,
  author       = {CodeFuse-AI Team},
  title        = {OpAgent: Operator Agent for Web Navigation},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
  url          = {https://github.com/codefuse-ai/OpAgent}
}
```