---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-32B-Thinking
library_name: transformers
tags:
- vlm
- web-agent
- opagent
---
# OpAgent-32B

**OpAgent-32B** is a powerful, open-source Vision-Language Model (VLM) specifically fine-tuned for autonomous web navigation. It serves as the core single-model engine within the broader **[OpAgent project](https://github.com/codefuse-ai/OpAgent)**.

## Model Details

### Model Description
- **Base Model:** `Qwen3-VL-32B-Thinking`
- **Fine-tuning Strategy:** Hierarchical Multi-Task SFT followed by Online Agentic RL with a Hybrid Reward mechanism.
- **Primary Task:** Autonomous web navigation and task execution.
- **Input:** A combination of a natural language task description and a webpage screenshot.
- **Output:** A JSON-formatted action (e.g., `click`, `type`, `scroll`) or a final answer (see the illustrative example below).
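
For illustration, a single model output in the `<think>`/`<tool_call>` format used throughout this card might look like the following (the action name and arguments are hypothetical; the exact action space is defined in the [OpAgent repository](https://github.com/codefuse-ai/OpAgent)):

```
<think>The task asks for wireless headphones under $50, so the first step is to click the search box.</think><tool_call>{"name": "click", "arguments": {"coordinate": [412, 96]}}</tool_call>
```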
### Model Sources

- **Repository:** [https://github.com/codefuse-ai/OpAgent](https://github.com/codefuse-ai/OpAgent)
## Uses

This model is designed to be used as a web agent. The primary way to run it is through a high-performance inference engine like **vLLM**, as demonstrated in our [single-model usage guide](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model).

Below is a Python code snippet demonstrating how to use `OpAgent-32B` with `vLLM` for single-step inference.
```python
import base64
from io import BytesIO

from PIL import Image
from vllm import LLM, SamplingParams

# --- 1. Helper function to encode an image as base64 ---
def encode_image_to_base64(image_path):
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")

# --- 2. Initialize the vLLM engine ---
# Ensure you have enough GPU memory.
model_id = "codefuse-ai/OpAgent-32B"
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    gpu_memory_utilization=0.9,
)

# --- 3. Prepare the conversation ---
# The conversation must include the system message, the screenshot, and the task description.
# This prompt format is crucial for the agent's performance; see the single-model usage
# guide in the repository for the exact prompts used by OpAgent.
SYSTEM_PROMPT = (
    "You are a helpful web agent. Your goal is to perform tasks on a web page "
    "based on a screenshot and a user's instruction.\n"
    "Output the thinking process in <think> </think> tags, and for each function call, "
    "return a json object with function name and arguments within <tool_call></tool_call> "
    "XML tags as follows:\n"
    '<think> ... </think><tool_call>{"name": <function-name>, '
    '"arguments": <args-json-object>}</tool_call>'
)

task_description = "Search for wireless headphones under $50"
screenshot_path = "path/to/your/screenshot.png"  # Replace with your screenshot path
base64_image = encode_image_to_base64(screenshot_path)

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
            {"type": "text", "text": f"Task: {task_description}"},
        ],
    },
]

# --- 4. Generate the action ---
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

# `llm.chat` applies the model's chat template and passes the screenshot to the model.
outputs = llm.chat(messages, sampling_params=sampling_params)

# --- 5. Print the result ---
for output in outputs:
    generated_text = output.outputs[0].text
    print("--- Generated Action ---")
    print(generated_text)
```
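
The generated text contains the agent's reasoning inside `<think>` tags followed by the action inside `<tool_call>` tags, as specified in the system prompt above. Below is a minimal sketch of how the action could be pulled out as a Python dict; the `final_answer` fallback name is purely illustrative, and the production parsing logic lives in the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) code.

```python
import json
import re

def parse_action(generated_text: str) -> dict:
    """Extract the JSON action from a <tool_call>...</tool_call> block, if present."""
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", generated_text, re.DOTALL)
    if match is None:
        # No tool call found: treat the text after </think> as the final answer.
        # "final_answer" is an illustrative name, not part of the official action schema.
        answer = generated_text.split("</think>")[-1].strip()
        return {"name": "final_answer", "arguments": {"text": answer}}
    return json.loads(match.group(1))

sample = (
    '<think>The search box must be clicked first.</think>'
    '<tool_call>{"name": "click", "arguments": {"coordinate": [412, 96]}}</tool_call>'
)
print(parse_action(sample))  # {'name': 'click', 'arguments': {'coordinate': [412, 96]}}
```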
For a complete, interactive agent implementation, please see the code in the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) directory of our repository.
## Citation

If you use `OpAgent-32B` or the `OpAgent` framework in your research, please cite our work:
```bibtex
@misc{opagent2026,
  author       = {CodeFuse-AI Team},
  title        = {OpAgent: Operator Agent for Web Navigation},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
  url          = {https://github.com/codefuse-ai/OpAgent}
}
```