---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-32B-Thinking
library_name: transformers
tags:
- vlm
- web-agent
- opagent
---

# OpAgent-32B

**OpAgent-32B** is a powerful, open-source Vision-Language Model (VLM) fine-tuned specifically for autonomous web navigation. It serves as the core single-model engine of the broader **[OpAgent project](https://github.com/codefuse-ai/OpAgent)**.

## Model Details

### Model Description

- **Base Model:** `Qwen3-VL-32B-Thinking`
- **Fine-tuning Strategy:** Hierarchical multi-task SFT followed by online agentic RL with a hybrid reward mechanism.
- **Primary Task:** Autonomous web navigation and task execution.
- **Input:** A natural language task description combined with a webpage screenshot.
- **Output:** A JSON-formatted action (e.g., `click`, `type`, `scroll`) or a final answer.

### Model Sources

- **Repository:** [https://github.com/codefuse-ai/OpAgent](https://github.com/codefuse-ai/OpAgent)

## Uses

This model is designed to be used as a web agent. The recommended way to run it is through a high-performance inference engine such as **vLLM**, as demonstrated in our [single-model usage guide](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model).

Below is a Python snippet showing how to run a single inference step with `OpAgent-32B` on `vLLM`.

```python
import base64
from io import BytesIO

from PIL import Image
from vllm import LLM, SamplingParams


# --- 1. Helper function to encode the screenshot as base64 ---
def encode_image_to_base64(image_path):
    with Image.open(image_path) as img:
        buffered = BytesIO()
        img.save(buffered, format="PNG")
        return base64.b64encode(buffered.getvalue()).decode("utf-8")


# --- 2. Initialize the vLLM engine ---
# Ensure you have enough GPU memory; a 32B VLM typically requires
# multiple GPUs or a single large-memory GPU.
model_id = "codefuse-ai/OpAgent-32B"
llm = LLM(
    model=model_id,
    trust_remote_code=True,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
    gpu_memory_utilization=0.9,
)

# --- 3. Prepare the messages ---
# The request must include the system prompt, the task description, and the screenshot.
task_description = "Search for wireless headphones under $50"
screenshot_path = "path/to/your/screenshot.png"  # Replace with your screenshot path
base64_image = encode_image_to_base64(screenshot_path)

# This system prompt is crucial for the agent's performance.
system_prompt = (
    "You are a helpful web agent. Your goal is to perform tasks on a web page "
    "based on a screenshot and a user's instruction. Output the thinking process "
    "in <think> </think> tags, and for each function call, return a json object "
    "with function name and arguments within <tool_call> </tool_call> XML tags as follows:\n"
    "<tool_call>\n"
    '{"name": <function-name>, "arguments": <args-json-object>}\n'
    "</tool_call>"
)

messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"},
            },
            {"type": "text", "text": f"Task: {task_description}"},
        ],
    },
]

# --- 4. Generate the action ---
# Increase max_tokens if the thinking trace gets truncated.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

# llm.chat applies the model's chat template and routes the image to the vision encoder.
outputs = llm.chat(messages, sampling_params=sampling_params)

# --- 5. Print the result ---
for output in outputs:
    generated_text = output.outputs[0].text
    print("--- Generated Action ---")
    print(generated_text)
```

For a complete, interactive agent implementation, see the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) directory of our repository.
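If you only need the structured action from a single step, a minimal sketch along the following lines can extract the JSON object emitted between the `<tool_call>` tags defined in the system prompt above. The helper name `parse_action` and the sample response are purely illustrative; adapt the tag names if your prompt differs.

```python
import json
import re
from typing import Optional


def parse_action(generated_text: str) -> Optional[dict]:
    """Extract the JSON action enclosed in <tool_call> ... </tool_call> tags.

    Returns None when the model produced a final answer instead of a tool call.
    """
    match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", generated_text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical model response, for illustration only.
sample = (
    "<think>The search box is visible, so I should type the query.</think>\n"
    '<tool_call>\n{"name": "type", "arguments": {"text": "wireless headphones under $50"}}\n</tool_call>'
)
print(parse_action(sample))
# {'name': 'type', 'arguments': {'text': 'wireless headphones under $50'}}
```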
## Citation

If you use `OpAgent-32B` or the OpAgent framework in your research, please cite our work:

```bibtex
@misc{opagent2026,
  author       = {CodeFuse-AI Team},
  title        = {OpAgent: Operator Agent for Web Navigation},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
  url          = {https://github.com/codefuse-ai/OpAgent}
}
```