Update README.md

6ce04b3 verified 17 days ago

4.05 kB

	---
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen3-VL-32B-Thinking
	library_name: transformers
	tags:
	- vlm
	- web-agent
	- opagent
	---
	# OpAgent-32B

	<!-- Provide a quick summary of what the model is/does. -->

	OpAgent-32B is a powerful, open-source Vision-Language Model (VLM) specifically fine-tuned for autonomous web navigation. It serves as the core single-model engine within the broader [OpAgent project](https://github.com/codefuse-ai/OpAgent).



	## Model Details

	### Model Description

	<!-- Provide a longer summary of what this model is. -->

	- Base Model: `Qwen3-VL-32B-Thinking`
	- Fine-tuning Strategy: Hierarchical Multi-Task SFT followed by Online Agentic RL with a Hybrid Reward mechanism.
	- Primary Task: Autonomous web navigation and task execution.
	- Input: A combination of a natural language task description and a webpage screenshot.
	- Output: A JSON-formatted action (e.g., `click`, `type`, `scroll`) or a final answer.

	### Model Sources [optional]

	<!-- Provide the basic links for the model. -->

	- Repository: [https://github.com/codefuse-ai/OpAgent]



	## Uses

	This model is designed to be used as a web agent. The primary way to run it is through a high-performance inference engine like vLLM, as demonstrated in our [single-model usage guide](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model).

	Below is a Python code snippet demonstrating how to use `OpAgent-32B` with `vLLM` for a single-step inference.

	```python
	import base64
	from vllm import LLM, SamplingParams
	from PIL import Image
	from io import BytesIO

	# --- 1. Helper function to encode image ---
	def encode_image_to_base64(image_path):
	with Image.open(image_path) as img:
	buffered = BytesIO()
	img.save(buffered, format="PNG")
	return base64.b64encode(buffered.getvalue()).decode('utf-8')

	# --- 2. Initialize the vLLM engine ---
	# Ensure you have enough GPU memory.
	model_id = "codefuse-ai/OpAgent-32B"
	llm = LLM(
	model=model_id,
	trust_remote_code=True,
	tensor_parallel_size=1, # Adjust based on your GPU setup
	gpu_memory_utilization=0.9
	)

	# --- 3. Prepare the prompt ---
	# The prompt must include the system message, task description, and the screenshot.
	task_description = "Search for wireless headphones under $50"
	screenshot_path = "path/to/your/screenshot.png" # Replace with your screenshot path
	base64_image = encode_image_to_base64(screenshot_path)

	# This prompt format is crucial for the agent's performance
	prompt = f"""system
	You are a helpful web agent. Your goal is to perform tasks on a web page based on a screenshot and a user's instruction.
	Output the thinking process in <think> </think> tags, and for each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:\n<think> ... </think><tool_call>{"name": <function-name>, "arguments": <args-json-object>}</tool_call>.
	user
	[SCREENSHOT]
	Task: {task_description}
	assistant
	"""

	# --- 4. Generate the action ---
	sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

	# The model expects the image to be passed via the `images` parameter
	outputs = llm.generate(
	prompts=[prompt],
	sampling_params=sampling_params,
	images=[base64_image]
	)

	# --- 5. Print the result ---
	for output in outputs:
	generated_text = output.outputs[0].text
	print("--- Generated Action ---")
	print(generated_text)

	```

	For a complete, interactive agent implementation, please see the code in the [`opagent_single_model`](https://github.com/codefuse-ai/OpAgent/tree/main/opagent_single_model) directory of our repository.

	## Citation

	If you use `OpAgent-32B` or the `OAgent` framework in your research, please cite our work:

	```bibtex
	@misc{opagent2026,
	author = {CodeFuse-AI Team},
	title = {OpAgent: Operator Agent for Web Navigation},
	year = {2026},
	publisher = {GitHub},
	howpublished = {\url{https://github.com/codefuse-ai/OpAgent}},
	url = {https://github.com/codefuse-ai/OpAgent}
	}