Fine-tuned Qwen3-VL-4B for GUI Click Actions (Cropped)
Qwen3-VL-4B-Instruct fine-tuned on cropped GUI screenshots to predict click coordinates from a natural-language instruction.
Usage
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
model = Qwen3VLForConditionalGeneration.from_pretrained("BLR2/qwen3-vl-4b-gui-agent-cropped", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("BLR2/qwen3-vl-4b-gui-agent-cropped")
image = Image.open("screenshot.png") # Should be cropped to 640x840
instruction = "Click on the submit button"
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ],
}]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
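outputs = model.generate(**inputs, max_new_tokens=50)
# outputs[0] contains the prompt tokens followed by the generated tokens; decode only the new ones
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated, skip_special_tokens=True)
print(response) # e.g. "0.5234 0.7891"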
With vLLM
vllm serve BLR2/qwen3-vl-4b-gui-agent-cropped --dtype bfloat16
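Once the server is running, it can be queried over vLLM's OpenAI-compatible chat completions endpoint. The sketch below is a minimal example, not a tested client for this model: it assumes the server listens on the default http://localhost:8000, that the screenshot is a PNG, and the helper names build_chat_payload and request_click are hypothetical.

```python
import base64
import json
import urllib.request

MODEL_ID = "BLR2/qwen3-vl-4b-gui-agent-cropped"


def build_chat_payload(image_bytes: bytes, instruction: str, max_tokens: int = 50) -> dict:
    """Build an OpenAI-style chat request with one image and one text instruction."""
    # The image is sent inline as a base64 data URL (PNG is an assumption here).
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ],
        }],
    }


def request_click(image_path: str, instruction: str,
                  base_url: str = "http://localhost:8000/v1") -> str:
    """Send the screenshot to the server and return the raw text response."""
    with open(image_path, "rb") as f:
        payload = build_chat_payload(f.read(), instruction)
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling request_click("screenshot.png", "Click on the submit button") should return the same kind of coordinate string as the transformers example, provided the server is reachable at the assumed address.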
Output Format
The model outputs normalized coordinates as two space-separated numbers, x y, both in [0, 1].
Convert to pixels: px_x = int(x * image_width), px_y = int(y * image_height)
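The conversion above can be sketched as a pair of small helpers. This is an illustration under assumptions rather than the author's code: parse_click and to_pixels are hypothetical names, the clamp to the last valid pixel and the optional offset (the crop's top-left corner in the full screenshot) are additions not described in this card.

```python
import re


def parse_click(response: str) -> tuple[float, float]:
    """Extract the first two numbers in the response as normalized (x, y)."""
    numbers = re.findall(r"[-+]?\d*\.?\d+", response)
    if len(numbers) < 2:
        raise ValueError(f"expected two coordinates, got: {response!r}")
    x, y = float(numbers[0]), float(numbers[1])
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"coordinates outside [0, 1]: {x}, {y}")
    return x, y


def to_pixels(x: float, y: float, image_width: int, image_height: int,
              offset: tuple[int, int] = (0, 0)) -> tuple[int, int]:
    """Convert normalized coordinates to pixel coordinates.

    offset is an assumed extra: the crop's top-left corner in the full
    screenshot, used to map a click on the crop back to screen space.
    """
    # Clamp so that x == 1.0 or y == 1.0 still lands inside the image (an addition).
    px_x = min(int(x * image_width), image_width - 1)
    px_y = min(int(y * image_height), image_height - 1)
    return px_x + offset[0], px_y + offset[1]


x, y = parse_click("0.5234 0.7891")
print(to_pixels(x, y, 640, 840))  # (334, 662)
```

Here 640 and 840 are taken as the width and height of the cropped input, which is an assumption about the order in "640x840".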
Model tree for BLR2/qwen3-vl-4b-gui-agent-cropped
Base model
Qwen/Qwen3-VL-4B-Instruct