Fine-tuned Qwen3-VL-4B for GUI Click Actions (Cropped)

Qwen3-VL-4B-Instruct fine-tuned on cropped GUI screenshots to predict click coordinates from a natural-language instruction.

Usage

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen3VLForConditionalGeneration.from_pretrained("BLR2/qwen3-vl-4b-gui-agent-cropped", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("BLR2/qwen3-vl-4b-gui-agent-cropped")

image = Image.open("screenshot.png")  # Should be cropped to 640x840
instruction = "Click on the submit button"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ],
}]

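inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
# outputs[0] contains the prompt followed by the answer; decode only the newly generated tokens
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(new_tokens, skip_special_tokens=True)
print(response)  # e.g. "0.5234 0.7891"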
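
The model expects a 640x840 crop, but this card does not say how the crop should be chosen. The following is a minimal sketch under two assumptions: that 640x840 means width x height, and that the crop is a fixed-size window placed around a point of interest and clamped to the screen. The helper names `crop_box` and `crop_to_screen` are hypothetical and not part of the model or its training pipeline. The sketch also shows how a normalized prediction made on the crop could be mapped back to full-screenshot pixels.

```python
# Assumed crop size (width x height), taken from the "640x840" note in the usage code.
CROP_W, CROP_H = 640, 840


def crop_box(center_x, center_y, screen_w, screen_h, crop_w=CROP_W, crop_h=CROP_H):
    """Return (left, top, right, bottom) for a crop_w x crop_h window.

    The window is centered on (center_x, center_y) where possible and shifted
    so that it stays inside the screen. If the screen is smaller than the
    crop, the box starts at 0 and extends past the screen edge.
    """
    left = min(max(center_x - crop_w // 2, 0), max(screen_w - crop_w, 0))
    top = min(max(center_y - crop_h // 2, 0), max(screen_h - crop_h, 0))
    return left, top, left + crop_w, top + crop_h


def crop_to_screen(x_norm, y_norm, box):
    """Map normalized coordinates predicted on the crop to full-screenshot pixels."""
    left, top, right, bottom = box
    return left + int(x_norm * (right - left)), top + int(y_norm * (bottom - top))
```

The returned 4-tuple has the layout that PIL's `Image.crop` accepts, so `image.crop(crop_box(...))` would produce the input image for the model under these assumptions.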

With vLLM

vllm serve BLR2/qwen3-vl-4b-gui-agent-cropped --dtype bfloat16
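
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library and assumes the server listens on the default `http://localhost:8000`. The function names `build_click_request` and `send_click_request` are illustrative, and the PNG MIME type in the data URL is an assumption about the screenshot format.

```python
import base64
import json
import urllib.request

MODEL_ID = "BLR2/qwen3-vl-4b-gui-agent-cropped"


def build_click_request(image_bytes, instruction, model=MODEL_ID):
    """Build a chat completions payload with one image and one instruction."""
    # Assumes the screenshot is a PNG; change the MIME type for other formats.
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ],
        }],
        "max_tokens": 50,
    }


def send_click_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server and return the model's text reply."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A call would look like `send_click_request(build_click_request(open("screenshot.png", "rb").read(), "Click on the submit button"))`, which should return a string in the output format described in this card.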

Output Format

The model outputs normalized coordinates as two space-separated numbers, x y, both in [0, 1] relative to the input image.

Convert to pixels: px_x = int(x * image_width), px_y = int(y * image_height)
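
The parsing and conversion can be wrapped in a small helper. This is a sketch, assuming the response is exactly two whitespace-separated numbers. The name `parse_click` is hypothetical, and clamping to the last valid pixel index is an addition to the formula above, so that an output of 1.0 does not land one pixel outside the image.

```python
def parse_click(response, image_width, image_height):
    """Parse an "x y" response and return integer pixel coordinates."""
    parts = response.strip().split()
    if len(parts) != 2:
        raise ValueError(f"expected two numbers, got: {response!r}")
    x, y = float(parts[0]), float(parts[1])
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError(f"coordinates outside [0, 1]: {response!r}")
    # Same conversion as above, clamped to the last valid pixel index (an addition).
    px_x = min(int(x * image_width), image_width - 1)
    px_y = min(int(y * image_height), image_height - 1)
    return px_x, px_y
```

For the example response on a 640x840 image, `parse_click("0.5234 0.7891", 640, 840)` returns `(334, 662)`.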

Model Details

Format: Safetensors
Model size: 4B params
Tensor type: BF16
Base model: Qwen/Qwen3-VL-4B-Instruct