A LoRA fine-tuned Qwen3.5-9B model for GUI element grounding — predicting the (x, y) coordinates of UI elements from screenshots given natural language instructions.
## Results

| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |
### ScreenSpot-V2 Breakdown

| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| **Overall** | **94.7%** |
### ScreenSpot-Pro Full Breakdown (1,581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| **Overall** | **65.0%** | | |
## Comparison with State-of-the-Art

### ScreenSpot-V2

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| **7** | **Kodeseer-9B** | **9B** | **94.7%** |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |
### ScreenSpot-Pro

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| **9** | **Kodeseer-9B** | **9B** | **65.0%** |
| 10 | Qwen3-VL + MVP | 8B | 65.3%\* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |

\*MVP is a training-free inference trick.
### ScreenSpot Original

| Rank | Model | Size | Score |
|---|---|---|---|
| **1** | **Kodeseer-9B** | **9B** | **92.1%** |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |
## Usage

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16, then attach the LoRA adapter
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user "
        "instruction, return the exact coordinates of the target UI element "
        "using the format: <|box_start|>(x,y)<|box_end|> where x and y are "
        "in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, keeping the special coordinate tokens
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
```
## Coordinate Format

The model predicts click coordinates in `<|box_start|>(x,y)<|box_end|>` format, where x and y are normalized to the [0, 1000] range regardless of the screenshot's resolution. To convert to pixel coordinates, divide by 1000 and scale by the image's width and height:
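A minimal sketch of parsing the model's response and rescaling to pixels (the `to_pixels` helper and its regex are illustrative, not part of the model's API):

```python
import re

def to_pixels(response: str, img_width: int, img_height: int) -> tuple[int, int]:
    """Parse <|box_start|>(x,y)<|box_end|> and rescale [0, 1000] coords to pixels."""
    m = re.search(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>", response)
    if m is None:
        raise ValueError(f"no coordinates found in: {response!r}")
    x, y = int(m.group(1)), int(m.group(2))
    # Both axes are normalized to [0, 1000], so scale each by its own dimension
    return round(x / 1000 * img_width), round(y / 1000 * img_height)

# Example: map the sample output above onto a 1920x1080 screenshot
print(to_pixels("<|box_start|>(512,340)<|box_end|>", 1920, 1080))  # (983, 367)
```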