# Kodeseer-9B
A LoRA fine-tune of Qwen3.5-9B for GUI element grounding: predicting the (x, y) coordinates of UI elements in a screenshot from a natural-language instruction.
## Results
| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |
### ScreenSpot-V2 Breakdown
| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |
### ScreenSpot-Pro Full Breakdown (1,581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| **Overall** | **65.0%** | | |
## Comparison with State-of-the-Art

### ScreenSpot-V2
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |
### ScreenSpot-Pro
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |
\*MVP is a training-free inference-time technique.
### ScreenSpot Original
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |
## Usage

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model, then attach the LoRA adapter from the best checkpoint.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, keeping the special box tokens.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
```
## Coordinate Format

The model predicts click coordinates in `<|box_start|>(x,y)<|box_end|>` format, where x and y are normalized to the [0, 1000] range. To convert to pixel coordinates:

```python
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
```
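The model's raw output string can be turned into pixel positions with a small helper (a sketch; `parse_box` is not part of any released code):

```python
import re

def parse_box(response: str, image_width: int, image_height: int):
    """Parse '<|box_start|>(x,y)<|box_end|>' into pixel coordinates, or None."""
    m = re.search(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>", response)
    if m is None:
        return None
    x, y = int(m.group(1)), int(m.group(2))
    # Coordinates are normalized to [0, 1000]; scale to the actual image size.
    return int(x / 1000 * image_width), int(y / 1000 * image_height)

print(parse_box("<|box_start|>(512,340)<|box_end|>", 1920, 1080))  # (983, 367)
```

Returning `None` on a parse failure lets the caller fall back to a retry or a no-op click instead of crashing on malformed output.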
## Training Details
- Base model: Qwen/Qwen3.5-9B (9.65B parameters)
- Method: LoRA (rank 32, alpha 64, all-linear targets)
- Frozen: ViT + aligner (only LLM LoRA trained)
- MAX_PIXELS: 3,014,656 (3M — critical for ScreenSpot-Pro's tiny targets)
- Epochs: 3
- Learning rate: 5e-5, cosine scheduler, 5% warmup
- Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
- Hardware: 8x NVIDIA A40 (48GB each)
- Training time: ~4.5 hours
- Best checkpoint: step 3100 (selected by eval_loss)
- dtype: bfloat16
- Framework: ms-swift 4.0.2, transformers 5.2.0
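The adapter hyperparameters above correspond roughly to the following `peft` configuration (a sketch for orientation only; the actual run was launched through ms-swift 4.0.2, whose CLI flags differ):

```python
from peft import LoraConfig

# Sketch of the LoRA setup described above; not the actual training config.
lora_config = LoraConfig(
    r=32,                         # LoRA rank
    lora_alpha=64,                # alpha = 2 * rank
    target_modules="all-linear",  # adapt every linear layer in the LLM
    task_type="CAUSAL_LM",
)
```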
## Training Data (~26K samples)

| Source | Samples | Description |
|---|---|---|
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality filtered (removed <3 word instructions, duplicates, OOB points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | Augmented 1-4 word instructions into 7-12 word natural language |
| **Total** | **~26,055** | |
### Data Filtering (UGround)
Original UGround-V1-8k (~8K samples) was filtered to ~6.9K:
- Removed instructions with fewer than 3 words (too vague)
- Removed duplicate (image, instruction) pairs
- Removed out-of-bounds coordinate points
- Removed corrupted/missing images
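The first three rules can be sketched as a predicate over one sample (the field names `instruction`, `point`, and `image_path` are assumptions about the dataset schema; the corrupted-image check is omitted since it needs actual file I/O):

```python
def keep_sample(sample, image_width, image_height, seen):
    """Apply the UGround filtering rules described above (hypothetical schema)."""
    instruction = sample["instruction"].strip()
    x, y = sample["point"]
    # Rule 1: drop instructions with fewer than 3 words (too vague).
    if len(instruction.split()) < 3:
        return False
    # Rule 3: drop out-of-bounds coordinate points.
    if not (0 <= x < image_width and 0 <= y < image_height):
        return False
    # Rule 2: drop duplicate (image, instruction) pairs.
    key = (sample["image_path"], instruction)
    if key in seen:
        return False
    seen.add(key)
    return True
```

The `seen` set is threaded through the loop by the caller, so deduplication works across the whole dataset rather than per sample.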
## Training Curve
- Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
- Token accuracy reached 92%+ on validation set
- No significant overfitting observed with 3 epochs (unlike 4B v4 which overfit at epoch 3+)
- VRAM usage: ~31 GB per GPU (of 48 GB available)
## Key Design Decisions
- 3M pixels (MAX_PIXELS=3,014,656): Critical for ScreenSpot-Pro where average UI target is only 0.07% of screen area on 2560x1440+ screenshots
- LoRA rank 32 (vs 16 on 4B): Bigger model benefits from more trainable parameters
- LR 5e-5 (vs 1e-4 on 4B): Lower learning rate for larger model stability
- 3 epochs (vs 4 on 4B): Avoided overfitting observed in 4B v4 training
- Frozen ViT + aligner: Only LLM layers trained via LoRA — preserves visual encoder quality
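A back-of-envelope check of the first point, using only the numbers quoted above (the uniform square-root downscaling is an assumption about Qwen-style image preprocessing):

```python
import math

max_pixels = 3_014_656     # 3M pixel budget used in training
w, h = 2560, 1440          # typical ScreenSpot-Pro resolution
total = w * h              # 3,686,400 pixels

# If the image exceeds the budget, assume it is downscaled uniformly to fit.
scale = min(1.0, math.sqrt(max_pixels / total))

target_area = 0.0007 * total      # average target is ~0.07% of screen area
side = math.sqrt(target_area)     # side of an equivalent square target

print(f"scale factor: {scale:.3f}")           # ~0.904, little detail lost
print(f"target side:  {side * scale:.0f}px")  # target still ~46 px after resize
```

At a smaller budget (say 1M pixels) the same target would shrink to roughly 26 px per side, which is where small-target grounding starts to degrade.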
## Limitations
- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT-only — no RL/GRPO applied yet (further gains expected)
- No training data from professional software domains (all training data is general desktop/mobile/web)
## License
Apache 2.0 (same as base model)