Kodeseer-9B

A LoRA fine-tune of Qwen3.5-9B for GUI element grounding: given a screenshot and a natural-language instruction, it predicts the (x, y) coordinates of the target UI element.

Results

| Benchmark | Score | Rank |
| --- | --- | --- |
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |

ScreenSpot-V2 Breakdown

| Split | Accuracy |
| --- | --- |
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |

ScreenSpot-Pro Full Breakdown (1581 samples)

| Category | Accuracy | Category | Accuracy |
| --- | --- | --- | --- |
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| Overall | 65.0% | | |

Comparison with State-of-the-Art

ScreenSpot-V2

| Rank | Model | Size | Score |
| --- | --- | --- | --- |
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |

ScreenSpot-Pro

| Rank | Model | Size | Score |
| --- | --- | --- | --- |
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |

*MVP is a training-free inference trick applied on top of Qwen3-VL; this entry is listed out of score order.

ScreenSpot Original

| Rank | Model | Size | Score |
| --- | --- | --- | --- |
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |

Usage

import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16, then attach the LoRA adapter (step-3100 checkpoint).
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

# The system prompt fixes the output format; the user turn carries the screenshot and instruction.
messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

# Render the chat template to text (thinking disabled), then tokenize text and image together.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

# Greedy decoding; the answer is short, so 64 new tokens is plenty.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Drop the prompt tokens and keep special tokens so the box markers stay in the decoded string.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>

Coordinate Format

The model predicts click coordinates in the format <|box_start|>(x,y)<|box_end|>, where x and y are normalized to the range [0, 1000] relative to the image width and height. To convert to pixel coordinates:

pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
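
The conversion above assumes x and y have already been pulled out of the response string. A minimal sketch of that parsing step follows. The helper name parse_click, the integer-only regular expression, and the clamping of x = 1000 to the last valid pixel are illustrative assumptions, not part of the released code.

```python
import re

# Matches the documented output format, e.g. "<|box_start|>(512,340)<|box_end|>".
# Assumption: the model emits integer coordinates, as in the example output.
_BOX_RE = re.compile(r"<\|box_start\|>\(\s*(\d+)\s*,\s*(\d+)\s*\)<\|box_end\|>")


def parse_click(response, image_width, image_height):
    """Return (pixel_x, pixel_y) for the first box in `response`, or None if absent."""
    match = _BOX_RE.search(response)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    # Same conversion as in the model card: scale from [0, 1000] to pixels.
    pixel_x = int(x / 1000 * image_width)
    pixel_y = int(y / 1000 * image_height)
    # Assumption: clamp so that a coordinate of 1000 maps to the last valid pixel index.
    pixel_x = min(max(pixel_x, 0), image_width - 1)
    pixel_y = min(max(pixel_y, 0), image_height - 1)
    return pixel_x, pixel_y
```

With a PIL image, whose size attribute is (width, height), this could be called as parse_click(response, *image.size).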

Training Details

  • Base model: Qwen/Qwen3.5-9B (9.65B parameters)
  • Method: LoRA (rank 32, alpha 64, all-linear targets)
  • Frozen: ViT + aligner (only LLM LoRA trained)
  • MAX_PIXELS: 3,014,656 (3M — critical for ScreenSpot-Pro's tiny targets)
  • Epochs: 3
  • Learning rate: 5e-5, cosine scheduler, 5% warmup
  • Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
  • Hardware: 8x NVIDIA A40 (48GB each)
  • Training time: ~4.5 hours
  • Best checkpoint: step 3100 (selected by eval_loss)
  • dtype: bfloat16
  • Framework: ms-swift 4.0.2, transformers 5.2.0
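
As a rough cross-check of the numbers above, the step count and learning-rate schedule can be sketched in plain Python. This assumes all ~26,055 samples are used for training (any validation split would reduce the count) and uses a common linear-warmup plus cosine-decay formulation; ms-swift may round or schedule differently, so the values are estimates only.

```python
import math

NUM_SAMPLES = 26_055      # total from the training-data table (assumes no validation split)
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 3
NUM_GPUS = 8
EPOCHS = 3
BASE_LR = 5e-5
WARMUP_RATIO = 0.05

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS   # 1 x 3 x 8
steps_per_epoch = math.ceil(NUM_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to zero (one common formulation)."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions a full run takes roughly 3.26K optimizer steps, which would place the step-3100 checkpoint late in the third epoch.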

Training Data (~26K samples)

| Source | Samples | Description |
| --- | --- | --- |
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality-filtered (removed instructions under 3 words, duplicates, out-of-bounds points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | 1-4 word instructions expanded into 7-12 word natural-language instructions |
| Total | ~26,055 | |

Data Filtering (UGround)

The original UGround-V1-8k set (~8K samples) was filtered down to ~6.9K samples:

  • Removed instructions with fewer than 3 words (too vague)
  • Removed duplicate (image, instruction) pairs
  • Removed out-of-bounds coordinate points
  • Removed corrupted/missing images
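
The four rules above can be sketched as a single pass over the samples. The record schema here (image, instruction, point, size) is hypothetical and chosen only for illustration; the actual filtering script and the UGround data layout are not shown in this card.

```python
def filter_samples(samples):
    """Apply the four filtering rules to an iterable of sample dicts.

    Hypothetical schema: {"image": str, "instruction": str,
    "point": (x, y) in pixels, "size": (width, height) or None}.
    "size" is None when the image is missing or could not be decoded.
    """
    seen = set()
    kept = []
    for sample in samples:
        instruction = sample["instruction"].strip()
        # Rule 1: drop instructions with fewer than 3 words.
        if len(instruction.split()) < 3:
            continue
        # Rule 4: drop corrupted or missing images.
        if sample.get("size") is None:
            continue
        # Rule 3: drop points that fall outside the image.
        width, height = sample["size"]
        x, y = sample["point"]
        if not (0 <= x < width and 0 <= y < height):
            continue
        # Rule 2: drop duplicate (image, instruction) pairs, keeping the first.
        key = (sample["image"], instruction)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

Duplicates are checked last so that a rejected sample never blocks a later valid sample with the same (image, instruction) pair.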

Training Curve

  • Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
  • Token accuracy reached 92%+ on validation set
  • No significant overfitting observed over 3 epochs (unlike the 4B v4 model, which overfit at epoch 3 and beyond)
  • VRAM usage: ~31 GB per GPU (of 48 GB available)

Key Design Decisions

  1. 3M pixels (MAX_PIXELS=3,014,656): Critical for ScreenSpot-Pro, where the average UI target covers only 0.07% of the screen area on 2560x1440+ screenshots
  2. LoRA rank 32 (vs 16 on 4B): The larger model benefits from more trainable parameters
  3. LR 5e-5 (vs 1e-4 on 4B): A lower learning rate keeps training stable on the larger model
  4. 3 epochs (vs 4 on 4B): Avoids the overfitting observed in 4B v4 training
  5. Frozen ViT + aligner: Only the LLM layers are trained via LoRA, which preserves visual encoder quality
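
To make the first decision concrete, the arithmetic below estimates how large a 0.07% target remains after a 2560x1440 screenshot is downscaled to a given pixel budget. It is a back-of-the-envelope sketch: it assumes a square target and a uniform resize that just fits the budget, and the 1,000,000-pixel budget is a hypothetical comparison point, not a setting used in training.

```python
import math

MAX_PIXELS = 3_014_656  # budget used for this model


def target_side_after_resize(width, height, target_fraction, max_pixels):
    """Approximate side length (px) of a square target after downscaling.

    Assumes the image is shrunk uniformly until its area fits `max_pixels`;
    images already under the budget are left unchanged.
    """
    area = width * height
    area_scale = min(1.0, max_pixels / area)
    target_area = area * target_fraction * area_scale
    return math.sqrt(target_area)


side_3m = target_side_after_resize(2560, 1440, 0.0007, MAX_PIXELS)
side_1m = target_side_after_resize(2560, 1440, 0.0007, 1_000_000)  # hypothetical smaller budget
print(f"3M budget: ~{side_3m:.0f} px per side; 1M budget: ~{side_1m:.0f} px per side")
```

Under these assumptions the target keeps a side of about 46 px at the 3M budget, versus about 26 px at a 1M budget.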

Limitations

  • Trained on English instructions only
  • Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
  • SFT-only — no RL/GRPO applied yet (further gains expected)
  • No training data from professional software domains (all training data is general desktop/mobile/web)

License

Apache 2.0 (same as base model)
