---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - gui-grounding
  - screenspot
  - lora
  - qwen3.5
datasets:
  - showlab/ShowUI-desktop
  - zonghanHZH/UGround-V1-8k
  - zonghanHZH/AMEX-8k
  - Hcompany/WebClick
metrics:
  - accuracy
pipeline_tag: image-text-to-text
---

# Kodeseer-9B

A LoRA fine-tuned Qwen3.5-9B model for GUI element grounding — predicting the (x, y) coordinates of UI elements from screenshots given natural language instructions.

## Results

| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |

### ScreenSpot-V2 Breakdown

| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |

### ScreenSpot-Pro Full Breakdown (1,581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| Overall | 65.0% | | |

## Comparison with State-of-the-Art

### ScreenSpot-V2

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |

### ScreenSpot-Pro

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%\* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |

\* MVP is a training-free inference trick.

### ScreenSpot Original

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |

## Usage

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16 and attach the LoRA adapter
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens (strip the prompt)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
```

## Coordinate Format

The model predicts click coordinates in the `<|box_start|>(x,y)<|box_end|>` format, where x and y are in the [0, 1000] range. To convert to pixel coordinates:

```python
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
```
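Putting the two steps together, a small helper can extract the point from the raw response and map it to pixels (a sketch; `parse_click` is a hypothetical name, and it assumes the documented output format):

```python
import re

def parse_click(response: str, image_width: int, image_height: int):
    """Extract the (x, y) point from a model response such as
    "<|box_start|>(512,340)<|box_end|>" and scale it from the
    [0, 1000] range to pixel coordinates."""
    m = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", response)
    if m is None:
        raise ValueError(f"No coordinates found in: {response!r}")
    x, y = int(m.group(1)), int(m.group(2))
    return int(x / 1000 * image_width), int(y / 1000 * image_height)

print(parse_click("<|box_start|>(512,340)<|box_end|>", 1920, 1080))
# (983, 367)
```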

## Training Details

- Base model: Qwen/Qwen3.5-9B (9.65B parameters)
- Method: LoRA (rank 32, alpha 64, all-linear targets)
- Frozen: ViT + aligner (only LLM LoRA trained)
- MAX_PIXELS: 3,014,656 (~3M; critical for ScreenSpot-Pro's tiny targets)
- Epochs: 3
- Learning rate: 5e-5, cosine scheduler, 5% warmup
- Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
- Hardware: 8× NVIDIA A40 (48 GB each)
- Training time: ~4.5 hours
- Best checkpoint: step 3100 (selected by eval_loss)
- dtype: bfloat16
- Framework: ms-swift 4.0.2, transformers 5.2.0
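For reference, the LoRA settings above can be expressed as a `peft` `LoraConfig` (an illustrative sketch only; training actually used ms-swift, whose configuration surface differs):

```python
from peft import LoraConfig

# Config fragment mirroring the hyperparameters listed above
lora_config = LoraConfig(
    r=32,                         # LoRA rank
    lora_alpha=64,                # alpha = 2 × rank
    target_modules="all-linear",  # apply LoRA to every linear layer of the LLM
    task_type="CAUSAL_LM",
)
```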

## Training Data (~26K samples)

| Source | Samples | Description |
|---|---|---|
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality filtered (removed <3-word instructions, duplicates, OOB points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | Augmented 1-4 word instructions into 7-12 word natural language |
| Total | ~26,055 | |

### Data Filtering (UGround)

The original UGround-V1-8k set (~8K samples) was filtered down to ~6.9K:

- Removed instructions with fewer than 3 words (too vague)
- Removed duplicate (image, instruction) pairs
- Removed out-of-bounds coordinate points
- Removed corrupted/missing images
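The filtering steps above can be sketched as a single predicate (hypothetical code; `keep_sample` and the field names `instruction`, `image`, and `point` are assumptions about the sample schema, with points normalized to [0, 1000]):

```python
def keep_sample(sample: dict, seen: set) -> bool:
    """Apply the four UGround cleaning rules described above."""
    instruction = sample["instruction"].strip()
    if len(instruction.split()) < 3:                 # fewer than 3 words: too vague
        return False
    key = (sample["image"], instruction)             # duplicate (image, instruction) pair
    if key in seen:
        return False
    x, y = sample["point"]
    if not (0 <= x <= 1000 and 0 <= y <= 1000):      # out-of-bounds coordinate
        return False
    if sample["image"] is None:                      # corrupted/missing image
        return False
    seen.add(key)
    return True

samples = [
    {"instruction": "ok", "image": "a.png", "point": (10, 10)},                        # too short
    {"instruction": "click the login button", "image": "a.png", "point": (500, 400)},  # kept
    {"instruction": "click the login button", "image": "a.png", "point": (500, 400)},  # duplicate
    {"instruction": "open the settings menu", "image": "b.png", "point": (1200, 50)},  # OOB
]
seen = set()
kept = [s for s in samples if keep_sample(s, seen)]
print(len(kept))  # 1
```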

## Training Curve

- Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
- Token accuracy reached 92%+ on the validation set
- No significant overfitting observed over 3 epochs (unlike the 4B v4 run, which overfit from epoch 3 onward)
- VRAM usage: ~31 GB per GPU (of 48 GB available)

## Key Design Decisions

1. 3M pixels (MAX_PIXELS=3,014,656): critical for ScreenSpot-Pro, where the average UI target covers only 0.07% of screen area on 2560x1440+ screenshots
2. LoRA rank 32 (vs. 16 on 4B): the bigger model benefits from more trainable parameters
3. LR 5e-5 (vs. 1e-4 on 4B): a lower learning rate for larger-model stability
4. 3 epochs (vs. 4 on 4B): avoids the overfitting observed in the 4B v4 training run
5. Frozen ViT + aligner: only LLM layers trained via LoRA, preserving visual encoder quality
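To illustrate why the pixel budget matters for high-resolution screenshots (a rough sketch only; the actual Qwen processor also aligns dimensions to the vision patch size, which `downscaled_size`, a hypothetical helper, ignores):

```python
import math

MAX_PIXELS = 3_014_656  # ~3M pixel budget used in training

def downscaled_size(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Return the image size after uniform downscaling to fit the pixel budget."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return int(width * scale), int(height * scale)

# A 2560x1440 screenshot (3,686,400 pixels) only shrinks slightly under a 3M
# budget, so tiny UI targets stay resolvable; a much smaller budget would blur them.
print(downscaled_size(2560, 1440))
```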

## Limitations

- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT-only; no RL/GRPO applied yet (further gains expected)
- No training data from professional software domains (all training data is general desktop/mobile/web)

## License

Apache 2.0 (same as the base model)