---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen3.5-9B
tags:
  - gui-grounding
  - screenspot
  - lora
  - qwen3.5
datasets:
  - showlab/ShowUI-desktop
  - zonghanHZH/UGround-V1-8k
  - zonghanHZH/AMEX-8k
  - Hcompany/WebClick
metrics:
  - accuracy
pipeline_tag: image-text-to-text
---

# Kodeseer-9B

A LoRA fine-tuned Qwen3.5-9B model for GUI element grounding — predicting the (x, y) coordinates of UI elements from screenshots given natural language instructions.

## Results

| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |

### ScreenSpot-V2 Breakdown

| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |

### ScreenSpot-Pro Full Breakdown (1,581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| Overall | 65.0% | | |

## Comparison with State-of-the-Art

### ScreenSpot-V2

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |

### ScreenSpot-Pro

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%\* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |

\* MVP is a training-free inference trick.

### ScreenSpot Original

| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |

## Usage

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model in bfloat16 and attach the LoRA adapter
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens (strip the prompt)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
```

## Coordinate Format

The model predicts click coordinates in the `<|box_start|>(x,y)<|box_end|>` format, where x and y are in the [0, 1000] range. To convert to pixel coordinates:

```python
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
```
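Putting the two steps together, a small helper can extract the point from the raw response and map it to pixels (a sketch; `parse_click` is a hypothetical name, and it assumes the documented output format):

```python
import re

def parse_click(response: str, image_width: int, image_height: int):
    """Extract the (x, y) point from a model response such as
    "<|box_start|>(512,340)<|box_end|>" and scale it from the
    [0, 1000] range to pixel coordinates."""
    m = re.search(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", response)
    if m is None:
        raise ValueError(f"No coordinates found in: {response!r}")
    x, y = int(m.group(1)), int(m.group(2))
    return int(x / 1000 * image_width), int(y / 1000 * image_height)

print(parse_click("<|box_start|>(512,340)<|box_end|>", 1920, 1080))
# (983, 367)
```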

## Training Details

- Base model: Qwen/Qwen3.5-9B (9.65B parameters)
- Method: LoRA (rank 32, alpha 64, all-linear targets)
- Frozen: ViT + aligner (only LLM LoRA trained)
- MAX_PIXELS: 3,014,656 (~3M; critical for ScreenSpot-Pro's tiny targets)
- Epochs: 3
- Learning rate: 5e-5, cosine scheduler, 5% warmup
- Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
- Hardware: 8× NVIDIA A40 (48 GB each)
- Training time: ~4.5 hours
- Best checkpoint: step 3100 (selected by eval_loss)
- dtype: bfloat16
- Framework: ms-swift 4.0.2, transformers 5.2.0
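For reference, the LoRA settings above can be expressed as a `peft` `LoraConfig` (an illustrative sketch only; training actually used ms-swift, whose configuration surface differs):

```python
from peft import LoraConfig

# Config fragment mirroring the hyperparameters listed above
lora_config = LoraConfig(
    r=32,                         # LoRA rank
    lora_alpha=64,                # alpha = 2 × rank
    target_modules="all-linear",  # apply LoRA to every linear layer of the LLM
    task_type="CAUSAL_LM",
)
```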

## Training Data (~26K samples)

| Source | Samples | Description |
|---|---|---|
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality filtered (removed <3-word instructions, duplicates, OOB points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | Augmented 1-4 word instructions into 7-12 word natural language |
| Total | ~26,055 | |

### Data Filtering (UGround)

The original UGround-V1-8k set (~8K samples) was filtered down to ~6.9K:

- Removed instructions with fewer than 3 words (too vague)
- Removed duplicate (image, instruction) pairs
- Removed out-of-bounds coordinate points
- Removed corrupted/missing images
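The filtering steps above can be sketched as a single predicate (hypothetical code; `keep_sample` and the field names `instruction`, `image`, and `point` are assumptions about the sample schema, with points normalized to [0, 1000]):

```python
def keep_sample(sample: dict, seen: set) -> bool:
    """Apply the four UGround cleaning rules described above."""
    instruction = sample["instruction"].strip()
    if len(instruction.split()) < 3:                 # fewer than 3 words: too vague
        return False
    key = (sample["image"], instruction)             # duplicate (image, instruction) pair
    if key in seen:
        return False
    x, y = sample["point"]
    if not (0 <= x <= 1000 and 0 <= y <= 1000):      # out-of-bounds coordinate
        return False
    if sample["image"] is None:                      # corrupted/missing image
        return False
    seen.add(key)
    return True

samples = [
    {"instruction": "ok", "image": "a.png", "point": (10, 10)},                        # too short
    {"instruction": "click the login button", "image": "a.png", "point": (500, 400)},  # kept
    {"instruction": "click the login button", "image": "a.png", "point": (500, 400)},  # duplicate
    {"instruction": "open the settings menu", "image": "b.png", "point": (1200, 50)},  # OOB
]
seen = set()
kept = [s for s in samples if keep_sample(s, seen)]
print(len(kept))  # 1
```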

## Training Curve

- Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
- Token accuracy reached 92%+ on the validation set
- No significant overfitting observed over 3 epochs (unlike the 4B v4 run, which overfit from epoch 3 onward)
- VRAM usage: ~31 GB per GPU (of 48 GB available)

## Key Design Decisions

1. 3M pixels (MAX_PIXELS=3,014,656): critical for ScreenSpot-Pro, where the average UI target covers only 0.07% of screen area on 2560x1440+ screenshots
2. LoRA rank 32 (vs. 16 on 4B): the bigger model benefits from more trainable parameters
3. LR 5e-5 (vs. 1e-4 on 4B): a lower learning rate for larger-model stability
4. 3 epochs (vs. 4 on 4B): avoids the overfitting observed in the 4B v4 training run
5. Frozen ViT + aligner: only LLM layers trained via LoRA, preserving visual encoder quality
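To illustrate why the pixel budget matters for high-resolution screenshots (a rough sketch only; the actual Qwen processor also aligns dimensions to the vision patch size, which `downscaled_size`, a hypothetical helper, ignores):

```python
import math

MAX_PIXELS = 3_014_656  # ~3M pixel budget used in training

def downscaled_size(width: int, height: int, max_pixels: int = MAX_PIXELS):
    """Return the image size after uniform downscaling to fit the pixel budget."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return int(width * scale), int(height * scale)

# A 2560x1440 screenshot (3,686,400 pixels) only shrinks slightly under a 3M
# budget, so tiny UI targets stay resolvable; a much smaller budget would blur them.
print(downscaled_size(2560, 1440))
```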

## Limitations

- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT-only; no RL/GRPO applied yet (further gains expected)
- No training data from professional software domains (all training data is general desktop/mobile/web)

## License

Apache 2.0 (same as the base model)