# Kodeseer-9B
A LoRA fine-tune of Qwen3.5-9B for GUI element grounding: predicting the (x, y) coordinates of UI elements in a screenshot from a natural-language instruction.
## Results
| Benchmark | Score | Rank |
|---|---|---|
| ScreenSpot-V2 | 94.7% | #7 overall |
| ScreenSpot-Pro | 65.0% | #9 overall |
| ScreenSpot Original | 92.1% | #1 overall |
### ScreenSpot-V2 Breakdown
| Split | Accuracy |
|---|---|
| Mobile | 95.2% |
| Desktop | 94.6% |
| Web | 92.9% |
| Overall | 94.7% |
### ScreenSpot-Pro Full Breakdown (1,581 samples)

| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| eviews | 90.0% | word | 88.1% |
| powerpoint | 82.9% | unreal_engine | 80.0% |
| vmware | 78.0% | matlab | 77.4% |
| davinci | 75.0% | solidworks | 72.7% |
| linux_common | 70.0% | photoshop | 68.6% |
| android_studio | 66.2% | pycharm | 66.7% |
| quartus | 64.4% | inventor | 64.3% |
| vivado | 63.7% | vscode | 61.8% |
| blender | 60.6% | windows_common | 59.3% |
| illustrator | 58.1% | macos_common | 53.8% |
| excel | 51.6% | premiere | 48.1% |
| stata | 46.9% | autocad | 41.2% |
| fruitloops | 40.4% | origin | 38.7% |
| **Overall** | **65.0%** | | |
## Comparison with State-of-the-Art

### ScreenSpot-V2
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | MAI-UI | 32B | 96.5% |
| 2 | OmegaUse | 30B-A3B MoE | 96.3% |
| 3 | UI-Venus-1.5 | 30B-A3B MoE | 96.2% |
| 4 | UI-Venus-1.5 | 8B | 95.9% |
| 5 | UI-Venus-1.0 | 72B | 95.3% |
| 6 | MAI-UI / GTA1 | 8B / 32B | 95.2% |
| 7 | Kodeseer-9B | 9B | 94.7% |
| 8 | UI-TARS 1.5 | 7B | 94.2% |
| 9 | UI-Venus-1.0 | 7B | 94.1% |
| 10 | Step-GUI | 4B | 93.6% |
### ScreenSpot-Pro
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Holo2 (3-step) | 235B-A22B MoE | 78.5% |
| 2 | MAI-UI + zoom-in | 32B | 73.5% |
| 3 | Holo2 (1-step) | 235B-A22B MoE | 70.6% |
| 4 | UI-Venus-1.5 | 30B-A3B MoE | 69.6% |
| 5 | UI-Venus-1.5 | 8B | 68.4% |
| 6 | MAI-UI | 32B | 67.9% |
| 7 | Holo2 | 30B-A3B MoE | 66.1% |
| 8 | MAI-UI | 8B | 65.8% |
| 9 | Kodeseer-9B | 9B | 65.0% |
| 10 | Qwen3-VL + MVP | 8B | 65.3%* |
| 11 | GTA1 | 32B | 63.6% |
| 12 | UI-TARS 1.5 | 7B | 61.6% |
\*MVP is a training-free inference-time technique.
### ScreenSpot Original
| Rank | Model | Size | Score |
|---|---|---|---|
| 1 | Kodeseer-9B | 9B | 92.1% |
| 2 | GUI-G2 | 7B | 92.0% |
| 3 | GUI-Actor-7B + Verifier | 7B | 89.7% |
| 4 | UI-TARS-7B | 7B | 89.5% |
| 5 | UGround-V1 | 72B | 89.4% |
## Usage

```python
import torch
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
from PIL import Image

base_model = "Qwen/Qwen3.5-9B"
adapter = "mdabis/qwen35-9b-gui-grounding-v1"

# Load the base model, then attach the LoRA adapter from the best checkpoint.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoint-3100")
model.eval()

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the submit button"

messages = [
    {"role": "system", "content": (
        "You are a GUI grounding assistant. Given a screenshot and a user instruction, "
        "return the exact coordinates of the target UI element using the format: "
        "<|box_start|>(x,y)<|box_end|> where x and y are in [0, 1000] range."
    )},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]},
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, keeping the special box tokens.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
# Example output: <|box_start|>(512,340)<|box_end|>
```
## Coordinate Format

The model predicts click coordinates in `<|box_start|>(x,y)<|box_end|>` format, where x and y are normalized to the [0, 1000] range. To convert to pixel coordinates:

```python
pixel_x = int(x / 1000 * image_width)
pixel_y = int(y / 1000 * image_height)
```
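The model's raw output string can be turned into pixel positions with a small helper (a sketch; `parse_box` is not part of any released code):

```python
import re

def parse_box(response: str, image_width: int, image_height: int):
    """Parse '<|box_start|>(x,y)<|box_end|>' into pixel coordinates, or None."""
    m = re.search(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>", response)
    if m is None:
        return None
    x, y = int(m.group(1)), int(m.group(2))
    # Coordinates are normalized to [0, 1000]; scale to the actual image size.
    return int(x / 1000 * image_width), int(y / 1000 * image_height)

print(parse_box("<|box_start|>(512,340)<|box_end|>", 1920, 1080))  # (983, 367)
```

Returning `None` on a parse failure lets the caller fall back to a retry or a no-op click instead of crashing on malformed output.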
## Training Details
- Base model: Qwen/Qwen3.5-9B (9.65B parameters)
- Method: LoRA (rank 32, alpha 64, all-linear targets)
- Frozen: ViT + aligner (only LLM LoRA trained)
- MAX_PIXELS: 3,014,656 (3M — critical for ScreenSpot-Pro's tiny targets)
- Epochs: 3
- Learning rate: 5e-5, cosine scheduler, 5% warmup
- Effective batch size: 24 (1 per device × 3 grad_accum × 8 GPUs)
- Hardware: 8x NVIDIA A40 (48GB each)
- Training time: ~4.5 hours
- Best checkpoint: step 3100 (selected by eval_loss)
- dtype: bfloat16
- Framework: ms-swift 4.0.2, transformers 5.2.0
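The adapter hyperparameters above correspond roughly to the following `peft` configuration (a sketch for orientation only; the actual run was launched through ms-swift 4.0.2, whose CLI flags differ):

```python
from peft import LoraConfig

# Sketch of the LoRA setup described above; not the actual training config.
lora_config = LoraConfig(
    r=32,                         # LoRA rank
    lora_alpha=64,                # alpha = 2 * rank
    target_modules="all-linear",  # adapt every linear layer in the LLM
    task_type="CAUSAL_LM",
)
```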
## Training Data (~26K samples)

| Source | Samples | Description |
|---|---|---|
| ShowUI-desktop | 7,496 | General desktop UI screenshots |
| UGround-V1-8k (filtered) | ~6,920 | Web UI, quality filtered (removed <3 word instructions, duplicates, OOB points) |
| AMEX-8k | 8,000 | Mobile UI (e-commerce/financial) |
| Hcompany/WebClick | 1,639 | Web interaction data |
| Paraphrased instructions | 2,000 | Augmented 1-4 word instructions into 7-12 word natural language |
| **Total** | **~26,055** | |
### Data Filtering (UGround)
Original UGround-V1-8k (~8K samples) was filtered to ~6.9K:
- Removed instructions with fewer than 3 words (too vague)
- Removed duplicate (image, instruction) pairs
- Removed out-of-bounds coordinate points
- Removed corrupted/missing images
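The first three rules can be sketched as a predicate over one sample (the field names `instruction`, `point`, and `image_path` are assumptions about the dataset schema; the corrupted-image check is omitted since it needs actual file I/O):

```python
def keep_sample(sample, image_width, image_height, seen):
    """Apply the UGround filtering rules described above (hypothetical schema)."""
    instruction = sample["instruction"].strip()
    x, y = sample["point"]
    # Rule 1: drop instructions with fewer than 3 words (too vague).
    if len(instruction.split()) < 3:
        return False
    # Rule 3: drop out-of-bounds coordinate points.
    if not (0 <= x < image_width and 0 <= y < image_height):
        return False
    # Rule 2: drop duplicate (image, instruction) pairs.
    key = (sample["image_path"], instruction)
    if key in seen:
        return False
    seen.add(key)
    return True
```

The `seen` set is threaded through the loop by the caller, so deduplication works across the whole dataset rather than per sample.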
## Training Curve
- Eval loss decreased steadily from 0.38 (step 100) to ~0.229 (step 2100), then plateaued
- Token accuracy reached 92%+ on validation set
- No significant overfitting observed with 3 epochs (unlike 4B v4 which overfit at epoch 3+)
- VRAM usage: ~31 GB per GPU (of 48 GB available)
## Key Design Decisions
- 3M pixels (MAX_PIXELS=3,014,656): Critical for ScreenSpot-Pro where average UI target is only 0.07% of screen area on 2560x1440+ screenshots
- LoRA rank 32 (vs 16 on 4B): Bigger model benefits from more trainable parameters
- LR 5e-5 (vs 1e-4 on 4B): Lower learning rate for larger model stability
- 3 epochs (vs 4 on 4B): Avoided overfitting observed in 4B v4 training
- Frozen ViT + aligner: Only LLM layers trained via LoRA — preserves visual encoder quality
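A back-of-envelope check of the first point, using only the numbers quoted above (the uniform square-root downscaling is an assumption about Qwen-style image preprocessing):

```python
import math

max_pixels = 3_014_656     # 3M pixel budget used in training
w, h = 2560, 1440          # typical ScreenSpot-Pro resolution
total = w * h              # 3,686,400 pixels

# If the image exceeds the budget, assume it is downscaled uniformly to fit.
scale = min(1.0, math.sqrt(max_pixels / total))

target_area = 0.0007 * total      # average target is ~0.07% of screen area
side = math.sqrt(target_area)     # side of an equivalent square target

print(f"scale factor: {scale:.3f}")           # ~0.904, little detail lost
print(f"target side:  {side * scale:.0f}px")  # target still ~46 px after resize
```

At a smaller budget (say 1M pixels) the same target would shrink to roughly 26 px per side, which is where small-target grounding starts to degrade.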
## Limitations
- Trained on English instructions only
- Weakest on niche professional software (AutoCAD 41.2%, FruitLoops 40.4%, Origin 38.7%)
- SFT-only — no RL/GRPO applied yet (further gains expected)
- No training data from professional software domains (all training data is general desktop/mobile/web)
## License
Apache 2.0 (same as base model)