GUI-G2-3B + CCF: Inference-time icon refinement for screen grounding
A drop-in inference wrapper that improves GUI-G2-3B's icon-grounding accuracy by +2.2pp on ScreenSpot-v2 at zero training cost. The base weights are unchanged; everything is in the inference pipeline.
Try it live (Azure A100, scale-to-zero): https://guigrounding.whiteplant-27564a0e.eastus.azurecontainerapps.io

Warm latency: ~250-400ms server time / ~700-900ms wall time in fast mode (CCF), and ~900ms server / ~1.6s wall in accurate mode (6-pass self-consistency with real agreement-based confidence). The playground also streams the coarse CCF prediction at ~600ms wall, so a tentative dot appears before the refined pass completes. Cold start is ~90s the first time after idle.
Real samples from ScreenSpot-v2. Each tile shows the same instruction predicted by GUI-G2-3B alone (red X) and GUI-G2-3B + CCF (green check, inside the ground-truth bbox). The full per-sample run backing these picks is in benchmarks/demo_candidates.jsonl on the GitHub repo, so the picks are verifiable.
What this is
GUI-G2-3B (89.2% on ScreenSpot-v2) is a strong open-source 3B grounding model. Its main weakness is on small icons, where it lands at 80.5% vs 96.0% on text. We add a single inference-time technique -- Cursor-Centric Focusing (CCF) -- that wraps the base model with a coarse-then-refined prediction loop:
- Run the model once on the full screenshot to get a coarse `(x, y)` prediction
- Crop a window around that point at 2x zoom (so small icons become big icons)
- Run the model again on the crop, then map the refined prediction back to original-image coordinates
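To make the loop concrete, here is a minimal sketch of the coarse-then-refined flow around a generic point predictor. It is not the actual `ccf_predict_bbox` implementation from `cursor_ccf.py`; the helper name, the fixed 500px crop window, and the clamping logic are illustrative assumptions.

```python
from PIL import Image

def ccf_sketch(predict_fn, image, instruction, zoom_factor=2.0, crop_size=500):
    """Coarse-then-refined prediction around a generic point predictor.
    `predict_fn(img, instr)` must return an (x, y) click point in that
    image's own pixel coordinates. Sketch only, not the cursor_ccf API."""
    # Pass 1: coarse prediction on the full screenshot.
    cx, cy = predict_fn(image, instruction)

    # Crop a fixed window centred on the coarse point, clamped to the image.
    half = crop_size // 2
    left = max(0, min(int(cx) - half, image.width - crop_size))
    top = max(0, min(int(cy) - half, image.height - crop_size))
    crop = image.crop((left, top, left + crop_size, top + crop_size))

    # Zoom the crop so small icons become big icons for the second pass.
    zoomed = crop.resize((int(crop.width * zoom_factor),
                          int(crop.height * zoom_factor)))

    # Pass 2: refined prediction on the zoomed crop, mapped back to
    # original-image coordinates.
    rx, ry = predict_fn(zoomed, instruction)
    return left + rx / zoom_factor, top + ry / zoom_factor
```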
CCF generalizes the technique from the GUI-Cursor paper to any bbox-style grounding model. We add three engineering details that make it work in production:
- Greedy-only. Earlier stochastic-sampling implementations regressed because temperature noise corrupted already-correct predictions. Both passes run with `do_sample=False`.
- Coarse downsizing. The coarse pass runs at 1.5M pixels (vs 12.8M native), cutting wall time roughly 50% on 1920x1080 screenshots without measurable accuracy impact; only the refined pass needs native resolution, and only on the cropped region.
- Type-aware gate (optional). A short keyword classifier on the instruction skips the refinement pass when the target is obviously a text element (where refinement adds drift). Adds +1.2pp on mobile over ungated CCF; a sketch of such a gate follows this list.
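The type-aware gate is deliberately simple. A minimal sketch of a keyword-based instruction classifier in the spirit of `classify_instruction` (the keyword lists, threshold, and return convention here are illustrative assumptions, not the ones shipped in `cursor_ccf.py`):

```python
import re

# Illustrative keyword lists; the actual classify_instruction in
# cursor_ccf.py may use different vocabularies and rules.
ICON_HINTS = {"icon", "button", "toggle", "checkbox", "arrow", "menu",
              "logo", "avatar", "star", "close", "settings"}
TEXT_HINTS = {"text", "label", "title", "heading", "link", "word",
              "sentence", "paragraph", "field", "type", "read"}

def classify_instruction_sketch(instruction: str) -> str:
    """Return 'text' (skip the refinement pass) or 'icon' (run it)."""
    tokens = set(re.findall(r"[a-z]+", instruction.lower()))
    text_score = len(tokens & TEXT_HINTS)
    icon_score = len(tokens & ICON_HINTS)
    # Default to refinement: per the benchmark table below, CCF mainly
    # costs accuracy on clearly textual targets.
    return "text" if text_score > icon_score else "icon"
```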
Benchmark
ScreenSpot-v2, full set (1,272 samples), greedy decoding, MAX_PIXELS=12,845,056, H200 GPU with flash-attn 2.
| Configuration | Overall | Desktop | Mobile | Web | Icon | Text |
|---|---|---|---|---|---|---|
| GUI-G2-3B (baseline) | 89.2% | 91.3% | 88.0% | 84.2% | 80.5% | 96.0% |
| GUI-G2-3B + CCF (this) | 88.9% | 91.3% | 88.0% | 88.1% | 82.7% | 93.7% |
| GUI-G2-3B + CCF + type-gate | 88.9% | 91.0% | 89.2% | 87.0% | 82.7% | 93.7% |
Headline numbers vs the unmodified base:
- Icon: +2.2pp (80.5% -> 82.7%) -- icons are GUI-G2-3B's hardest split, and the one customers most need help on
- Web: +3.9pp (84.2% -> 88.1%) -- web pages have the highest density of small clickable elements
- Text: -2.3pp (96.0% -> 93.7%) -- the cost of universal CCF; mitigated by the optional `--type-gate` flag
For comparison with other published 3B models on ScreenSpot-v2:
| Model | Overall | Notes |
|---|---|---|
| Jedi-3B | 88.6% | |
| UI-R1-E-3B | 89.5% | |
| GUI-G2-3B (our base) | 89.2% | |
| GUI-G2-3B + CCF (this) | 88.9% | +2.2pp on icons; inference-time only, no extra training |
| GUI-Actor-3B | 91.0% | (closed) |
Quickstart
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from cursor_ccf import CCFConfig, ccf_predict_bbox, classify_instruction
import torch
# Load the base model exactly as you would normally
model_id = "inclusionAI/GUI-G2-3B"
processor = AutoProcessor.from_pretrained(
model_id, min_pixels=3136, max_pixels=12_845_056,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
model.eval()
def predict_gui_g2(image, instruction):
"""Single forward pass returning ((cx, cy), raw_text). The prompt
matches GUI-G2's training format exactly so output coords are in
the processor's smart_resize space; we rescale to the original image."""
from qwen_vl_utils import process_vision_info
import re
prompt = (
"Outline the position corresponding to the instruction: {}. "
"The output should be only [x1,y1,x2,y2]."
).format(instruction)
messages = [{"role": "user", "content": [
{"type": "image", "image": image},
{"type": "text", "text": prompt},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(
text=[text], images=image_inputs, padding=True, return_tensors="pt",
).to(model.device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
response = processor.batch_decode(
[output[0][inputs.input_ids.shape[1]:]],
skip_special_tokens=True,
)[0]
m = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
if not m:
return (None, None), response
x1, y1, x2, y2 = map(int, m.groups())
abs_cx, abs_cy = (x1 + x2) / 2, (y1 + y2) / 2
# Rescale from processed-pixel space back to original-image pixels
proc_w = inputs["image_grid_thw"][0][2].item() * 14
proc_h = inputs["image_grid_thw"][0][1].item() * 14
orig_w, orig_h = image.size
return (abs_cx * orig_w / proc_w, abs_cy * orig_h / proc_h), response
# Plug into CCF
def predict_with_ccf(image, instruction, type_gate=True):
cfg = CCFConfig(
zoom_factor=2.0,
coarse_max_pixels=1_500_000,
instruction_classifier_fn=classify_instruction if type_gate else None,
)
def inner(img, instr):
(x, y), raw = predict_gui_g2(img, instr)
return (x, y) if x is not None else None, raw
result = ccf_predict_bbox(inner, image, instruction, cfg)
if result is None:
return None
return (result.x, result.y), result.stage
# Run it
image = Image.open("screenshot.png").convert("RGB")
(x, y), stage = predict_with_ccf(image, "click the settings icon")
print(f"Click at ({x:.0f}, {y:.0f}) [stage={stage}]")
See predict.py in this repo for a complete runnable example.
When to use which configuration
- Plain CCF (`type_gate=False`) — best for icon-heavy workloads (mobile app screenshots, dense web UIs). Maximum icon recall.
- CCF + type-gate (`type_gate=True`) — best for mixed text/icon workloads. Recovers the 1.2pp mobile loss at the cost of slightly lower web. Recommended default.
- No CCF (just the base model) — best for latency-critical paths where the +2.2pp icon win isn't worth a 2x inference cost.
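In code, the three configurations map onto the quickstart functions above roughly as follows (a usage sketch; the instruction strings are placeholders):

```python
# Icon-heavy workload (mobile screenshots, dense web UIs): ungated CCF.
point, stage = predict_with_ccf(image, "click the settings icon", type_gate=False)

# Mixed text/icon workload: gated CCF (recommended default).
point, stage = predict_with_ccf(image, "open the Display section", type_gate=True)

# Latency-critical path: skip CCF entirely and call the base model once.
point, raw = predict_gui_g2(image, "click the search bar")
```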
Latency
CCF doubles inference time per sample (two forward passes). The `coarse_max_pixels=1_500_000` setting brings the cost back closer to 1.3-1.5x baseline rather than 2x. On an H200 with flash-attn:
| Setup | Per-sample time (1920x1080 image) |
|---|---|
| Base model | ~3-5s |
| Base + CCF (full-res coarse) | ~8-15s |
| Base + CCF (coarse downsize, recommended) | ~5-9s |
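For intuition on where the savings come from, here is the resize arithmetic the coarse pixel budget implies (a sketch; `coarse_resize` is a hypothetical helper, and the exact rounding inside `cursor_ccf.py` may differ, e.g. snapping to the vision processor's patch grid):

```python
import math

def coarse_resize(width, height, max_pixels=1_500_000):
    """Scale (width, height) so the area fits within max_pixels while keeping
    the aspect ratio. Illustrative only; not the library's exact rounding."""
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    return int(width * scale), int(height * scale)

print(coarse_resize(1920, 1080))  # ~2.07M px -> (1632, 918), ~1.5M px
print(coarse_resize(3840, 2160))  # ~8.29M px -> (1632, 918), same budget
```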
Files in this repo
| File | Purpose |
|---|---|
| `cursor_ccf.py` | Core CCF logic + the type-aware classifier. Pure Python + PIL; no torch dependency for the math. |
| `predict.py` | Self-contained runnable example: loads GUI-G2-3B, applies CCF, prints predictions. |
| `requirements.txt` | Pinned dependency versions known to work with the model. |
Methodology notes (the engineering, not just the math)
The Phase 4 result in this repo is the only one of our findings, across a 9-phase project, that improved on the GUI-G2-3B base. Across that project we also tried:
- Multi-step cursor-movement RL (GUI-Cursor paper replication): -15pp at 3B scale
- Bbox SFT on 6K mixed-source samples: -7pp (catastrophic forgetting)
- 7B teacher distillation (GUI-G2-7B -> 3B): -1.5pp overall, +4.6pp web, +2.2pp icon, -4.5pp text
The pattern across all training experiments was the same: the hard splits (icon, web) improved at the cost of the easy splits (text), and overall accuracy never beat the base. The lesson we kept: at 3B scale with a few-thousand-sample fine-tuning budget, GUI-G2-3B is near its achievable optimum. Inference-time wrappers like CCF that don't touch the weights win the hard splits while giving up far less on the easy ones.
Full project writeup with per-experiment numbers: see benchmarks/results.md in the GitHub repo.
Citation
@misc{guig2_3b_ccf,
title = {GUI-G2-3B + CCF: inference-time icon refinement for screen grounding},
author = {Moncer, Luis F.},
year = {2026},
note = {Inference-time wrapper around inclusionAI/GUI-G2-3B; technique generalized from arXiv:2509.21552 (GUI-Cursor)}
}
@misc{guig2,
title = {GUI-G2-3B},
author = {inclusionAI},
year = {2025},
url = {https://huggingface.co/inclusionAI/GUI-G2-3B}
}
@misc{guicursor,
title = {GUI-Cursor: Cursor-Centric Focusing for GUI Grounding via Multi-Step RL},
year = {2025},
eprint = {2509.21552},
archiveprefix = {arXiv}
}
License
Apache 2.0 (matches the base GUI-G2-3B license).