---
license: mit
base_model:
- microsoft/Florence-2-large
library_name: transformers
tags:
- GUI
- VLM
- Agent
- GUI-Grounding
---
# 🎯 GoClick-Base: Super-Fast, Lightweight GUI Grounding Expert
<div align="center">

[Code (GitHub)](https://github.com/ZJULiHongxin/GoClick) · [Paper (arXiv)](https://arxiv.org/abs/2604.23941) · [GoClick-Large](https://huggingface.co/HongxinLi/GoClick-Large) · [GoClick-Base](https://huggingface.co/HongxinLi/GoClick-Base) · [GoClick Coreset (3,814k)](https://huggingface.co/datasets/HongxinLi/GoClick_Coreset_3814k) · [SFT Data](https://huggingface.co/datasets/HongxinLi/GoClick_sft_data)

</div>
GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.
## 🏗️ Agent Architecture Overview
1. Stage 1 (Planning): analyze the UI screenshot and the user's goal, and output a description of the target element's function.
2. Stage 2 (Grounding): take the screenshot and the function description, and output precise pixel coordinates (a minimal sketch of the full flow follows the note below).

Note: this model is the specialized Stage 2 grounder, fine-tuned for extreme precision in locating elements based on their described functionality.
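The two stages compose into a simple pipeline, sketched below. This is an illustrative outline only: `planner` and `grounder` are hypothetical callables (any capable VLM can play the planner; this checkpoint is the grounder, invoked as in the Quick Start).

```python
# Illustrative two-stage flow (hypothetical helper names, not the authors' API).
def go_click(screenshot, goal, planner, grounder):
    # Stage 1 (Planning): turn the high-level goal into a functional
    # description of the target element, e.g. "the button that submits
    # the search query". Any capable VLM can serve as the planner.
    function_description = planner(screenshot, goal)
    # Stage 2 (Grounding): this model maps the description to the precise
    # center coordinates of the element.
    x, y = grounder(screenshot, function_description)
    return x, y
```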
## 🚀 Quick Start (Model Inference)
### Prerequisites

```bash
pip install transformers==4.45.0 timm
```
Note: model loading may fail with newer Transformers releases because of the custom Florence-2 code; if it does, downgrade to the pinned version above.
### Usage Example
```python
import re

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

def postprocess(text: str, image_size: tuple[int, int]):
    """Decode the model's generation into a click point.

    Args:
        text: a single generated sample
        image_size: size of the corresponding image (width, height)
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"
    try:
        location = re.findall(point_pattern, text)[0]
        point = [int(loc) for loc in location]
    except Exception:
        point = (0, 0)
    return point

# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load a UI screenshot and state the target
image = Image.open("ui_screenshot.png")
goal_info = "search for hotels"  # example target description

# Pick ONE of the grounding prompt templates below; this checkpoint is the
# Stage 2 grounder, so the prompt should describe the target element.
# Functionality grounding (for the AutoGUI FuncPred benchmark):
prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"
# Intent grounding (for RefExp, MOTIF, and VisualWebBench action grounding):
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"
# Description grounding (for ScreenSpot/v2 and VisualWebBench element grounding):
prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=128,
    use_cache=True,
)
text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output, image.size)
```
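The decoded point is expressed in `<loc_*>` token indices. Florence-2 quantizes coordinates into 1000 bins over each image dimension; assuming this checkpoint inherits that convention (worth verifying on your own screenshots), the point maps back to pixels like so:

```python
def to_pixels(point, image_size, num_bins=1000):
    # Assumes the Florence-2 convention of `num_bins` location bins per
    # image dimension; verify against your own outputs.
    x, y = point
    width, height = image_size
    return (round(x / (num_bins - 1) * width),
            round(y / (num_bins - 1) * height))

# Example: click_x, click_y = to_pixels(point, image.size)
```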
## 📊 Benchmarks
GoClick-Base achieves a strong tradeoff between GUI element grounding accuracy and inference latency (TTFT = time to first token, TPOT = time per output token; lower is better for both, and a rough way to measure them is sketched after the table):
| Model | Size | TTFT ↓ (ms) | TPOT ↓ (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
|-------|------|-------------|-------------------|--------------------|-------------------------|---------------------------|--------------|---------------|---------------|---------------|
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| **GoClick-L (ours)** | 0.8B | 91.1 | 8.3 | **69.5** | **78.5** | **81.1** | **80.4** | **78.2** | **90.3** | **68.0** |
| **GoClick-B (ours)** | 0.2B | **37.7** | **4.1** | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |
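For context, TTFT and TPOT can be estimated with a streaming `generate` call. The harness below is a rough sketch, not the authors' measurement setup; it reuses `model`, `processor`, and `inputs` from the Quick Start and assumes a batch size of 1.

```python
import time
from threading import Thread

from transformers import TextIteratorStreamer

def measure_latency(model, processor, inputs, max_new_tokens=128):
    # Stream decoded chunks so the first one can be timestamped (TTFT)
    # and the gaps between the remaining ones averaged (TPOT).
    streamer = TextIteratorStreamer(processor.tokenizer, skip_special_tokens=True)
    thread = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer,
                    do_sample=False, max_new_tokens=max_new_tokens),
    )
    start = time.perf_counter()
    thread.start()
    timestamps = [time.perf_counter() - start for _ in streamer]
    thread.join()
    ttft_ms = timestamps[0] * 1000
    tpot_ms = (timestamps[-1] - timestamps[0]) * 1000 / max(len(timestamps) - 1, 1)
    return ttft_ms, tpot_ms
```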
## 📚 Citation
If you use GoClick in your research, please cite our paper:
```bibtex
@misc{li2026goclicklightweightelementgrounding,
title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction},
author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
year={2026},
eprint={2604.23941},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.23941},
}
```