---
license: mit
base_model:
- microsoft/Florence-2-large
library_name: transformers
tags:
- GUI
- VLM
- Agent
- GUI-Grounding
---
# 🎯 GoClick-Base: Super Fast Lightweight GUI Grounding Expert
<div align="center">

[Code](https://github.com/ZJULiHongxin/GoClick) | [Paper](https://arxiv.org/abs/2604.23941) | [GoClick-Large](https://huggingface.co/HongxinLi/GoClick-Large) | [GoClick-Base](https://huggingface.co/HongxinLi/GoClick-Base) | [Coreset (3814k)](https://huggingface.co/datasets/HongxinLi/GoClick_Coreset_3814k) | [SFT Data](https://huggingface.co/datasets/HongxinLi/GoClick_sft_data)

</div>
GoClick is a state-of-the-art two-stage framework for precise UI element grounding. Built on the Florence-2 architecture, it bridges the gap between high-level intent and low-level pixel coordinates by separating the Planning and Grounding tasks.
## 🏗️ Agent Architecture Overview

1. Stage 1 (Planning): Analyze UI screenshot + Goal -> Output Function Description.
2. Stage 2 (Grounding): Screenshot + Function Description -> Output Precise Coordinates.

Note: This model is the specialized Stage 2 Grounder, fine-tuned for extreme precision in locating elements based on their described functionality.
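The two-stage flow described above can be sketched in plain Python. `call_planner` and `call_grounder` are hypothetical stand-ins for the actual model calls; only the data flow between the stages is taken from the description:

```python
# Sketch of the two-stage GoClick flow. call_planner and call_grounder are
# hypothetical stand-ins for real model calls; only the data flow is real.
def build_grounding_prompt(function_desc: str) -> str:
    # Stage 2 consumes the functionality description produced by Stage 1.
    return (
        "Locate the element according to its detailed functionality description. "
        f"{function_desc} (Output the center coordinates of the target)"
    )

def two_stage_ground(screenshot, goal, call_planner, call_grounder):
    # Stage 1 (Planning): screenshot + goal -> function description.
    function_desc = call_planner(screenshot, goal)
    # Stage 2 (Grounding): screenshot + description -> (x, y) coordinates.
    return call_grounder(screenshot, build_grounding_prompt(function_desc))
```

Separating the stages lets the grounder stay small: it never has to reason about the goal, only about where a described element sits on screen.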
## 🚀 Quick Start (Model Inference)

### Prerequisites

```
pip install transformers==4.45.0 timm
```

Note: The Transformers version should not be too recent. Downgrade (e.g. to 4.45.0) if model loading fails.
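Because the exact version ceiling is undocumented, a small guard can fail fast before loading the model. The `4.45.0` pin below is an assumption taken from the install command above, not a documented requirement:

```python
# Warn when the installed transformers version is newer than the pinned one.
# Assumption: the 4.45.0 pin comes from the install command above; the true
# ceiling is undocumented, so treat this as a heuristic, not a hard rule.
def parse_version(v: str) -> tuple:
    """Turn '4.45.0' into (4, 45, 0) for component-wise comparison.

    Assumes plain numeric versions (no 'rc'/'dev' suffixes).
    """
    return tuple(int(part) for part in v.split(".")[:3])

def is_newer_than_pin(installed: str, pinned: str = "4.45.0") -> bool:
    return parse_version(installed) > parse_version(pinned)
```

For example, `is_newer_than_pin(transformers.__version__)` returning `True` suggests downgrading before debugging a load failure.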
### Usage Example

```
import re

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

def postprocess(text: str, image_size: tuple[int, int]):
    """Decode the model's generation into a point.

    Args:
        text: a single generated sample.
        image_size: size of the corresponding image (useful for rescaling
            the decoded coordinates to pixels).
    """
    point_pattern = r"<loc_(\d+)>,<loc_(\d+)>"
    point = (0, 0)  # fall back to the origin if no location tokens are found
    try:
        location = re.findall(point_pattern, text)[0]
        point = [int(loc) for loc in location]
    except Exception:
        pass
    return point

# Load model and processor
model = AutoModelForCausalLM.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("HongxinLi/GoClick-Base", trust_remote_code=True)

# Load UI screenshot
image = Image.open("ui_screenshot.png")

goal_info = "search for flights"  # the target functionality, intent, or element description

# Grounding prompts -- pick the template matching your use case:
# Functionality grounding (for the AutoGUI FuncPred benchmark)
prompt = f"Locate the element according to its detailed functionality description. {goal_info} (Output the center coordinates of the target)"
# Intent grounding (for RefExp, MOTIF, and VisualWebBench Action Grounding)
prompt = f"I want to {goal_info}. Please locate the target element I should interact with. (Output the center coordinates of the target)"
# Description grounding (for ScreenSpot/v2 and VisualWebBench Element Grounding)
prompt = f"Where is the {goal_info} element? (Output the center coordinates of the target)"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt",
    do_resize=True,
).to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=128,
    use_cache=True,
)

text_output = processor.tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
point = postprocess(text_output, image.size)
```
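The parsed point is in the model's `<loc_k>` token space, not pixels. Assuming Florence-2's convention of quantizing coordinates into 1000 bins over the image (an assumption worth verifying against your own outputs), a small helper maps it back to pixel coordinates:

```python
# Map an (x, y) pair parsed from <loc_x>,<loc_y> tokens back to pixels.
# Assumption: coordinates are quantized into `num_bins` bins over the image,
# as in Florence-2; verify against your own outputs before relying on it.
def loc_to_pixels(point, image_size, num_bins=1000):
    w, h = image_size
    x, y = point
    # +0.5 targets the center of the bin rather than its left/top edge.
    return (int((x + 0.5) / num_bins * w), int((y + 0.5) / num_bins * h))
```

For instance, `loc_to_pixels((500, 500), image.size)` gives the screen point to click for a centered element.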
## 📊 Benchmarks

GoClick-Base also achieves a good tradeoff between GUI element grounding accuracy and inference latency:

| Model | Size | TTFT ↓ (ms) | TPOT ↓ (ms/token) | FuncPred (F; M, W) | ScreenSpot (B; M, W, D) | ScreenSpot-v2 (B; M, W, D) | MOTIF (I; M) | RefExp (I; M) | VWB EG (T; W) | VWB AG (I; W) |
|-------|------|-------------|-------------------|--------------------|-------------------------|---------------------------|--------------|---------------|---------------|---------------|
| GPT-4o | - | - | - | 9.8 | 17.8 | 20.4 | 30.5 | 21.8 | 5.6 | 6.8 |
| Qwen2VL-7B | 8B | 118.9 | 21.2 | 38.7 | 66.4 | 66.9 | 75.1 | 64.8 | 55.9 | 62.1 |
| CogAgent | 18B | 1253.2 | 208.8 | 29.3 | 47.4 | 49.2 | 46.7 | 35.0 | 55.7 | 59.2 |
| SeeClick | 10B | 160.4 | 184.4 | 19.8 | 53.4 | 54.0 | 11.1 | 58.1 | 39.2 | 27.2 |
| Ferret-UI | 8B | 152.5 | 22.9 | 1.2 | 7.1 | 7.8 | 15.9 | 5.5 | 3.9 | 1.9 |
| UGround | 7B | 1034.6 | 27.9 | 48.8 | 74.8 | 76.5 | 72.4 | 73.6 | 85.2 | 63.1 |
| OS-ATLAS-8B | 8B | 137.5 | 19.9 | 52.1 | 82.5 | 84.1 | 78.8 | 66.5 | 82.6 | 69.9 |
| Aguvis | 8B | 119.7 | 21.2 | 52.0 | 83.8 | 85.6 | 73.8 | 80.9 | 91.3 | 68.0 |
| Qwen2-VL | 2B | 58.8 | 16.4 | 7.1 | 17.9 | 18.6 | 28.8 | 29.2 | 17.9 | 17.5 |
| OS-ATLAS-4B | 4B | 137.3 | 31.4 | 44.6 | 66.8 | 68.7 | 75.4 | 77.1 | 47.7 | 58.3 |
| Ferret-UI | 3B | 69.5 | 9.8 | 1.3 | 2.1 | 1.9 | 5.5 | 1.1 | 0.7 | 1.0 |
| ShowUI | 2B | 79.7 | 14.7 | 39.9 | 76.1 | 77.4 | 72.3 | 58.4 | 64.2 | 55.3 |
| **GoClick-L (ours)** | 0.8B | 91.1 | 8.3 | **69.5** | **78.5** | **81.1** | **80.4** | **78.2** | **90.3** | **68.0** |
| **GoClick-B (ours)** | 0.2B | **37.7** | **4.1** | 64.4 | 74.1 | 75.2 | 76.8 | 71.9 | 90.3 | 61.2 |
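TTFT and TPOT combine into a rough end-to-end latency estimate. Assuming latency ≈ TTFT + TPOT × generated tokens (ignoring image preprocessing and decoding overhead, which the table's metrics do not cover), a short coordinate answer from GoClick-B lands well under 100 ms:

```python
# Rough generation latency from the table's metrics.
# Assumption: latency ~= TTFT + TPOT * generated_tokens; pre/post-processing
# and any network overhead are ignored.
def estimated_latency_ms(ttft_ms: float, tpot_ms: float, num_tokens: int) -> float:
    return ttft_ms + tpot_ms * num_tokens

# GoClick-B (TTFT 37.7 ms, TPOT 4.1 ms/token), ~10-token coordinate answer:
# 37.7 + 4.1 * 10 = 78.7 ms
```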
## 📖 Citation

If you use GoClick in your research, please cite our paper:

```
@misc{li2026goclicklightweightelementgrounding,
      title={GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction},
      author={Hongxin Li and Yuntao Chen and Zhaoxiang Zhang},
      year={2026},
      eprint={2604.23941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.23941},
}
```