GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
Paper: arXiv:2510.16051
Lightweight vision–language model for GUI element localisation
GUIrilla-See-0.7B is a 0.7-billion-parameter model derived from Florence-2-large and fine-tuned for open-vocabulary detection in graphical user interface (GUI) screenshots. Given an image and a free-form textual description of a UI element, the model returns the location of the matching element as one or more labelled bounding boxes.
The model is intended for research on lightweight GUI agents, automated testing, and accessibility tools where a small footprint is preferred over larger models.
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# --- load pipeline -----------------------------------------------------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "MacPaw/GUIrilla-See-0.7B"  # 0.7 B parameters
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# --- inference ---------------------------------------------------------------
image = Image.open("screenshot.png").convert("RGB")
task_prompt = "<OPEN_VOCABULARY_DETECTION>"
text_query = "button with the label “Submit”"
prompt = task_prompt + text_query

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device, dtype)

with torch.no_grad():
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
        early_stopping=False,
    )

decoded = processor.batch_decode(ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    decoded, task=task_prompt, image_size=image.size
)["<OPEN_VOCABULARY_DETECTION>"]
```
Trained on the GUIrilla-Task dataset.

| Split | Success Rate (%) |
|---|---|
| Test | 53.55 |
License: MIT (see LICENSE).
```bibtex
@article{garkot2025guirilla,
  title={GUIrilla: A Scalable Framework for Automated Desktop UI Exploration},
  author={Garkot, Sofiya and Shamrai, Maksym and Synytsia, Ivan and Hirna, Mariya},
  journal={arXiv preprint arXiv:2510.16051},
  year={2025},
  url={https://arxiv.org/abs/2510.16051}
}
```
Base model: microsoft/Florence-2-large