# SAV-GSP: Speculative Action Verification with GUI State Prediction

A research project applying speculative decoding principles to GUI agents for faster computer-use automation.
## Key Results

### Grounding Performance (ScreenSpot-Pro, n=100)
| Metric | Baseline | Fine-tuned | Change |
|---|---|---|---|
| Grounding Accuracy (dist < 0.1) | 39.0% | 36.0% | -3.0 pp |
| Mean Distance Error | 0.242 | 0.256 | +5.8% |
| Median Distance Error | 0.125 | 0.165 | +32% |
| IoU@50 | 3.0% | 3.0% | 0 |
| Coord Accuracy @5% | 21.0% | 18.0% | -3.0 pp |
| Coord Accuracy @10% | 39.0% | 36.0% | -3.0 pp |
| Action Type Accuracy | 92.0% | 94.0% | +2.0 pp |
| Inference Time | 3.61 s | 3.69 s | +2.2% |
| Tokens/Action | 65.9 | 66.1 | +0.3% |
### Speculation Acceptance Rate (SAR)

| Depth K | Baseline SAR | Fine-tuned SAR | Improvement |
|---|---|---|---|
| K=1 | 90% | 100% | +10 pp |
| K=2 | 40% | 100% | +60 pp |
| K=3 | 20% | 20% | 0 |
**Key finding:** fine-tuning dramatically improves SAR at speculation depths K ≤ 2.
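To make the draft/verify cycle behind these SAR numbers concrete, here is a hypothetical sketch (not part of the released code; `draft_predict` and `verify` are placeholder callables): the draft model proposes K actions, and the verifier accepts the longest valid prefix, stopping at the first rejection.

```python
def run_speculative(draft_predict, verify, task, screenshot, k=2):
    """Hypothetical speculative-execution loop.

    draft_predict(screenshot, task, k) -> list of up to k candidate actions
    verify(screenshot, action)         -> True if the verifier accepts the action
    Returns the longest verified prefix of the speculated actions.
    """
    candidates = draft_predict(screenshot, task, k)
    accepted = []
    for action in candidates:
        if verify(screenshot, action):
            accepted.append(action)
        else:
            break  # first rejection invalidates the remaining speculation
    return accepted
```

With a perfect verifier this degenerates to step-by-step execution (K effective = 1); the SAR table above measures how often deeper prefixes survive verification.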
### Per-Category Accuracy
| Element Type | Baseline | Fine-tuned |
|---|---|---|
| Icon | 45.8% | 41.7% |
| Text | 32.7% | 30.8% |
### Standard Metrics (Common in GUI Agent Papers)
| Metric | Description | Our Result |
|---|---|---|
| Task Success Rate | % tasks completed | N/A (single-step) |
| Step Success Rate | Per-action accuracy | 39% (baseline) |
| Grounding Accuracy | UI element localization | 39% |
| Action Type Accuracy | Correct action prediction | 92% |
| Mean Distance Error | Avg coord error (normalized) | 0.242 |
| IoU@50 | % with IoU > 0.5 | 3% |
| Actions Per Second | Throughput | 0.277 |
| Effective Batch Size | Useful actions per inference | 1.4 (K=2) |
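One plausible reading of the "Effective Batch Size" of 1.4 at K=2 (an assumption; the card does not state the formula) is that the first action of each inference call always counts, and each speculative action at depth d ≥ 2 contributes its acceptance rate, so the baseline SAR of 40% at K=2 gives 1 + 0.40 = 1.4.

```python
def effective_batch_size(deep_sar):
    """Expected useful actions per inference call (assumed formula).

    The first action is always executed; each speculative action at
    depth d >= 2 contributes its acceptance rate from deep_sar.
    """
    return 1.0 + sum(deep_sar)

# Baseline SAR at depth 2 is 40% -> 1 + 0.4 = 1.4, matching the table.
print(effective_batch_size([0.40]))  # 1.4
```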
## Installation

```bash
pip install transformers peft accelerate bitsandbytes pillow torch
```
## Quick Start

### Loading the Model
```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization keeps the 7B model within a single consumer GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True,
)

# Load the LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(model, "mariklolik228/sav-gsp-draft-lora")
model = model.merge_and_unload()
```
### GUI Element Grounding
```python
from PIL import Image
import re

def locate_element(image_path: str, instruction: str) -> tuple | None:
    image = Image.open(image_path).convert("RGB")
    prompt = f'''Look at this screenshot. Find and locate the element described below.
Element to find: {instruction}
Output your answer as coordinates in the format: (x, y) where x and y are normalized values between 0 and 1.
Coordinates:'''
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = processor.decode(new_tokens, skip_special_tokens=True)

    match = re.search(r'\(([0-9.]+),\s*([0-9.]+)\)', response)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

# Usage
coords = locate_element("screenshot.png", "close button")
```
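The returned coordinates are normalized to [0, 1]; to actually click with an automation library you typically need pixel positions. A small helper for the conversion (hypothetical, not part of the released code):

```python
def to_pixels(coords, image_size):
    """Convert normalized (x, y) in [0, 1] to integer pixel coordinates.

    image_size follows PIL's Image.size convention: (width, height).
    """
    if coords is None:
        return None
    x, y = coords
    w, h = image_size
    return int(round(x * w)), int(round(y * h))

# e.g. to_pixels((0.5, 0.25), (1920, 1080)) -> (960, 270)
```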
### Multi-Action Prediction
```python
def predict_actions(image_path: str, task: str, k: int = 3) -> str:
    image = Image.open(image_path).convert("RGB")
    prompt = f'''Predict the next {k} actions to complete: {task}
For each action output: action_type, (x, y)'''
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {key: v.to(model.device) for key, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    return processor.decode(outputs[0], skip_special_tokens=True)
```
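The raw decoded string still has to be parsed into structured actions. A hedged sketch, assuming the model follows the `action_type, (x, y)` shape requested in the prompt (the actual output format may vary):

```python
import re

def parse_actions(response: str):
    """Extract (action_type, x, y) tuples from lines like 'click, (0.42, 0.17)'."""
    pattern = re.compile(r'(\w+)\s*,\s*\(([0-9.]+)\s*,\s*([0-9.]+)\)')
    return [(t, float(x), float(y)) for t, x, y in pattern.findall(response)]
```

Unparseable lines are silently skipped, which is a deliberate choice here: on a malformed response the speculation simply yields fewer candidate actions.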
## Comparison with SOTA
| Method | ScreenSpot-Pro | OSWorld-G | Notes |
|---|---|---|---|
| MAI-UI (SOTA, 2025) | 73.5% | 70.9% | RL-optimized |
| SAV-GSP (ours) | 39.0% | - | Zero-shot |
| Claude 3.5 Sonnet | ~35% | ~25% | General VLM |
| Human | ~95% | ~72% | Estimated |
## Limitations

- Grounding Gap: 34.5 percentage points below SOTA, likely because training targeted action prediction rather than grounding
- IoU Performance: poor bounding-box alignment (3% IoU@50)
- Platform: trained primarily on Windows screenshots; Android/iOS/Web results may vary
- Speculation Depth: K > 2 shows no SAR improvement
## Training Details

- Base Model: Qwen2.5-VL-7B-Instruct
- Method: LoRA (r=16, α=32, dropout=0.05)
- Data: ScreenSpot-Pro subset (~200 samples)
- Epochs: 3
- Batch Size: 2 (gradient accumulation: 8)
- Learning Rate: 2e-4
- Hardware: NVIDIA GPU, 4-bit quantization
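The hyperparameters above map onto a `peft` `LoraConfig` roughly as follows. This is a sketch: `target_modules` is an assumption (not stated in this card), based on common attention-projection targets for Qwen-family models.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # LoRA rank, as listed above
    lora_alpha=32,        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the card
    task_type="CAUSAL_LM",
)
```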
## Files

```
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights
├── README.md                  # This file
└── figures/
    ├── comparison.png         # Comprehensive results
    ├── grounding_accuracy.png
    └── speculation_depth.png
```
## Citation

```bibtex
@misc{sav-gsp-2026,
  title={SAV-GSP: Speculative Action Verification with GUI State Prediction},
  author={Research Team},
  year={2026},
  howpublished={\url{https://huggingface.co/mariklolik228/sav-gsp-draft-lora}}
}
```
## License

Apache 2.0
