# SAV-GSP: Speculative Action Verification with GUI State Prediction

A research project applying speculative decoding principles to GUI agents for faster computer-use automation.
## Key Results

### Grounding Performance (ScreenSpot-Pro, n=100)
| Metric | Baseline | Fine-tuned | Change |
|---|---|---|---|
| Grounding Accuracy (dist < 0.1) | 39.0% | 36.0% | -3.0 pp |
| Mean Distance Error | 0.242 | 0.256 | +5.8% |
| Median Distance Error | 0.125 | 0.165 | +32% |
| IoU@50 | 3.0% | 3.0% | 0 |
| Coord Accuracy @5% | 21.0% | 18.0% | -3.0 pp |
| Coord Accuracy @10% | 39.0% | 36.0% | -3.0 pp |
| Action Type Accuracy | 92.0% | 94.0% | +2.0 pp |
| Inference Time | 3.61 s | 3.69 s | +2.2% |
| Tokens/Action | 65.9 | 66.1 | +0.3% |
### Speculation Acceptance Rate (SAR)

| Depth K | Baseline SAR | Fine-tuned SAR | Improvement |
|---|---|---|---|
| K=1 | 90% | 100% | +10 pp |
| K=2 | 40% | 100% | +60 pp |
| K=3 | 20% | 20% | 0 |
**Key finding:** fine-tuning dramatically improves SAR at speculation depths K ≤ 2.
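To make the draft/verify cycle behind these SAR numbers concrete, here is a hypothetical sketch (not part of the released code; `draft_predict` and `verify` are placeholder callables): the draft model proposes K actions, and the verifier accepts the longest valid prefix, stopping at the first rejection.

```python
def run_speculative(draft_predict, verify, task, screenshot, k=2):
    """Hypothetical speculative-execution loop.

    draft_predict(screenshot, task, k) -> list of up to k candidate actions
    verify(screenshot, action)         -> True if the verifier accepts the action
    Returns the longest verified prefix of the speculated actions.
    """
    candidates = draft_predict(screenshot, task, k)
    accepted = []
    for action in candidates:
        if verify(screenshot, action):
            accepted.append(action)
        else:
            break  # first rejection invalidates the remaining speculation
    return accepted
```

With a perfect verifier this degenerates to step-by-step execution (K effective = 1); the SAR table above measures how often deeper prefixes survive verification.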
### Per-Category Accuracy
| Element Type | Baseline | Fine-tuned |
|---|---|---|
| Icon | 45.8% | 41.7% |
| Text | 32.7% | 30.8% |
### Standard Metrics (Common in GUI Agent Papers)
| Metric | Description | Our Result |
|---|---|---|
| Task Success Rate | % tasks completed | N/A (single-step) |
| Step Success Rate | Per-action accuracy | 39% (baseline) |
| Grounding Accuracy | UI element localization | 39% |
| Action Type Accuracy | Correct action prediction | 92% |
| Mean Distance Error | Avg coord error (normalized) | 0.242 |
| IoU@50 | % with IoU > 0.5 | 3% |
| Actions Per Second | Throughput | 0.277 |
| Effective Batch Size | Useful actions per inference | 1.4 (K=2) |
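One plausible reading of the "Effective Batch Size" of 1.4 at K=2 (an assumption; the card does not state the formula) is that the first action of each inference call always counts, and each speculative action at depth d ≥ 2 contributes its acceptance rate, so the baseline SAR of 40% at K=2 gives 1 + 0.40 = 1.4.

```python
def effective_batch_size(deep_sar):
    """Expected useful actions per inference call (assumed formula).

    The first action is always executed; each speculative action at
    depth d >= 2 contributes its acceptance rate from deep_sar.
    """
    return 1.0 + sum(deep_sar)

# Baseline SAR at depth 2 is 40% -> 1 + 0.4 = 1.4, matching the table.
print(effective_batch_size([0.40]))  # 1.4
```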
## Installation

```bash
pip install transformers peft accelerate bitsandbytes pillow torch
```
## Quick Start

### Loading the Model
```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization keeps the 7B model within a single consumer GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True,
)

# Load the LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(model, "mariklolik228/sav-gsp-draft-lora")
model = model.merge_and_unload()
```
### GUI Element Grounding
```python
from PIL import Image
import re

def locate_element(image_path: str, instruction: str) -> tuple | None:
    image = Image.open(image_path).convert("RGB")
    prompt = f'''Look at this screenshot. Find and locate the element described below.
Element to find: {instruction}
Output your answer as coordinates in the format: (x, y) where x and y are normalized values between 0 and 1.
Coordinates:'''
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = processor.decode(new_tokens, skip_special_tokens=True)

    match = re.search(r'\(([0-9.]+),\s*([0-9.]+)\)', response)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

# Usage
coords = locate_element("screenshot.png", "close button")
```
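The returned coordinates are normalized to [0, 1]; to actually click with an automation library you typically need pixel positions. A small helper for the conversion (hypothetical, not part of the released code):

```python
def to_pixels(coords, image_size):
    """Convert normalized (x, y) in [0, 1] to integer pixel coordinates.

    image_size follows PIL's Image.size convention: (width, height).
    """
    if coords is None:
        return None
    x, y = coords
    w, h = image_size
    return int(round(x * w)), int(round(y * h))

# e.g. to_pixels((0.5, 0.25), (1920, 1080)) -> (960, 270)
```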
### Multi-Action Prediction
```python
def predict_actions(image_path: str, task: str, k: int = 3) -> str:
    image = Image.open(image_path).convert("RGB")
    prompt = f'''Predict the next {k} actions to complete: {task}
For each action output: action_type, (x, y)'''
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {key: v.to(model.device) for key, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    return processor.decode(outputs[0], skip_special_tokens=True)
```
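The raw decoded string still has to be parsed into structured actions. A hedged sketch, assuming the model follows the `action_type, (x, y)` shape requested in the prompt (the actual output format may vary):

```python
import re

def parse_actions(response: str):
    """Extract (action_type, x, y) tuples from lines like 'click, (0.42, 0.17)'."""
    pattern = re.compile(r'(\w+)\s*,\s*\(([0-9.]+)\s*,\s*([0-9.]+)\)')
    return [(t, float(x), float(y)) for t, x, y in pattern.findall(response)]
```

Unparseable lines are silently skipped, which is a deliberate choice here: on a malformed response the speculation simply yields fewer candidate actions.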
## Comparison with SOTA
| Method | ScreenSpot-Pro | OSWorld-G | Notes |
|---|---|---|---|
| MAI-UI (SOTA, 2025) | 73.5% | 70.9% | RL-optimized |
| SAV-GSP (ours) | 39.0% | - | Zero-shot |
| Claude 3.5 Sonnet | ~35% | ~25% | General VLM |
| Human | ~95% | ~72% | Estimated |
## Limitations

- Grounding Gap: 34.5 percentage points below SOTA, likely because training targeted action prediction rather than grounding
- IoU Performance: poor bounding-box alignment (3% IoU@50)
- Platform: trained primarily on Windows screenshots; Android/iOS/Web results may vary
- Speculation Depth: K > 2 shows no SAR improvement
## Training Details

- Base Model: Qwen2.5-VL-7B-Instruct
- Method: LoRA (r=16, α=32, dropout=0.05)
- Data: ScreenSpot-Pro subset (~200 samples)
- Epochs: 3
- Batch Size: 2 (gradient accumulation: 8)
- Learning Rate: 2e-4
- Hardware: NVIDIA GPU, 4-bit quantization
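The hyperparameters above map onto a `peft` `LoraConfig` roughly as follows. This is a sketch: `target_modules` is an assumption (not stated in this card), based on common attention-projection targets for Qwen-family models.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # LoRA rank, as listed above
    lora_alpha=32,        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated in the card
    task_type="CAUSAL_LM",
)
```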
## Files

```
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # LoRA weights
├── README.md                  # This file
└── figures/
    ├── comparison.png         # Comprehensive results
    ├── grounding_accuracy.png
    └── speculation_depth.png
```
## Citation

```bibtex
@misc{sav-gsp-2026,
  title={SAV-GSP: Speculative Action Verification with GUI State Prediction},
  author={Research Team},
  year={2026},
  howpublished={\url{https://huggingface.co/mariklolik228/sav-gsp-draft-lora}}
}
```
## License

Apache 2.0
