SAV-GSP: Speculative Action Verification with GUI State Prediction

A research project applying speculative decoding principles to GUI agents for faster computer-use automation.

Comprehensive Results

Key Results

Grounding Performance (ScreenSpot-Pro, n=100)

Metric Baseline Fine-tuned Change
Grounding Accuracy (dist<0.1) 39.0% 36.0% -3.0%
Mean Distance Error 0.242 0.256 +5.8%
Median Distance Error 0.125 0.165 +32%
IoU@50 3.0% 3.0% -
Coord Accuracy @5% 21.0% 18.0% -3.0%
Coord Accuracy @10% 39.0% 36.0% -3.0%
Action Type Accuracy 92.0% 94.0% +2.0%
Inference Time 3.61s 3.69s +2.2%
Tokens/Action 65.9 66.1 +0.3%

Speculation Acceptance Rate

Depth K Baseline SAR Fine-tuned SAR Improvement
K=1 90% 100% +10%
K=2 40% 100% +60%
K=3 20% 20% 0%

Key Finding: Fine-tuning dramatically improves SAR for speculation depth K≀2.

Speculation Results

Per-Category Accuracy

Element Type Baseline Fine-tuned
Icon 45.8% 41.7%
Text 32.7% 30.8%

Standard Metrics (Common in GUI Agent Papers)

Metric Description Our Result
Task Success Rate % tasks completed N/A (single-step)
Step Success Rate Per-action accuracy 39% (baseline)
Grounding Accuracy UI element localization 39%
Action Type Accuracy Correct action prediction 92%
Mean Distance Error Avg coord error (normalized) 0.242
IoU@50 % with IoU > 0.5 3%
Actions Per Second Throughput 0.277
Effective Batch Size Useful actions per inference 1.4 (K=2)

Installation

pip install transformers peft accelerate bitsandbytes pillow torch

Quick Start

Loading the Model

from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    trust_remote_code=True
)

# Load and merge LoRA
model = PeftModel.from_pretrained(model, "mariklolik228/sav-gsp-draft-lora")
model = model.merge_and_unload()

GUI Element Grounding

from PIL import Image
import re

def locate_element(image_path: str, instruction: str) -> tuple:
    image = Image.open(image_path).convert("RGB")
    
    prompt = f'''Look at this screenshot. Find and locate the element described below.

Element to find: {instruction}

Output your answer as coordinates in the format: (x, y) where x and y are normalized values between 0 and 1.

Coordinates:'''

    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}]
    
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    
    response = processor.decode(outputs[0], skip_special_tokens=True)
    
    match = re.search(r'\(([0-9.]+),\s*([0-9.]+)\)', response)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None

# Usage
coords = locate_element("screenshot.png", "close button")

Multi-Action Prediction

def predict_actions(image_path: str, task: str, k: int = 3):
    image = Image.open(image_path).convert("RGB")
    
    prompt = f'''Predict the next {k} actions to complete: {task}

For each action output: action_type, (x, y)'''

    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt}
    ]}]
    
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    
    return processor.decode(outputs[0], skip_special_tokens=True)

Comparison with SOTA

Method ScreenSpot-Pro OSWorld-G Notes
MAI-UI (SOTA, 2025) 73.5% 70.9% RL-optimized
SAV-GSP (ours) 39.0% - Zero-shot
Claude 3.5 Sonnet ~35% ~25% General VLM
Human ~95% ~72% Estimated

Limitations

  1. Grounding Gap: 34.5% below SOTA due to training on action prediction rather than grounding
  2. IoU Performance: Poor bounding box alignment (3% IoU@50)
  3. Platform: Trained primarily on Windows; Android/iOS/Web may vary
  4. Speculation Depth: K>2 shows no improvement

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Method: LoRA (r=16, Ξ±=32, dropout=0.05)
  • Data: ScreenSpot-Pro subset (~200 samples)
  • Epochs: 3
  • Batch Size: 2 (grad accum: 8)
  • Learning Rate: 2e-4
  • Hardware: NVIDIA GPU, 4-bit quantization

Files

β”œβ”€β”€ adapter_config.json      # LoRA configuration
β”œβ”€β”€ adapter_model.safetensors # LoRA weights
β”œβ”€β”€ README.md                 # This file
└── figures/
    β”œβ”€β”€ comparison.png        # Comprehensive results
    β”œβ”€β”€ grounding_accuracy.png
    └── speculation_depth.png

Citation

@misc{sav-gsp-2026,
  title={SAV-GSP: Speculative Action Verification with GUI State Prediction},
  author={Research Team},
  year={2026},
  howpublished={\url{https://huggingface.co/mariklolik228/sav-gsp-draft-lora}}
}

License

Apache 2.0

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for mariklolik228/sav-gsp-draft-lora

Adapter
(179)
this model