arxiv:2604.01676

GPA: Learning GUI Process Automation from Demonstrations

Published on Apr 2 · Submitted by Jun Hao Liew on Apr 3
Abstract

GUI Process Automation (GPA) offers robust, deterministic, and privacy-preserving vision-based robotic process automation with faster execution than current vision-language model approaches.

AI-generated summary

GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) framework that enables fast and stable process replay from only a single demonstration. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision-language-model-based GUI agents, GPA introduces three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be exposed as an MCP/CLI tool to other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles the GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate with 10x faster execution on long-horizon GUI tasks.
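The Sequential Monte Carlo localization idea can be illustrated with a toy particle filter: maintain candidate screen positions for a UI element, weight them by a visual-similarity score, and resample. This is a minimal sketch under invented assumptions (the particle count, jitter, and `toy_score` similarity function are all illustrative stand-ins), not the paper's implementation:

```python
import random

def smc_localize(score_fn, n_particles=500, n_iters=10, bounds=(0, 1000, 0, 1000)):
    """Toy SMC localization: particles over candidate (x, y) positions,
    weighted by a similarity score, resampled with Gaussian jitter."""
    x0, x1, y0, y1 = bounds
    # Initialize particles uniformly over the screen region.
    particles = [(random.uniform(x0, x1), random.uniform(y0, y1))
                 for _ in range(n_particles)]
    for _ in range(n_iters):
        # Weight each particle by how well the recorded element matches there.
        weights = [score_fn(px, py) for px, py in particles]
        total = sum(weights) or 1e-12
        probs = [w / total for w in weights]
        # Resample proportionally to weight, then jitter (diffusion step).
        particles = [
            (px + random.gauss(0, 5), py + random.gauss(0, 5))
            for px, py in random.choices(particles, weights=probs, k=n_particles)
        ]
    # Report the particle mean as the localized position.
    xs = [p[0] for p in particles]
    ys = [p[1] for p in particles]
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Toy similarity: a Gaussian bump around a "true" element at (400, 300).
def toy_score(x, y):
    return 2.718281828 ** (-((x - 400) ** 2 + (y - 300) ** 2) / (2 * 40 ** 2))

random.seed(0)
x, y = smc_localize(toy_score)
```

Because the filter tracks a whole distribution rather than a single pixel coordinate, the same machinery can absorb rescaling and detector uncertainty by widening the particle spread, which is the property the abstract attributes to SMC-based localization.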

Community


GPA Workflow

GUI Process Automation (GPA) - from Salesforce AI Research

For product or enterprise use cases, please contact zirui.zhao@salesforce.com or junnan.li@salesforce.com.

What is GPA?

GPA is a demo-based RPA (Robotic Process Automation) framework for automating desktop GUI tasks on macOS.

The core idea: record a workflow once, replay it reliably — even when the UI changes slightly. Unlike traditional RPA tools that rely on pixel coordinates or brittle selectors, GPA uses lightweight local models to understand UI structure and locate elements robustly at replay time.

Key capabilities:

  • Record: Capture a GUI workflow as a sequence of user actions and screenshots
  • Build: LLM-powered analysis (done once at build time) converts the recording into a parameterized workflow template with named variables
  • Run: Action execution uses small-scale, locally-running visual detectors and feature extractors — no large vision-language model required at runtime. UI elements are matched via efficient embedding-based retrieval, keeping replay fast and privacy-preserving
  • Variables: Workflows accept runtime variable overrides (e.g. filenames, text content, search terms)
  • Loops: Run partial step ranges to support batched or iterative workflows
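The embedding-based retrieval in the Run step can be sketched as a nearest-neighbor match over element embeddings: each recorded UI element is compared against the elements detected on the current screen, and the most similar one is clicked. All names and vectors below are hypothetical stand-ins for a real feature extractor's output, not GPA's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_element(target_emb, detected):
    """Return the (name, embedding) pair whose embedding is closest
    to the embedding recorded at demo time."""
    return max(detected, key=lambda item: cosine(target_emb, item[1]))

# Toy embeddings standing in for a local feature extractor's output.
recorded_save_button = [0.9, 0.1, 0.0]
on_screen = [
    ("cancel_button", [0.1, 0.9, 0.1]),
    ("save_button",   [0.8, 0.2, 0.1]),
    ("menu_icon",     [0.0, 0.1, 0.9]),
]
name, _ = match_element(recorded_save_button, on_screen)
# name == "save_button"
```

Matching by embedding similarity rather than by pixel coordinates is what lets replay tolerate small UI changes: the save button can move or be re-rendered and still be the nearest neighbor of the recorded embedding.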

