arxiv:2604.01676

GPA: Learning GUI Process Automation from Demonstrations

Published on Apr 2 · Submitted by Jun Hao Liew on Apr 3
Abstract

GUI Process Automation (GPA) offers robust, deterministic, and privacy-preserving vision-based robotic process automation with faster execution than current vision-language model approaches.

AI-generated summary

GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA) framework that enables fast and stable process replay from only a single demonstration. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision-language-model-based GUI agents, GPA introduces three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be exposed as an MCP/CLI tool to other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles the GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate with 10x faster execution on long-horizon GUI tasks.
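The Sequential Monte Carlo localization idea can be illustrated with a toy particle filter: maintain candidate screen positions for a UI element, weight them by a visual-similarity score, and resample. This is a minimal sketch under invented assumptions (the particle count, jitter, and `toy_score` similarity function are all illustrative stand-ins), not the paper's implementation:

```python
import random

def smc_localize(score_fn, n_particles=500, n_iters=10, bounds=(0, 1000, 0, 1000)):
    """Toy SMC localization: particles over candidate (x, y) positions,
    weighted by a similarity score, resampled with Gaussian jitter."""
    x0, x1, y0, y1 = bounds
    # Initialize particles uniformly over the screen region.
    particles = [(random.uniform(x0, x1), random.uniform(y0, y1))
                 for _ in range(n_particles)]
    for _ in range(n_iters):
        # Weight each particle by how well the recorded element matches there.
        weights = [score_fn(px, py) for px, py in particles]
        total = sum(weights) or 1e-12
        probs = [w / total for w in weights]
        # Resample proportionally to weight, then jitter (diffusion step).
        particles = [
            (px + random.gauss(0, 5), py + random.gauss(0, 5))
            for px, py in random.choices(particles, weights=probs, k=n_particles)
        ]
    # Report the particle mean as the localized position.
    xs = [p[0] for p in particles]
    ys = [p[1] for p in particles]
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Toy similarity: a Gaussian bump around a "true" element at (400, 300).
def toy_score(x, y):
    return 2.718281828 ** (-((x - 400) ** 2 + (y - 300) ** 2) / (2 * 40 ** 2))

random.seed(0)
x, y = smc_localize(toy_score)
```

Because the filter tracks a whole distribution rather than a single pixel coordinate, the same machinery can absorb rescaling and detector uncertainty by widening the particle spread, which is the property the abstract attributes to SMC-based localization.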

Community


GPA Workflow

GUI Process Automation (GPA) - from Salesforce AI Research

For product or enterprise use cases, please contact zirui.zhao@salesforce.com or junnan.li@salesforce.com.

What is GPA?

GPA is a demo-based RPA (Robotic Process Automation) framework for automating desktop GUI tasks on macOS.

The core idea: record a workflow once, replay it reliably — even when the UI changes slightly. Unlike traditional RPA tools that rely on pixel coordinates or brittle selectors, GPA uses lightweight local models to understand UI structure and locate elements robustly at replay time.

Key capabilities:

  • Record: Capture a GUI workflow as a sequence of user actions and screenshots
  • Build: LLM-powered analysis (done once at build time) converts the recording into a parameterized workflow template with named variables
  • Run: Action execution uses small-scale, locally-running visual detectors and feature extractors — no large vision-language model required at runtime. UI elements are matched via efficient embedding-based retrieval, keeping replay fast and privacy-preserving
  • Variables: Workflows accept runtime variable overrides (e.g. filenames, text content, search terms)
  • Loops: Run partial step ranges to support batched or iterative workflows
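The embedding-based retrieval in the Run step can be sketched as a nearest-neighbor match over element embeddings: each recorded UI element is compared against the elements detected on the current screen, and the most similar one is clicked. All names and vectors below are hypothetical stand-ins for a real feature extractor's output, not GPA's actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_element(target_emb, detected):
    """Return the (name, embedding) pair whose embedding is closest
    to the embedding recorded at demo time."""
    return max(detected, key=lambda item: cosine(target_emb, item[1]))

# Toy embeddings standing in for a local feature extractor's output.
recorded_save_button = [0.9, 0.1, 0.0]
on_screen = [
    ("cancel_button", [0.1, 0.9, 0.1]),
    ("save_button",   [0.8, 0.2, 0.1]),
    ("menu_icon",     [0.0, 0.1, 0.9]),
]
name, _ = match_element(recorded_save_button, on_screen)
# name == "save_button"
```

Matching by embedding similarity rather than by pixel coordinates is what lets replay tolerate small UI changes: the save button can move or be re-rendered and still be the nearest neighbor of the recorded embedding.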

