---
base_model:
  - Qwen/Qwen2.5-VL-7B-Instruct
datasets:
  - GUI-Libra/GUI-Libra-81K-RL
  - GUI-Libra/GUI-Libra-81K-SFT
language:
  - en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - VLM
  - GUI
  - agent
---

# GUI-Libra-7B

GUI-Libra is a native GUI agent model designed to reason and act based on UI screenshots. It is presented in the paper GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL.

- GitHub: GUI-Libra/GUI-Libra
- Website: GUI-Libra Project Page

## Introduction

GUI-Libra is a post-training framework that transforms open-source VLMs into strong native GUI agents. These models can perceive a screenshot, think step-by-step using Chain-of-Thought (CoT), and output executable actions within a single forward pass. This version is based on the Qwen2.5-VL-7B-Instruct architecture.

The framework addresses three main challenges in GUI agent training:

1. Scarcity of action-aligned reasoning data: mitigated with the GUI-Libra-81K dataset.
2. Grounding vs. reasoning: balanced via action-aware SFT that weighs the thought process against coordinate accuracy.
3. Partial verifiability: addressed with conservative RL (KL-regularized GRPO).
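The "conservative RL" step above can be made concrete with the standard KL-regularized GRPO objective. The following is a sketch of the usual formulation, with notation assumed rather than taken from the paper: for a group of $G$ sampled responses with rewards $r_1, \dots, r_G$,

```latex
\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)},
\qquad
\mathcal{J}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
\min\!\big( \rho_i \hat{A}_i,\; \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big)
\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right]
```

where $\rho_i$ is the policy ratio against the old policy and $\beta$ scales the KL penalty toward the reference policy; a larger $\beta$ keeps updates conservative when rewards are only partially verifiable.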

## Usage

### 1) Start an OpenAI-compatible vLLM server

```shell
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-7B --port 8000 --api-key token-abc123
```

- Endpoint: `http://localhost:8000/v1`
- The `api_key` passed to the client must match `--api-key`.

### 2) Minimal Python example (prompt + image → request)

Install dependencies:

```shell
pip install -U openai
```

Create `minimal_infer.py`:

```python
import base64
from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-7B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_b64 = b64_image("screen.png")

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
    ## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
    ## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
    ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
    ## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
    ## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)

task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
# Note: replace img_size with your screenshot dimensions, e.g., [1920, 1080]
img_size = [1920, 1080]
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.

Instruction: {}

Interaction History: {}
'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```
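The prompt above asks the model to wrap its reasoning in `<think>...</think>` and the action in `<answer>...</answer>` containing a JSON object. A minimal sketch of extracting the action from such a response follows; `parse_response` and the sample string are illustrative helpers, not part of the released code:

```python
import json
import re

def parse_response(text: str) -> dict:
    """Extract the JSON action from a <think>...</think><answer>...</answer> response.

    Assumes the <answer> block contains a single flat JSON object, as in the
    output format requested by the prompt above.
    """
    match = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found in model response")
    return json.loads(match.group(1))

# Hypothetical model output, matching the requested format
sample = """<think>The search box is at the top of the page.</think>
<answer>
{
  "action_type": "Click",
  "action_target": "search box at the top of the page",
  "value": "None",
  "point_2d": [512, 64]
}
</answer>"""

action = parse_response(sample)
print(action["action_type"], action["point_2d"])  # Click [512, 64]
```

Note that the model reasons over the original image size passed in the prompt, so `point_2d` coordinates are already in screenshot pixel space.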

## Citation

```bibtex
@misc{yang2026guilibratrainingnativegui,
      title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL},
      author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
      year={2026},
      eprint={2602.22190},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.22190},
}
```