File size: 5,158 Bytes

aff023a
618f2ab
 
aff023a
 
 
 
 
618f2ab
 
 
aff023a
 
 
 
 
 
618f2ab
 
 
aff023a
618f2ab
aff023a
618f2ab
aff023a
618f2ab
aff023a
618f2ab
 
 
 
aff023a
618f2ab
 
 
aff023a
 
 
 
618f2ab
aff023a
618f2ab
 
aff023a
618f2ab
aff023a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
618f2ab
 
 
 
aff023a
618f2ab
 
 
 
 
 
aff023a
 
 
 
 
 
 
 
 
 
 
 
 
618f2ab
aff023a
 
 
618f2ab
aff023a
 
 
 
 
 
 
 
 
618f2ab

---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- VLM
- GUI
- agent
---

# GUI-Libra-8B

[**Project Page**](https://GUI-Libra.github.io) | [**Paper**](https://huggingface.co/papers/2602.22190) | [**GitHub**](https://github.com/GUI-Libra/GUI-Libra)

GUI-Libra-8B is a native GUI agent model fine-tuned from [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct). It is designed to perceive screenshots, reason step-by-step, and output executable actions in a single forward pass.

The model is introduced in the paper [GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL](https://huggingface.co/papers/2602.22190).

## Introduction

GUI-Libra addresses key limitations in open-source GUI agents through three main contributions:
1.  **GUI-Libra-81K**: A curated reasoning dataset with 81,000 steps.
2.  **Action-Aware SFT**: A training strategy that balances chain-of-thought reasoning with visual grounding accuracy.
3.  **Conservative RL**: A KL-regularized GRPO approach tailored for GUI environments where rewards are only partially verifiable.

## Usage

### 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-8B --port 8000 --api-key token-abc123
```

*   Endpoint: `http://localhost:8000/v1`
*   The `api_key` here must match `--api-key`.

### 2) Minimal Python example

Install dependencies:
```bash
pip install -U openai
```

Create `minimal_infer.py`:

```python
import base64
from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-8B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_b64 = b64_image("screen.png")

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
    ## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
    ## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
    ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
    ## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
    ## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
# Note: Ensure img_size is defined or use default
question_description = '''Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {}

Interaction History: {}
'''
query = question_description.format(task_desc, prev_txt)

query = query + '
' + '''The response should be structured in the following format:
<thinking>Your step-by-step thought process here...</thinking>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```

## Citation

```bibtex
@misc{yang2026guilibratrainingnativegui,
      title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL}, 
      author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
      year={2026},
      eprint={2602.22190},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.22190}, 
}
```