File size: 4,138 Bytes

---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen3-VL-4B-Instruct
tags:
- VLM
- GUI
- agent
---

# Introduction

The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".


**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io  


# Usage
## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-4B --port 8000 --api-key token-abc123
````

* Endpoint: `http://localhost:8000/v1`
* The `api_key` here must match `--api-key`.


## 2) Minimal Python example (prompt + image → request)

Install dependencies:

```bash
pip install -U openai
```

Create `minimal_infer.py`:

```python
import base64
from openai import OpenAI

MODEL = "GUI-Libra/GUI-Libra-4B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_b64 = b64_image("screen.png")

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
    ## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
    ## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
    ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
    ## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
    ## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)

task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<thinking>Your step-by-step thought process here...</thinking>
<answer>
{
  "action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
  "action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful GUI agent."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": prompt},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```

Run:

```bash
python minimal_infer.py
```

---

## Notes

* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.