File size: 4,138 Bytes
aff023a 15ee285 dfbe75d aff023a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 | ---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen3-VL-8B-Instruct
tags:
- VLM
- GUI
- agent
---
# Introduction
The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".
**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io
# Usage
## 1) Start an OpenAI-compatible vLLM server
```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-8B --port 8000 --api-key token-abc123
````
* Endpoint: `http://localhost:8000/v1`
* The `api_key` here must match `--api-key`.
## 2) Minimal Python example (prompt + image → request)
Install dependencies:
```bash
pip install -U openai
```
Create `minimal_infer.py`:
```python
import base64
from openai import OpenAI
MODEL = "GUI-Libra/GUI-Libra-8B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")
def b64_image(path: str) -> str:
with open(path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
# 1) Your screenshot path
img_b64 = b64_image("screen.png")
system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates
action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu
action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None
action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard
action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""
# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)
query = query + '\n' + '''The response should be structured in the following format:
<thinking>Your step-by-step thought process here...</thinking>
<answer>
{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''
resp = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "You are a helpful GUI agent."},
{"role": "user", "content": [
{"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
{"type": "text", "text": prompt},
]},
],
temperature=0.0,
max_completion_tokens=1024,
)
print(resp.choices[0].message.content)
```
Run:
```bash
python minimal_infer.py
```
---
## Notes
* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.
|