---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- VLM
- GUI
- agent
---
# Introduction
This repository hosts the models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".
**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io
# Usage
## 1) Start an OpenAI-compatible vLLM server
```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-7B --port 8000 --api-key token-abc123
```
* Endpoint: `http://localhost:8000/v1`
* The `api_key` you pass from the client must match the server's `--api-key` value.
## 2) Minimal Python example (prompt + image → request)
Install dependencies:
```bash
pip install -U openai pillow
```
Create `minimal_infer.py`:
```python
import base64

from openai import OpenAI
from PIL import Image

MODEL = "GUI-Libra/GUI-Libra-7B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")


def b64_image(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# 1) Your screenshot path
IMAGE_PATH = "screen.png"
img_b64 = b64_image(IMAGE_PATH)
img_size = Image.open(IMAGE_PATH).size  # (width, height)

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates
action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu
action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None
action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard
action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)
query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```
Run:
```bash
python minimal_infer.py
```
---
## Notes
* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.