---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLM
- GUI
- agent
---

# Introduction

Models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io

# Usage

## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The `api_key` passed by the client must match the server's `--api-key`.

## 2) Minimal Python example (prompt + image → request)

Install dependencies:

```bash
pip install -U openai pillow
```

Create `minimal_infer.py`:

```python
import base64

from openai import OpenAI
from PIL import Image  # pillow, used to read the screenshot's size

MODEL = "GUI-Libra/GUI-Libra-3B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_path = "screen.png"
img_b64 = b64_image(img_path)
img_size = Image.open(img_path).size  # (width, height)

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```

Run:

```bash
python minimal_infer.py
```

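The reply follows the `<think>…</think><answer>…</answer>` format requested in the prompt. A minimal sketch of extracting the action dictionary from such a reply (the `parse_action` helper and the regex-based extraction are our own, not part of the model's API):

```python
import json
import re

def parse_action(reply: str) -> dict:
    """Extract the JSON action from an <answer>...</answer> block."""
    match = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", reply, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found in model reply")
    return json.loads(match.group(1))

# Example reply in the requested format
reply = """<think>The search box is at the top of the page.</think>
<answer>
{
"action_type": "Click",
"action_target": "search box at the top of the page",
"value": "None",
"point_2d": [640, 42]
}
</answer>"""

action = parse_action(reply)
print(action["action_type"], action["point_2d"])
```

Note that real model output may be messier than this sample (e.g., a trailing comment after `point_2d`), so production parsing should be more defensive.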
---

## Notes

* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.
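In a multi-step run, the `prev_txt` variable in the example carries the interaction history between requests. One way to accumulate it, assuming a simple one-line-per-step text format (the format itself is our assumption, not prescribed by the model card):

```python
def append_history(prev_txt: str, step: int, action: dict) -> str:
    """Append one executed action to the interaction-history string."""
    line = "Step {}: {} on {} (value={}, point_2d={})".format(
        step, action["action_type"], action["action_target"],
        action["value"], action["point_2d"],
    )
    return prev_txt + line + "\n"

history = ""
history = append_history(history, 1, {
    "action_type": "Click",
    "action_target": "search box",
    "value": "None",
    "point_2d": [640, 42],
})
print(history)
```

The resulting string would be passed as `prev_txt` when formatting `question_description` for the next step.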