---
license: apache-2.0
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- VLM
- GUI
- agent
---

# Introduction

Models from the paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

**GitHub:** https://github.com/GUI-Libra/GUI-Libra
**Website:** https://GUI-Libra.github.io

# Usage

## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The `api_key` passed by the client must match the server's `--api-key`.

## 2) Minimal Python example (prompt + image → request)

Install dependencies:

```bash
pip install -U openai pillow
```

Create `minimal_infer.py`:

```python
import base64

from openai import OpenAI
from PIL import Image  # pillow, used to read the screenshot's size

MODEL = "GUI-Libra/GUI-Libra-3B"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# 1) Your screenshot path
img_path = "screen.png"
img_b64 = b64_image(img_path)
img_size = Image.open(img_path).size  # (width, height)

system_prompt = """You are a GUI agent. You are given a task and a screenshot of the screen. You need to choose actions from the following list:
action_type: Click, action_target: Element description, value: None, point_2d: [x, y]
## Explanation: Tap or click a specific UI element and provide its coordinates

action_type: Select, action_target: Element description, value: Value to select, point_2d: [x, y] or None
## Explanation: Select an item from a list or dropdown menu

action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None
## Explanation: Enter text into a specific input field or at the current focus if coordinate is None

action_type: KeyboardPress, action_target: None, value: Key name (e.g., "enter"), point_2d: None
## Explanation: Press a specified key on the keyboard

action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None
## Explanation: Scroll a view or container in the specified direction
"""

# 2) Your prompt (instruction + desired output format)
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
"action_type": "the type of action to perform, e.g., Click, Write, Scroll, Answer, etc. Please follow the system prompt for available actions.",
"action_target": "the description of the target of the action, such as the color, text, or position on the screen of the UI element to interact with",
"value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
"point_2d": [x, y] # the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]
}
</answer>'''

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    max_completion_tokens=1024,
)

print(resp.choices[0].message.content)
```

Run:

```bash
python minimal_infer.py
```

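The reply follows the `<think>…</think><answer>…</answer>` format requested in the prompt. A minimal sketch of extracting the action dictionary from such a reply (the `parse_action` helper and the regex-based extraction are our own, not part of the model's API):

```python
import json
import re

def parse_action(reply: str) -> dict:
    """Extract the JSON action from an <answer>...</answer> block."""
    match = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", reply, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found in model reply")
    return json.loads(match.group(1))

# Example reply in the requested format
reply = """<think>The search box is at the top of the page.</think>
<answer>
{
"action_type": "Click",
"action_target": "search box at the top of the page",
"value": "None",
"point_2d": [640, 42]
}
</answer>"""

action = parse_action(reply)
print(action["action_type"], action["point_2d"])
```

Note that real model output may be messier than this sample (e.g., a trailing comment after `point_2d`), so production parsing should be more defensive.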
---

## Notes

* Replace `screen.png` with your own screenshot file.
* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
* The example assumes your vLLM server is running locally on port `8000`.
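In a multi-step run, the `prev_txt` variable in the example carries the interaction history between requests. One way to accumulate it, assuming a simple one-line-per-step text format (the format itself is our assumption, not prescribed by the model card):

```python
def append_history(prev_txt: str, step: int, action: dict) -> str:
    """Append one executed action to the interaction-history string."""
    line = "Step {}: {} on {} (value={}, point_2d={})".format(
        step, action["action_type"], action["action_target"],
        action["value"], action["point_2d"],
    )
    return prev_txt + line + "\n"

history = ""
history = append_history(history, 1, {
    "action_type": "Click",
    "action_target": "search box",
    "value": "None",
    "point_2d": [640, 42],
})
print(history)
```

The resulting string would be passed as `prev_txt` when formatting `question_description` for the next step.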