Safetensors · English · qwen2_5_vl · VLM · GUI · agent

Add pipeline tag, library metadata, and paper link

#1 opened by nielsr (HF Staff)
Files changed (1): README.md (+42 −28)
README.md CHANGED
@@ -1,34 +1,43 @@
  ---
- license: apache-2.0
  datasets:
  - GUI-Libra/GUI-Libra-81K-RL
  - GUI-Libra/GUI-Libra-81K-SFT
  language:
  - en
- base_model:
- - Qwen/Qwen2.5-VL-7B-Instruct
  tags:
  - VLM
  - GUI
  - agent
  ---

- # Introduction

- The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".

- **GitHub:** https://github.com/GUI-Libra/GUI-Libra
- **Website:** https://GUI-Libra.github.io

  # Usage
  ## 1) Start an OpenAI-compatible vLLM server

  ```bash
  pip install -U vllm
- vllm serve GUI-Libra/GUI-Libra-4B --port 8000 --api-key token-abc123
- ````

  * Endpoint: `http://localhost:8000/v1`
  * The `api_key` here must match `--api-key`.
@@ -79,11 +88,19 @@ action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right

  task_desc = 'Go to Amazon.com and buy a math book'
  prev_txt = ''
- question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.\n\nInstruction: {}\n\nInteraction History: {}\n'''
  img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
  query = question_description.format(img_size_string, task_desc, prev_txt)

- query = query + '\n' + '''The response should be structured in the following format:
  <think>Your step-by-step thought process here...</think>
  <answer>
  {
@@ -101,7 +118,7 @@ resp = client.chat.completions.create(
  {"role": "user", "content": [
  {"type": "image_url",
  "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
- {"type": "text", "text": prompt},
  ]},
  ],
  temperature=0.0,
@@ -111,19 +128,16 @@ resp = client.chat.completions.create(
  print(resp.choices[0].message.content)
  ```

- Run:
-
- ```bash
- python minimal_infer.py
- ```
-
- ---
-
- ## Notes
-
- * Replace `screen.png` with your own screenshot file.
- * If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
- * The example assumes your vLLM server is running locally on port `8000`.
  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  datasets:
  - GUI-Libra/GUI-Libra-81K-RL
  - GUI-Libra/GUI-Libra-81K-SFT
  language:
  - en
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  tags:
  - VLM
  - GUI
  - agent
  ---

+ # GUI-Libra-7B

+ GUI-Libra is a native GUI agent model designed to reason and act on UI screenshots. It is presented in the paper [GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL](https://huggingface.co/papers/2602.22190).

+ **GitHub:** [GUI-Libra/GUI-Libra](https://github.com/GUI-Libra/GUI-Libra)
+ **Website:** [GUI-Libra Project Page](https://GUI-Libra.github.io)

+ ## Introduction

+ GUI-Libra is a post-training framework that turns open-source VLMs into strong native GUI agents. These models perceive a screenshot, think step by step using Chain-of-Thought (CoT), and output executable actions within a single forward pass. This version is based on the [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) architecture.
+
+ The framework addresses three main challenges in GUI agent training:
+ 1. **Scarcity of action-aligned reasoning data**: mitigated by the GUI-Libra-81K dataset.
+ 2. **Grounding vs. reasoning**: handled via action-aware SFT that balances the thought process against coordinate accuracy.
+ 3. **Partial verifiability**: addressed with conservative RL (KL-regularized GRPO).

  # Usage
  ## 1) Start an OpenAI-compatible vLLM server

  ```bash
  pip install -U vllm
+ vllm serve GUI-Libra/GUI-Libra-7B --port 8000 --api-key token-abc123
+ ```

  * Endpoint: `http://localhost:8000/v1`
  * The `api_key` here must match `--api-key`.
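The endpoint/key pairing above can be sanity-checked without the `openai` package. A minimal stdlib sketch (not part of the PR): it only builds the request and never sends it; `token-abc123` and the model name are taken from the `vllm serve` line.

```python
# Build (but do not send) an OpenAI-compatible /chat/completions request,
# showing how the Bearer token must equal the server's --api-key.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
API_KEY = "token-abc123"  # must match the --api-key passed to `vllm serve`

def build_chat_request(model: str, messages: list) -> urllib.request.Request:
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # Bearer token == --api-key
        },
        method="POST",
    )

req = build_chat_request("GUI-Libra/GUI-Libra-7B",
                         [{"role": "user", "content": "ping"}])
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```

Sending `req` with `urllib.request.urlopen` would then hit the running vLLM server exactly like the OpenAI client in the README's example script.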
 
  task_desc = 'Go to Amazon.com and buy a math book'
  prev_txt = ''
+ # Note: replace img_size with your screenshot dimensions, e.g., [1920, 1080]
+ img_size = [1920, 1080]
+ question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.
+
+ Instruction: {}
+
+ Interaction History: {}
+ '''
  img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
  query = question_description.format(img_size_string, task_desc, prev_txt)

+ query = query + '\n' + '''The response should be structured in the following format:
  <think>Your step-by-step thought process here...</think>
  <answer>
  {
 
  {"role": "user", "content": [
  {"type": "image_url",
  "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
+ {"type": "text", "text": query},
  ]},
  ],
  temperature=0.0,
 
  print(resp.choices[0].message.content)
  ```

+ ## Citation
+
+ ```bibtex
+ @misc{yang2026guilibratrainingnativegui,
+   title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL},
+   author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
+   year={2026},
+   eprint={2602.22190},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2602.22190},
+ }
+ ```
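The card's prompt asks the model to reply as `<think>...</think>` followed by an `<answer>` block holding a JSON action (fields such as `action_type`, `action_target`, `value`). One way to parse that format — a sketch, not repo code; the sample reply below is invented, and it assumes the answer body is valid JSON (so `action_target: None` is written as `null`):

```python
# Extract the JSON action from a <think>/<answer> structured response.
import json
import re

def parse_action(text: str) -> dict:
    """Return the dict inside the first <answer>...</answer> block."""
    m = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", text, re.DOTALL)
    if m is None:
        raise ValueError("no <answer> block found in model output")
    return json.loads(m.group(1))

# Invented sample reply; field names follow the README's action schema.
sample = (
    "<think>The page must first be scrolled to reveal results.</think>\n"
    "<answer>\n"
    '{"action_type": "Scroll", "action_target": null, "value": "down"}\n'
    "</answer>"
)
action = parse_action(sample)
print(action)  # {'action_type': 'Scroll', 'action_target': None, 'value': 'down'}
```

A wrapper around the README's inference script could call `parse_action(resp.choices[0].message.content)` and dispatch on `action["action_type"]`.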