Add pipeline tag, library metadata, and paper link #1
opened by nielsr (HF Staff)

README.md CHANGED
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- GUI-Libra/GUI-Libra-81K-RL
- GUI-Libra/GUI-Libra-81K-SFT
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VLM
- GUI
- agent
---

# GUI-Libra-7B

GUI-Libra is a native GUI agent model designed to reason and act based on UI screenshots. It is presented in the paper [GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL](https://huggingface.co/papers/2602.22190).

**GitHub:** [GUI-Libra/GUI-Libra](https://github.com/GUI-Libra/GUI-Libra)
**Website:** [GUI-Libra Project Page](https://GUI-Libra.github.io)

## Introduction

GUI-Libra is a post-training framework that transforms open-source VLMs into strong native GUI agents: models that perceive a screenshot, reason step by step with Chain-of-Thought (CoT), and output executable actions in a single forward pass. This version is based on the [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) architecture.

The framework addresses three main challenges in GUI agent training:
1. **Scarcity of action-aligned reasoning data**: mitigated by the GUI-Libra-81K dataset.
2. **Grounding vs. reasoning**: addressed via action-aware SFT that balances the thought process with coordinate accuracy.
3. **Partial verifiability**: addressed with conservative RL (KL-regularized GRPO).

# Usage
## 1) Start an OpenAI-compatible vLLM server

```bash
pip install -U vllm
vllm serve GUI-Libra/GUI-Libra-7B --port 8000 --api-key token-abc123
```

* Endpoint: `http://localhost:8000/v1`
* The `api_key` used by the client must match `--api-key`.
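The request in the usage example below sends the screenshot as a base64 data URL (`img_b64`). That encoding step is elided in this diff, so here is a minimal sketch of preparing it; the `to_data_url` helper name is illustrative, not part of the repository:

```python
import base64


def to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URL for OpenAI-style image_url content."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"


# In practice, read your screenshot file:
#   png_bytes = open("screenshot.png", "rb").read()
# Here we use only the PNG magic header, for illustration.
png_bytes = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_bytes)
print(url[:22])  # data:image/png;base64,
```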
```python
# ...
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
# Note: replace img_size with your screenshot dimensions, e.g., [1920, 1080]
img_size = [1920, 1080]
question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.

Instruction: {}

Interaction History: {}
'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)

query = query + '\n' + '''The response should be structured in the following format:
<think>Your step-by-step thought process here...</think>
<answer>
{
# ...
resp = client.chat.completions.create(
    # ...
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}", "detail": "high"}},
            {"type": "text", "text": query},
        ]},
    ],
    temperature=0.0,
    # ...
)
print(resp.choices[0].message.content)
```
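The prompt above instructs the model to wrap its reasoning in `<think>` tags and its action in an `<answer>` block. A minimal sketch of splitting a response into the two parts, assuming the answer body is JSON with the `action_type` / `action_target` / `value` fields from the action space; the `parse_response` helper and the sample payload are illustrative:

```python
import json
import re


def parse_response(text: str):
    """Split a model response into its <think> reasoning and parsed <answer> payload."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    thought = think.group(1).strip() if think else ""
    action = json.loads(answer.group(1)) if answer else None
    return thought, action


sample = """<think>The search box is at the top of the page.</think>
<answer>
{"action_type": "Click", "action_target": [512, 80], "value": null}
</answer>"""
thought, action = parse_response(sample)
print(action["action_type"])  # Click
```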
## Citation

```bibtex
@misc{yang2026guilibratrainingnativegui,
      title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL},
      author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
      year={2026},
      eprint={2602.22190},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.22190},
}
```