Add pipeline_tag, library_name and paper reference (#1)
Opened by nielsr (HF Staff)

README.md (CHANGED)
Before:

````diff
@@ -1,34 +1,35 @@
 ---
-
 datasets:
 - GUI-Libra/GUI-Libra-81K-RL
 - GUI-Libra/GUI-Libra-81K-SFT
 language:
 - en
-
-
 tags:
 - VLM
 - GUI
 - agent
 ---

-#
-
-The models from paper "GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL".


-
-**Website:** https://GUI-Libra.github.io


 # Usage
 ## 1) Start an OpenAI-compatible vLLM server

 ```bash
 pip install -U vllm
-vllm serve GUI-Libra/GUI-Libra-
-```

 * Endpoint: `http://localhost:8000/v1`
 * The `api_key` here must match `--api-key`.
````
````diff
@@ -79,11 +80,17 @@ action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right"

 task_desc = 'Go to Amazon.com and buy a math book'
 prev_txt = ''
-question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.
 img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
 query = question_description.format(img_size_string, task_desc, prev_txt)

-query = query + '
 <think>Your step-by-step thought process here...</think>
 <answer>
 {
````
````diff
@@ -111,19 +118,18 @@ resp = client.chat.completions.create(
 print(resp.choices[0].message.content)
 ```

-
-
-```bash
-python minimal_infer.py
-```
-
----
-
-## Notes
-
-* Replace `screen.png` with your own screenshot file.
-* If you hit OOM or slowdowns, reduce image size or run fewer concurrent requests.
-* The example assumes your vLLM server is running locally on port `8000`.
````
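The note above about OOM suggests reducing image size. One way to pick a smaller size while preserving aspect ratio is sketched below; the 1280-pixel cap and the helper name are arbitrary choices for illustration, not values from the model card.

```python
def fit_within(width: int, height: int, max_side: int = 1280) -> tuple:
    """Scale (width, height) down so the longer side is at most max_side,
    preserving aspect ratio; never upscales."""
    scale = min(1.0, max_side / max(width, height))
    return (round(width * scale), round(height * scale))
```

Resize the screenshot to the returned size with your imaging library of choice before encoding and sending it.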
After:

````diff
@@ -1,34 +1,35 @@
 ---
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
 datasets:
 - GUI-Libra/GUI-Libra-81K-RL
 - GUI-Libra/GUI-Libra-81K-SFT
 language:
 - en
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
 tags:
 - VLM
 - GUI
 - agent
 ---

+# GUI-Libra-3B

+[**Project Page**](https://gui-libra.github.io) | [**Paper**](https://huggingface.co/papers/2602.22190) | [**GitHub**](https://github.com/GUI-Libra/GUI-Libra)

+GUI-Libra is a post-training framework that turns open-source VLMs into strong native GUI agents—models that see a screenshot, think step-by-step, and output an executable action, all within a single forward pass.

+This model is fine-tuned from [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) using action-aware SFT and conservative reinforcement learning (GRPO). It addresses challenges such as action-grounding alignment and partial verifiability in GUI navigation tasks.

 # Usage
 ## 1) Start an OpenAI-compatible vLLM server

 ```bash
 pip install -U vllm
+vllm serve GUI-Libra/GUI-Libra-3B --port 8000 --api-key token-abc123
+```

 * Endpoint: `http://localhost:8000/v1`
 * The `api_key` here must match `--api-key`.
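The card's later snippet sends a `screen.png` screenshot to this endpoint. OpenAI-compatible servers accept images in chat messages as base64 data URLs; a minimal encoding helper is sketched below (the function name is ours, not from the card).

```python
import base64
from pathlib import Path

def png_to_data_url(path: str) -> str:
    """Encode a PNG screenshot as a base64 data URL, the form accepted
    by OpenAI-compatible image_url content parts."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return "data:image/png;base64," + b64
```

Pass the returned string as `{"type": "image_url", "image_url": {"url": ...}}` inside the user message's content list.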
````diff
@@ -79,11 +80,17 @@ action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right"

 task_desc = 'Go to Amazon.com and buy a math book'
 prev_txt = ''
+question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.
+
+Instruction: {}
+
+Interaction History: {}
+'''
 img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
 query = question_description.format(img_size_string, task_desc, prev_txt)

+query = query + '
+' + '''The response should be structured in the following format:
 <think>Your step-by-step thought process here...</think>
 <answer>
 {
````
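The prompt-construction lines in the hunk above run on their own once `img_size` is supplied; a self-contained sketch is below (the 1920x1080 size is a placeholder, not a value from the card).

```python
# Rebuild `query` as in the card's snippet; img_size is a placeholder here.
task_desc = 'Go to Amazon.com and buy a math book'
prev_txt = ''
img_size = (1920, 1080)  # placeholder; use the real screenshot size

question_description = '''Please generate the next move according to the UI screenshot {}, instruction and previous actions.

Instruction: {}

Interaction History: {}
'''
img_size_string = '(original image size {}x{})'.format(img_size[0], img_size[1])
query = question_description.format(img_size_string, task_desc, prev_txt)
```

The three `{}` placeholders are filled positionally: image size first, then the instruction, then the interaction history.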
````diff
@@ -111,19 +118,18 @@ resp = client.chat.completions.create(
 print(resp.choices[0].message.content)
 ```

+## Citation

+If you find GUI-Libra useful for your research, please cite:

+```bibtex
+@misc{yang2026guilibratrainingnativegui,
+  title={GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL},
+  author={Rui Yang and Qianhui Wu and Zhaoyang Wang and Hanyang Chen and Ke Yang and Hao Cheng and Huaxiu Yao and Baoling Peng and Huan Zhang and Jianfeng Gao and Tong Zhang},
+  year={2026},
+  eprint={2602.22190},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG},
+  url={https://arxiv.org/abs/2602.22190},
+}
+```
````