nielsr (HF Staff) committed on
Commit 5d222ce · verified · Parent: 3a7156b

Improve model card: Add pipeline tag, library name, and project page


This PR enhances the model card for UI-Venus by:

* Adding `pipeline_tag: image-text-to-text` to the metadata, making the model discoverable under the relevant task.
* Adding `library_name: transformers` to the metadata, enabling the "Use in Transformers" widget.
* Including the project page URL and an explicit link to the GitHub repository for better context and navigation.

Please review and merge this PR if everything looks good.
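Both new keys live in the README's YAML front matter (the block between the two leading `---` fences). As a quick illustration of where they sit, here is a minimal sketch that extracts that block with only the standard library; `front_matter` is a hypothetical helper, not part of the Hub tooling, and it assumes flat `key: value` pairs (a real model card would be parsed with a YAML library).

```python
# Sketch: pull the simple `key: value` front-matter block out of a model card.
# `front_matter` is an illustrative helper, not a Hugging Face Hub API.

README = """\
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

### UI-Venus
"""

def front_matter(text: str) -> dict:
    """Parse the flat `key: value` block between the leading `---` fences."""
    block = text.split("---")[1]  # content between the first two fences
    pairs = (line.split(":", 1) for line in block.strip().splitlines())
    return {key.strip(): value.strip() for key, value in pairs}

meta = front_matter(README)
print(meta["pipeline_tag"])   # image-text-to-text
print(meta["library_name"])   # transformers
```

With `library_name: transformers` present, the Hub can offer the "Use in Transformers" snippet, and `pipeline_tag: image-text-to-text` determines which task filter lists the model.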

Files changed (1)
  1. README.md +63 -10
README.md CHANGED
@@ -1,8 +1,15 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 ### UI-Venus
-This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833). UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks including Screenspot-v2/Pro and AndroidWorld.
+This repository contains the UI-Venus model from the report [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833).
+Project page: https://osatlas.github.io/
+Code: https://github.com/inclusionAI/UI-Venus
+
+UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. We will continue to update results on standard benchmarks including Screenspot-v2/Pro and AndroidWorld.
 
 
 
@@ -34,12 +41,8 @@ Key innovations include:
 - **Efficient Data Cleaning**: Trained on several hundred thousand high-quality samples to ensure robustness.
 - **Self-Evolving Trajectory History Alignment & Sparse Action Enhancement**: Improves reasoning coherence and action distribution for better long-horizon planning.
 
-
-
-
-
 ---
-## Installation
+## Installation
 
 First, install the required dependencies:
 
@@ -49,8 +52,7 @@ pip install transformers==4.49.0 qwen-vl-utils
 ---
 
 
-
-## Quick Start
+## Quick Start
 
 Use the shell scripts to launch the evaluation. The evaluation setup follows the same protocol as **ScreenSpot**, including data format, annotation structure, and metric calculation.
 
@@ -150,7 +152,7 @@ def inference(instruction, image_path):
     return result_dict
 ```
 ---
-### Results on ScreenSpot-v2
+### Results on ScreenSpot-v2
 
 | **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
 |--------------------------|-----------------|-----------------|------------------|------------------|--------------|--------------|----------|
@@ -193,6 +195,57 @@ Scores are in percentage (%). `T` = Text, `I` = Icon.
 > 🔍 **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2(95.3), OSWorld_G(69.8), AgentCPM(84.7), and UI-Vision(38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**
 
 
+### Results on AndroidWorld
+This is the compressed package of validation trajectories for **AndroidWorld**, including execution logs and navigation paths.
+📥 Download: [UI-Venus-androidworld.zip](vis_androidworld/UI-Venus-androidworld.zip)
+
+| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
+|--------|--------------|-----------|------------|------------------------|
+| **Closed-source Models** | | | | |
+| GPT-4o | ❌ | ✅ | ❌ | 30.6 |
+| ScaleTrack | ❌ | ✅ | ❌ | 44.0 |
+| SeedVL-1.5 | ❌ | ✅ | ✅ | 62.1 |
+| UI-TARS-1.5 | ❌ | ❌ | ✅ | 64.2 |
+| **Open-source Models** | | | | |
+| GUI-Critic-R1-7B | ❌ | ✅ | ✅ | 27.6 |
+| Qwen2.5-VL-72B* | ❌ | ❌ | ✅ | 35.0 |
+| UGround | ✅ | ❌ | ✅ | 44.0 |
+| Aria-UI | ✅ | ❌ | ✅ | 44.8 |
+| UI-TARS-72B | ❌ | ❌ | ✅ | 46.6 |
+| GLM-4.5v | ❌ | ❌ | ✅ | 57.0 |
+| **Ours** | | | | |
+| UI-Venus-Navi-7B | ❌ | ❌ | ✅ | **49.1** |
+| UI-Venus-Navi-72B | ❌ | ❌ | ✅ | **65.9** |
+
+> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.
+
+
+### Results on AndroidControl and GUI-Odyssey
+
+| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
+|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
+| **Closed-source Models** | | | | | | |
+| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
+| **Open Source Models** | | | | | | |
+| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
+| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
+| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
+| Aguvis-7B | - | 80.5 | - | 61.5 | - | - |
+| Aguvis-72B | - | 84.4 | - | 66.4 | - | - |
+| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
+| UI-TARS-7B | 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
+| UI-TARS-72B | **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
+| GUI-R1-7B | 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
+| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
+| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
+| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
+| **Ours** | | | | | | |
+| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
+| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |
+
+> **Table:** Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.
+
+
 # Citation
 Please consider citing if you find our work useful:
 ```plain
@@ -205,4 +258,4 @@ Please consider citing if you find our work useful:
   primaryClass={cs.CV},
   url={https://arxiv.org/abs/2508.10833},
 }
-```
+```