Add pipeline tag and library name to model card

#1, opened by nielsr (HF Staff)

Files changed (1): README.md (+66 -14)
README.md CHANGED

````diff
@@ -1,6 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 ### UI-Venus
 This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833). UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.
 
@@ -36,8 +39,6 @@ Key innovations include:
 
 
 
-
-
 ---
 ## Installation
 
@@ -152,17 +153,17 @@ def inference(instruction, image_path):
 ---
 ### Results on ScreenSpot-v2
 
-| **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
-|--------------------------|-----------------|-----------------|------------------|------------------|--------------|--------------|----------|
-| UI-TARS-1.5 | - | - | - | - | - | - | 94.2 |
-| Seed-1.5-VL | - | - | - | - | - | - | 95.2 |
-| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
-| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
-| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
-| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
-| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
-| **UI-Venus-Ground-7B (Ours)** | **99.0** | **90.0** | **97.0** | **90.7** | **96.2** | **88.7** | **94.1** |
-| **UI-Venus-Ground-72B (Ours)** | **99.7** | **93.8** | **95.9** | **90.0** | **96.2** | **92.6** | **95.3** |
+| **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
+|---|---|---|---|---|---|---|---|
+| UI-TARS-1.5 | - | - | - | - | - | - | 94.2 |
+| Seed-1.5-VL | - | - | - | - | - | - | 95.2 |
+| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
+| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
+| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
+| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
+| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
+| **UI-Venus-Ground-7B (Ours)** | **99.0** | **90.0** | **97.0** | **90.7** | **96.2** | **88.7** | **94.1** |
+| **UI-Venus-Ground-72B (Ours)** | **99.7** | **93.8** | **95.9** | **90.0** | **96.2** | **92.6** | **95.3** |
 
 ---
 
@@ -193,6 +194,57 @@ Scores are in percentage (%). `T` = Text, `I` = Icon.
 > πŸ” **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld_G (69.8), AgentCPM (84.7), and UI-Vision (38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**
 
 
+### Results on AndroidWorld
+This compressed package contains the validation trajectories for **AndroidWorld**, including execution logs and navigation paths.
+πŸ“₯ Download: [UI-Venus-androidworld.zip](vis_androidworld/UI-Venus-androidworld.zip)
+
+| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
+|--------|--------------|-----------|------------|------------------------|
+| **Closed-source Models** | | | | |
+| GPT-4o | ❌ | βœ… | ❌ | 30.6 |
+| ScaleTrack | ❌ | βœ… | ❌ | 44.0 |
+| SeedVL-1.5 | ❌ | βœ… | βœ… | 62.1 |
+| UI-TARS-1.5 | ❌ | ❌ | βœ… | 64.2 |
+| **Open-source Models** | | | | |
+| GUI-Critic-R1-7B | ❌ | βœ… | βœ… | 27.6 |
+| Qwen2.5-VL-72B* | ❌ | ❌ | βœ… | 35.0 |
+| UGround | βœ… | ❌ | βœ… | 44.0 |
+| Aria-UI | βœ… | ❌ | βœ… | 44.8 |
+| UI-TARS-72B | ❌ | ❌ | βœ… | 46.6 |
+| GLM-4.5v | ❌ | ❌ | βœ… | 57.0 |
+| **Ours** | | | | |
+| UI-Venus-Navi-7B | ❌ | ❌ | βœ… | **49.1** |
+| UI-Venus-Navi-72B | ❌ | ❌ | βœ… | **65.9** |
+
+> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.
+
+
+### Results on AndroidControl and GUI-Odyssey
+
+| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
+|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
+| **Closed-source Models** | | | | | | |
+| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
+| **Open-source Models** | | | | | | |
+| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
+| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
+| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
+| Aguvis-7B | - | 80.5 | - | 61.5 | - | - |
+| Aguvis-72B | - | 84.4 | - | 66.4 | - | - |
+| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
+| UI-TARS-7B | 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
+| UI-TARS-72B | **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
+| GUI-R1-7B | 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
+| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
+| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
+| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
+| **Ours** | | | | | | |
+| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
+| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |
+
+> **Table:** Performance comparison on offline UI navigation datasets, including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.
+
+
 # Citation
 Please consider citing if you find our work useful:
 ```plain
@@ -205,4 +257,4 @@ Please consider citing if you find our work useful:
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2508.10833},
 }
-```
+```
````
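The substance of this PR is the two new keys in the README's YAML front matter. As a quick sanity check, a reviewer could parse the metadata block and confirm the keys landed. The sketch below handles only flat `key: value` front matter with plain string handling (real model cards can nest YAML, which would need an actual YAML parser), and `parse_front_matter` is our own hypothetical helper, not Hub tooling:

```python
# Minimal sketch: verify a model card carries the metadata this PR adds.
# Only supports flat "key: value" lines between the first two --- fences.
def parse_front_matter(readme_text: str) -> dict:
    # Front matter is the text between the first two '---' delimiters.
    block = readme_text.split("---")[1]
    meta = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

readme = """---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

### UI-Venus
"""

meta = parse_front_matter(readme)
print(meta["pipeline_tag"])  # image-text-to-text
print(meta["library_name"])  # transformers
```

On the Hub, `pipeline_tag` drives the task filter the model appears under, and `library_name` selects the code snippet widget, which is why metadata-only PRs like this one are worth merging.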