nielsr (HF Staff) committed on
Commit 5d222ce · verified · Parent: 3a7156b

Improve model card: Add pipeline tag, library name, and project page


This PR enhances the model card for UI-Venus by:

* Adding `pipeline_tag: image-text-to-text` to the metadata, making the model discoverable under the relevant task.
* Adding `library_name: transformers` to the metadata, enabling the "Use in Transformers" widget.
* Including the project page URL and an explicit link to the GitHub repository for better context and navigation.

Please review and merge this PR if everything looks good.
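Both new keys live in the README's YAML front matter (the block between the two leading `---` fences). As a quick illustration of where they sit, here is a minimal sketch that extracts that block with only the standard library; `front_matter` is a hypothetical helper, not part of the Hub tooling, and it assumes flat `key: value` pairs (a real model card would be parsed with a YAML library).

```python
# Sketch: pull the simple `key: value` front-matter block out of a model card.
# `front_matter` is an illustrative helper, not a Hugging Face Hub API.

README = """\
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

### UI-Venus
"""

def front_matter(text: str) -> dict:
    """Parse the flat `key: value` block between the leading `---` fences."""
    block = text.split("---")[1]  # content between the first two fences
    pairs = (line.split(":", 1) for line in block.strip().splitlines())
    return {key.strip(): value.strip() for key, value in pairs}

meta = front_matter(README)
print(meta["pipeline_tag"])   # image-text-to-text
print(meta["library_name"])   # transformers
```

With `library_name: transformers` present, the Hub can offer the "Use in Transformers" snippet, and `pipeline_tag: image-text-to-text` determines which task filter lists the model.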

Files changed (1)
  1. README.md +63 -10
README.md CHANGED
@@ -1,8 +1,15 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 ### UI-Venus
-This repository contains the UI-Venus model from the report [UI-Venus: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833). UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks including Screenspot-v2/Pro and AndroidWorld.
+This repository contains the UI-Venus model from the report [UI-Venus Technical Report: Building High-performance UI Agents with RFT](https://arxiv.org/abs/2508.10833).
+Project page: https://osatlas.github.io/
+Code: https://github.com/inclusionAI/UI-Venus
+
+UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. We will continue to update results on standard benchmarks including Screenspot-v2/Pro and AndroidWorld.
 
 
 
@@ -34,12 +41,8 @@ Key innovations include:
 - **Efficient Data Cleaning**: Trained on several hundred thousand high-quality samples to ensure robustness.
 - **Self-Evolving Trajectory History Alignment & Sparse Action Enhancement**: Improves reasoning coherence and action distribution for better long-horizon planning.
 
-
-
-
-
 ---
-## Installation
+## Installation
 
 First, install the required dependencies:
 
@@ -49,8 +52,7 @@ pip install transformers==4.49.0 qwen-vl-utils
 ---
 
 
-
-## Quick Start
+## Quick Start
 
 Use the shell scripts to launch the evaluation. The evaluation setup follows the same protocol as **ScreenSpot**, including data format, annotation structure, and metric calculation.
 
@@ -150,7 +152,7 @@ def inference(instruction, image_path):
     return result_dict
 ```
 ---
-### Results on ScreenSpot-v2
+### Results on ScreenSpot-v2
 
 | **Model** | **Mobile Text** | **Mobile Icon** | **Desktop Text** | **Desktop Icon** | **Web Text** | **Web Icon** | **Avg.** |
 |--------------------------|-----------------|-----------------|------------------|------------------|--------------|--------------|----------|
@@ -193,6 +195,57 @@ Scores are in percentage (%). `T` = Text, `I` = Icon.
 > 🔍 **Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2(95.3), OSWorld_G(69.8), AgentCPM(84.7), and UI-Vision(38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.**
 
 
+### Results on AndroidWorld
+This is the compressed package of validation trajectories for **AndroidWorld**, including execution logs and navigation paths.
+📥 Download: [UI-Venus-androidworld.zip](vis_androidworld/UI-Venus-androidworld.zip)
+
+| Models | With Planner | A11y Tree | Screenshot | Success Rate (pass@1) |
+|--------|--------------|-----------|------------|------------------------|
+| **Closed-source Models** | | | | |
+| GPT-4o | ❌ | ✅ | ❌ | 30.6 |
+| ScaleTrack | ❌ | ✅ | ❌ | 44.0 |
+| SeedVL-1.5 | ❌ | ✅ | ✅ | 62.1 |
+| UI-TARS-1.5 | ❌ | ❌ | ✅ | 64.2 |
+| **Open-source Models** | | | | |
+| GUI-Critic-R1-7B | ❌ | ✅ | ✅ | 27.6 |
+| Qwen2.5-VL-72B* | ❌ | ❌ | ✅ | 35.0 |
+| UGround | ✅ | ❌ | ✅ | 44.0 |
+| Aria-UI | ✅ | ❌ | ✅ | 44.8 |
+| UI-TARS-72B | ❌ | ❌ | ✅ | 46.6 |
+| GLM-4.5v | ❌ | ❌ | ✅ | 57.0 |
+| **Ours** | | | | |
+| UI-Venus-Navi-7B | ❌ | ❌ | ✅ | **49.1** |
+| UI-Venus-Navi-72B | ❌ | ❌ | ✅ | **65.9** |
+
+> **Table:** Performance comparison on **AndroidWorld** for end-to-end models. Our UI-Venus-Navi-72B achieves state-of-the-art performance, outperforming all baseline methods across different settings.
+
+
+### Results on AndroidControl and GUI-Odyssey
+
+| Models | AndroidControl-Low<br>Type Acc. | AndroidControl-Low<br>Step SR | AndroidControl-High<br>Type Acc. | AndroidControl-High<br>Step SR | GUI-Odyssey<br>Type Acc. | GUI-Odyssey<br>Step SR |
+|--------|-------------------------------|-----------------------------|-------------------------------|-----------------------------|------------------------|----------------------|
+| **Closed-source Models** | | | | | | |
+| GPT-4o | 74.3 | 19.4 | 66.3 | 20.8 | 34.3 | 3.3 |
+| **Open Source Models** | | | | | | |
+| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 59.5 | 46.3 |
+| SeeClick | 93.0 | 75.0 | 82.9 | 59.1 | 71.0 | 53.9 |
+| OS-Atlas-7B | 93.6 | 85.2 | 85.2 | 71.2 | 84.5 | 62.0 |
+| Aguvis-7B | - | 80.5 | - | 61.5 | - | - |
+| Aguvis-72B | - | 84.4 | - | 66.4 | - | - |
+| OS-Genesis-7B | 90.7 | 74.2 | 66.2 | 44.5 | - | - |
+| UI-TARS-7B | 98.0 | 90.8 | 83.7 | 72.5 | 94.6 | 87.0 |
+| UI-TARS-72B | **98.1** | 91.3 | 85.2 | 74.7 | **95.4** | **88.6** |
+| GUI-R1-7B | 85.2 | 66.5 | 71.6 | 51.7 | 65.5 | 38.8 |
+| NaviMaster-7B | 85.6 | 69.9 | 72.9 | 54.0 | - | - |
+| UI-AGILE-7B | 87.7 | 77.6 | 80.1 | 60.6 | - | - |
+| AgentCPM-GUI | 94.4 | 90.2 | 77.7 | 69.2 | 90.0 | 75.0 |
+| **Ours** | | | | | | |
+| UI-Venus-Navi-7B | 97.1 | 92.4 | **86.5** | 76.1 | 87.3 | 71.5 |
+| UI-Venus-Navi-72B | 96.7 | **92.9** | 85.9 | **77.2** | 87.2 | 72.4 |
+
+> **Table:** Performance comparison on offline UI navigation datasets including AndroidControl and GUI-Odyssey. Note that models with * are reproduced.
+
+
 # Citation
 Please consider citing if you find our work useful:
 ```plain
@@ -205,4 +258,4 @@ Please consider citing if you find our work useful:
   primaryClass={cs.CV},
   url={https://arxiv.org/abs/2508.10833},
 }
-```
+```