---
license: apache-2.0
---
# UI-TARS: Pioneering Automated GUI Interaction with Native Agents

## Overview

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

## Core Features
### Perception

- **Comprehensive GUI Understanding**: Processes multimodal inputs (text, images, interactions) to build a coherent understanding of interfaces.
- **Real-Time Interaction**: Continuously monitors dynamic GUIs and responds accurately to changes in real time.
### Action

- **Unified Action Space**: Standardized action definitions across platforms (desktop, mobile, and web).
- **Platform-Specific Actions**: Supports additional actions such as hotkeys, long press, and platform-specific gestures.
### Reasoning

- **System 1 & System 2 Reasoning**: Combines fast, intuitive responses with deliberate, high-level planning for complex tasks.
- **Task Decomposition & Reflection**: Supports multi-step planning, reflection, and error correction for robust task execution.
### Memory

- **Short-Term Memory**: Captures task-specific context for situational awareness.
- **Long-Term Memory**: Retains historical interactions and knowledge for improved decision-making.
## Capabilities

- **Cross-Platform Interaction**: Supports desktop, mobile, and web environments with a unified action framework.
- **Multi-Step Task Execution**: Trained to handle complex tasks through multi-step trajectories and reasoning.
- **Learning from Synthetic and Real Data**: Combines large-scale annotated and synthetic datasets for improved generalization and robustness.
## Training Pipeline

1. **Pre-Training**: Leveraging large-scale GUI-specific datasets for foundational learning.
2. **Supervised Fine-Tuning**: Fine-tuning on human-annotated and synthetic multi-step task data.
3. **Continual Learning**: Employing online trace bootstrapping and reinforcement learning for continual improvement.
## Evaluation Metrics

- **Step-Level Metrics**: Element accuracy, operation F1 score, and step success rate.
- **Task-Level Metrics**: Complete match and partial match scores for overall task success.
- **Other Metrics**: Measures for execution efficiency, safety, robustness, and adaptability.
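As a rough illustration of how two of the step-level metrics could be computed — the README does not fix the exact formulas, so these definitions (token-level F1 over operation strings, success rate as a simple fraction) are assumptions for exposition:

```python
def operation_f1(pred_tokens, gold_tokens):
    """Token-level F1 between a predicted and a reference operation.

    Illustrative definition only; the official metric may differ.
    """
    common = set(pred_tokens) & set(gold_tokens)
    if not pred_tokens or not gold_tokens or not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def step_success_rate(step_results):
    """Fraction of steps in a trajectory judged successful."""
    return sum(step_results) / len(step_results) if step_results else 0.0
```

For example, a trajectory in which three of four steps succeed has a step success rate of 0.75, and a predicted operation sharing no tokens with the reference scores an F1 of 0.0.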
## License

UI-TARS is licensed under the Apache License 2.0.
## Acknowledgements

This project builds upon and extends the capabilities of Qwen-2-VL, a powerful vision-language model that serves as the foundational architecture for UI-TARS. We acknowledge the contributions of the developers and researchers behind Qwen-2-VL for their groundbreaking work in multimodal AI and for providing a robust base for further advancements.

Additionally, we thank the broader open-source community for the datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.