| | --- |
| | license: odc-by |
| | --- |
| | #### Model for the paper: [Harnessing Webpage UIs for Text-Rich Visual Understanding](https://arxiv.org/abs/2410.13824) |
| |
|
| | 🌐 [Homepage](https://neulab.github.io/MultiUI/) | 🐍 [GitHub](https://github.com/neulab/multiui) | 📖 [arXiv](https://arxiv.org/abs/2410.13824) |
| |
|
| | ## Introduction |
| | We introduce **MultiUI**, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on **MultiUI** not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in action accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. |
| |
|
| | <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/65403d8781a8731a1c09a584/vk7yT4Y7ydBOHM6BojmlI.mp4"></video> |
| |
|
| | ## Model Performance |
| |
|
| |  |
| |
|
| |  |
| |
|
| |  |
| |
|
| | ## Contact |
| | * Junpeng Liu: jpliu@link.cuhk.edu.hk |
| | * Xiang Yue: xyue2@andrew.cmu.edu |
| |
|
| | ## Citation |
| | If you find this work helpful, please cite our paper: |
| | ```` |
| | @misc{liu2024harnessingwebpageuistextrich, |
| | title={Harnessing Webpage UIs for Text-Rich Visual Understanding}, |
| | author={Junpeng Liu and Tianyue Ou and Yifan Song and Yuxiao Qu and Wai Lam and Chenyan Xiong and Wenhu Chen and Graham Neubig and Xiang Yue}, |
| | year={2024}, |
| | eprint={2410.13824}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CV}, |
| | url={https://arxiv.org/abs/2410.13824}, |
| | } |
| | ```` |