---
base_model:
- InternVL/InternVL2-26B
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

## SpiritSight Agent: Advanced GUI Agent with One Look

<p align="center">
<a href="https://arxiv.org/abs/2503.03196">Paper</a> •
<a href="https://huggingface.co/SenseLLM/SpiritSight-Agent-26B">🤗 Models</a> •
<a href="" style="pointer-events: none">Datasets (Coming soon…)</a> •
<a href="https://hzhiyuan.github.io/SpiritSight-Agent">Project Page</a> •
<a href="https://github.com/THU-BPM/SpiritSight">GitHub Page</a>
</p>

## Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels at GUI navigation tasks across a range of platforms.

## Models

We recommend fine-tuning one of the base models below on your own data.

| Model | Checkpoint | Size | License |
|:-------|:------------|:------|:--------|
| SpiritSight-Agent-2B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-2B) | 2B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-8B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-8B) | 8B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-26B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-26B) | 26B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
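
As a starting point for fine-tuning, the checkpoints in the table could be loaded roughly as follows. This is a minimal sketch, not the repository's own loading code: the helper names `repo_id_for` and `load_checkpoint` are hypothetical, and it assumes the checkpoints follow the usual InternVL2 convention of loading through `transformers` with `trust_remote_code=True`. Check the model repository before relying on it.

```python
# Hugging Face repo IDs copied from the checkpoint table.
CHECKPOINTS = {
    "2B": "SenseLLM/SpiritSight-Agent-2B",
    "8B": "SenseLLM/SpiritSight-Agent-8B",
    "26B": "SenseLLM/SpiritSight-Agent-26B",
}


def repo_id_for(size: str) -> str:
    """Map a size label such as "8b" to its Hugging Face repo ID."""
    key = size.strip().upper()
    if key not in CHECKPOINTS:
        raise ValueError(
            f"unknown size {size!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return CHECKPOINTS[key]


def load_checkpoint(size: str):
    """Load a model and tokenizer for the given size label.

    ASSUMPTION: the checkpoint ships InternVL2-style remote code, so it
    loads via AutoModel / AutoTokenizer with trust_remote_code=True.
    """
    # Imported lazily so repo_id_for() works without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    repo_id = repo_id_for(size)
    model = AutoModel.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # half precision to reduce memory use
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
    return model, tokenizer
```

Calling `load_checkpoint("2B")` downloads the smallest checkpoint on first use, which is probably the cheapest way to try a fine-tuning setup before moving to the 8B or 26B model.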

## Datasets

Coming soon.

## Inference
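
```shell
# Create and activate an isolated environment
conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

# Install dependencies; flash-attn builds against packages already in the environment
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

# Run the 26B inference script
python infer_SSAgent-26B.py
```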
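
Because the setup pins both the Python version and the `flash-attn` release, a quick pre-flight check can catch a mismatched environment before the inference script starts loading weights. The sketch below is illustrative only and is not part of the repository; the function name `check_environment` is hypothetical, and the expected versions are simply the ones from the commands above.

```python
import sys
from importlib import metadata

# Versions taken from the setup commands in this README.
EXPECTED_PYTHON = (3, 9)
EXPECTED_FLASH_ATTN = "2.3.6"


def check_environment(python_version=None, flash_attn_version=None):
    """Return a list of mismatch messages (empty if everything matches).

    Both arguments are optional overrides, mainly useful for testing;
    by default the running interpreter and installed packages are inspected.
    """
    problems = []

    if python_version is None:
        python_version = sys.version_info[:2]
    if tuple(python_version) != EXPECTED_PYTHON:
        found = ".".join(str(part) for part in python_version)
        problems.append(f"Python 3.9 expected, found {found}")

    if flash_attn_version is None:
        try:
            flash_attn_version = metadata.version("flash-attn")
        except metadata.PackageNotFoundError:
            flash_attn_version = "not installed"
    if flash_attn_version != EXPECTED_FLASH_ATTN:
        problems.append(
            f"flash-attn {EXPECTED_FLASH_ATTN} expected, found {flash_attn_version}"
        )

    return problems


if __name__ == "__main__":
    # Print each mismatch, or a short confirmation when none are found.
    for message in check_environment() or ["environment matches the pinned versions"]:
        print(message)
```

Other versions may well work; the check only reports where the environment differs from the pinned setup.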

## Citation

If you find this repo useful for your research, please cite our paper:
```
@misc{huang2025spiritsightagentadvancedgui,
title={SpiritSight Agent: Advanced GUI Agent with One Look},
author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
year={2025},
eprint={2503.03196},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.03196},
}
```

## Acknowledgments

We thank the following amazing projects that truly inspired us:

- [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-8B)
- [SeeClick](https://github.com/njucckevin/SeeClick)
- [Mind2Web](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web)
- [GUI-Odyssey](https://github.com/OpenGVLab/GUI-Odyssey)
- [AMEX](https://huggingface.co/datasets/Yuxiang007/AMEX)
- [AndroidControl](https://github.com/google-research/google-research/tree/master/android_control)
- [GUICourse](https://github.com/yiye3/GUICourse)