---
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-3B-Instruct
---

# GUI-AIMA-3B
GUI-AIMA (Aligning Intrinsic Multimodal Attention) is an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding, introduced in the paper [GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding](https://arxiv.org/abs/2511.00810).

The model aligns the intrinsic multimodal attention of multimodal large language models (MLLMs) with patch-wise grounding signals. This approach is highly data-efficient and supports a plug-and-play zoom-in stage for high-resolution grounding without additional fine-tuning.
## Model Details
- Architecture: Based on Qwen2.5-VL-3B-Instruct with a context anchor mechanism for attention-based grounding.
- Task: GUI Grounding (mapping instructions to actionable screen regions).
- Paper: [GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding](https://arxiv.org/abs/2511.00810)
- GitHub Repository: [sjz5202/GUI-AIMA](https://github.com/sjz5202/GUI-AIMA)
## Performance
GUI-AIMA-3B achieves state-of-the-art performance among 3B-parameter models on several GUI grounding benchmarks:
| Benchmark | Accuracy (1-step) | Accuracy (2-step zoom-in) |
|---|---|---|
| ScreenSpot-Pro | 53.8% | 61.5% |
| OSWorld-G | 62.8% | 68.1% |
| ScreenSpot-v2 | 92.1% | - |
| MMBench-GUI-L2 | 79.1% | - |
| UI-Vision | 60.0% | - |
## Usage
Inference requires custom code from the official repository. Please refer to the [GitHub repository](https://github.com/sjz5202/GUI-AIMA) for installation instructions and example scripts (e.g., `eval/example_inference.py`).
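As a rough orientation only, the sketch below shows the generic Qwen2.5-VL-style loading path (commented out, since GUI-AIMA's attention-alignment inference lives in the official repo's custom code) together with a small hypothetical helper for pulling a predicted click point out of the model's text output. The output format and the `parse_click` helper are assumptions for illustration, not part of the official API.

```python
# Hypothetical sketch: GUI-AIMA-3B requires custom inference code from the
# official repository; this only illustrates the surrounding plumbing.
import re

def parse_click(text: str):
    """Extract the first (x, y) coordinate pair from model output text.

    Assumes the model emits a screen point such as '(652, 410)'; returns
    None if no coordinate pair is found. This helper is illustrative, not
    part of the official GUI-AIMA API.
    """
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

# Generic Qwen2.5-VL loading via transformers (the official repo wraps this
# with GUI-AIMA's attention-based grounding logic; left commented so the
# snippet runs without downloading weights):
#
# from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "<this-model-id>", torch_dtype="auto", device_map="auto")
# processor = AutoProcessor.from_pretrained("<this-model-id>")

print(parse_click("click(652, 410)"))  # -> (652, 410)
```

For actual grounding results, use the repository's `eval/example_inference.py`, which implements the attention-alignment and optional zoom-in stages described above.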
## Citation
```bibtex
@misc{zhou2025guiaimaaligningintrinsicmultimodal,
      title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding},
      author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
      year={2025},
      eprint={2511.00810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.00810},
}
```