---
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-3B-Instruct
---

# GUI-AIMA-3B

GUI-AIMA (Aligning Intrinsic Multimodal Attention) is an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding, introduced in the paper [GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding](https://arxiv.org/abs/2511.00810).

The model aligns the intrinsic multimodal attention of Multimodal Large Language Models (MLLMs) with patch-wise grounding signals. This approach is highly data-efficient and allows for the integration of a plug-and-play zoom-in stage for high-resolution grounding without additional fine-tuning.
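To make "coordinate-free" concrete: instead of generating coordinate tokens, a patch-wise attention map over the screenshot can be decoded directly into a click point. The snippet below is a minimal, illustrative sketch of that decoding step (it is not the repository's implementation; the function name and grid shapes are hypothetical):

```python
import numpy as np

def attention_to_click(attn, image_size):
    """Decode a patch-wise attention map into a pixel click point.

    attn: (rows, cols) array of attention weights over image patches.
    image_size: (width, height) of the original screenshot in pixels.
    Returns the pixel center of the highest-attention patch.
    """
    rows, cols = attn.shape
    # Index of the patch with maximal attention.
    r, c = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch index back to the pixel center of that patch.
    x = (c + 0.5) * image_size[0] / cols
    y = (r + 0.5) * image_size[1] / rows
    return x, y

# Toy example: a 4x6 patch grid whose attention peaks at row 1, col 4.
attn = np.zeros((4, 6))
attn[1, 4] = 1.0
print(attention_to_click(attn, (1920, 1080)))  # → (1440.0, 405.0)
```

Because the grounding signal lives in the attention map rather than in generated coordinate strings, the same mechanism transfers to crops of the image without retraining.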

## Model Details

### Performance

GUI-AIMA-3B achieves state-of-the-art performance among 3B-parameter models on several GUI grounding benchmarks:

| Benchmark      | Accuracy (1-step) | Accuracy (2-step zoom-in) |
|----------------|-------------------|---------------------------|
| ScreenSpot-Pro | 53.8%             | 61.5%                     |
| OSWorld-G      | 62.8%             | 68.1%                     |
| ScreenSpot-v2  | 92.1%             | -                         |
| MMBench-GUI-L2 | 79.1%             | -                         |
| UI-Vision      | 60.0%             | -                         |

## Usage

The model requires custom code from the official repository for inference. Please refer to the GitHub repository for installation instructions and example scripts (e.g., `eval/example_inference.py`).
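The plug-and-play zoom-in stage amounts to cropping the screenshot around the first-pass prediction and grounding again at higher effective resolution. A minimal, hypothetical sketch of such a crop computation (the function name and `crop_frac` parameter are assumptions for illustration, not the repository's API):

```python
def zoom_crop(image_size, click, crop_frac=0.5):
    """Compute a zoom-in crop box centered on a first-pass click point.

    image_size: (width, height) of the screenshot in pixels.
    click: (x, y) first-pass prediction in pixels.
    crop_frac: fraction of each dimension kept in the zoomed view.
    Returns (left, top, right, bottom), clamped to the image bounds.
    """
    w, h = image_size
    cw, ch = w * crop_frac, h * crop_frac
    x, y = click
    # Center the crop on the click, then clamp so it stays inside the image.
    left = min(max(x - cw / 2, 0), w - cw)
    top = min(max(y - ch / 2, 0), h - ch)
    return left, top, left + cw, top + ch

# A click near the right edge: the box is shifted left to stay in bounds.
print(zoom_crop((1920, 1080), (1440, 405)))  # → (960, 135, 1920, 675)
```

The cropped region is then passed through the same grounding step, which is why the second stage needs no additional fine-tuning.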

## Citation

```bibtex
@misc{zhou2025guiaimaaligningintrinsicmultimodal,
      title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding},
      author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
      year={2025},
      eprint={2511.00810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.00810},
}
```