---
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen2.5-VL-3B-Instruct
---

# GUI-AIMA-3B
GUI-AIMA (Aligning Intrinsic Multimodal Attention) is an attention-based, coordinate-free supervised fine-tuning framework for efficient GUI grounding, introduced in the paper [GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding](https://arxiv.org/abs/2511.00810).

The model aligns the intrinsic multimodal attention of multimodal large language models (MLLMs) with patch-wise grounding signals. This approach is highly data-efficient and supports a plug-and-play zoom-in stage for high-resolution grounding without additional fine-tuning.
## Model Details
- Architecture: Based on Qwen2.5-VL-3B-Instruct with a context anchor mechanism for attention-based grounding.
- Task: GUI Grounding (mapping instructions to actionable screen regions).
- Paper: [GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding](https://arxiv.org/abs/2511.00810)
- GitHub Repository: [sjz5202/GUI-AIMA](https://github.com/sjz5202/GUI-AIMA)
## Performance
GUI-AIMA-3B achieves state-of-the-art performance among 3B-parameter models on several GUI grounding benchmarks:
| Benchmark | Accuracy (1-step) | Accuracy (2-step zoom-in) |
|---|---|---|
| ScreenSpot-Pro | 53.8% | 61.5% |
| OSWorld-G | 62.8% | 68.1% |
| ScreenSpot-v2 | 92.1% | - |
| MMBench-GUI-L2 | 79.1% | - |
| UI-Vision | 60.0% | - |
## Usage
Inference requires custom code from the official repository. Please refer to the [GitHub repository](https://github.com/sjz5202/GUI-AIMA) for installation instructions and example scripts (e.g., `eval/example_inference.py`).
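As a rough orientation only, the sketch below shows the generic Qwen2.5-VL-style loading path (commented out, since GUI-AIMA's attention-alignment inference lives in the official repo's custom code) together with a small hypothetical helper for pulling a predicted click point out of the model's text output. The output format and the `parse_click` helper are assumptions for illustration, not part of the official API.

```python
# Hypothetical sketch: GUI-AIMA-3B requires custom inference code from the
# official repository; this only illustrates the surrounding plumbing.
import re

def parse_click(text: str):
    """Extract the first (x, y) coordinate pair from model output text.

    Assumes the model emits a screen point such as '(652, 410)'; returns
    None if no coordinate pair is found. This helper is illustrative, not
    part of the official GUI-AIMA API.
    """
    m = re.search(r"\((\d+),\s*(\d+)\)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None

# Generic Qwen2.5-VL loading via transformers (the official repo wraps this
# with GUI-AIMA's attention-based grounding logic; left commented so the
# snippet runs without downloading weights):
#
# from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "<this-model-id>", torch_dtype="auto", device_map="auto")
# processor = AutoProcessor.from_pretrained("<this-model-id>")

print(parse_click("click(652, 410)"))  # -> (652, 410)
```

For actual grounding results, use the repository's `eval/example_inference.py`, which implements the attention-alignment and optional zoom-in stages described above.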
## Citation
```bibtex
@misc{zhou2025guiaimaaligningintrinsicmultimodal,
      title={GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding},
      author={Shijie Zhou and Viet Dac Lai and Hao Tan and Jihyung Kil and Wanrong Zhu and Changyou Chen and Ruiyi Zhang},
      year={2025},
      eprint={2511.00810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.00810},
}
```