---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision is an efficient Vision-Language Model (VLM) paradigm designed to achieve adaptive visual token acquisition through a coarse-to-fine approach. Inspired by human active vision mechanisms, this model addresses the significant computational overhead in VLMs by autonomously determining the minimum number of visual tokens required for each sample. It selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary.
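The coarse-to-fine loop described above can be sketched as follows. This is an illustrative stand-in, not the official API: `encode_image`, `model_step`, and `crop` are hypothetical stubs standing in for the visual encoder, the VLM decoding step, and the bounding-box tool, respectively.

```python
# Hypothetical sketch of AdaptVision's coarse-to-fine acquisition loop.
# All function names below are illustrative stubs, not the official API.

def encode_image(image, num_tokens):
    """Stub visual encoder: returns `num_tokens` placeholder visual tokens."""
    return [f"vtok_{i}" for i in range(num_tokens)]

def model_step(visual_tokens, question):
    """Stub VLM step: either answers or requests a crop via the bbox tool.
    Here we pretend the model asks for one crop when tokens are too coarse."""
    if len(visual_tokens) <= 64:
        return {"tool_call": {"bbox": (0.2, 0.2, 0.8, 0.8)}}
    return {"answer": "a cat"}

def crop(image, bbox):
    """Stub crop: returns the key region selected by the bbox tool."""
    return ("crop_of", image, bbox)

def adaptive_answer(image, question, coarse_tokens=64, fine_tokens=256, max_rounds=3):
    # Start with a coarse (cheap) encoding of the full image.
    tokens = encode_image(image, coarse_tokens)
    for _ in range(max_rounds):
        out = model_step(tokens, question)
        if "answer" in out:
            # Answered without acquiring more tokens than needed.
            return out["answer"], len(tokens)
        # Model invoked the bbox tool: crop the key region and
        # append fine-grained tokens for it.
        region = crop(image, out["tool_call"]["bbox"])
        tokens += encode_image(region, fine_tokens)
    return None, len(tokens)

answer, tokens_used = adaptive_answer("image.png", "What animal is this?")
```

In this toy run the model answers after one tool call, so the total visual-token budget is the coarse pass plus one fine-grained crop; easy samples that answer on the first pass would use only the coarse budget.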

The model was presented in the paper:
[AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition](https://arxiv.org/abs/2512.03794)

For more details, please visit the [project page](https://adaptvision.github.io/).
The official code can be found on the [GitHub repository](https://github.com/AdaptVision/AdaptVision).

## Citation

If you find this project useful in your research, please consider citing:

```bibtex
@article{lin2025adapt,
  title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
  author={Zichuan Lin and Yicheng Liu and Yang Yang and Lvfang Tao and Deheng Ye},
  journal={arXiv preprint arXiv:2512.03794},
  year={2025}
}
```