nielsr (HF Staff) committed
Commit 6f60d3d · verified · 1 Parent(s): f3a98cb

Improve model card: Add metadata, structured paper link, project page, and code links


This PR significantly improves the model card for AdaptVision by enriching its content and metadata.

Key changes include:
- Adding `pipeline_tag: image-text-to-text` to the metadata for better discoverability and categorization on the Hub, reflecting the model's Vision-Language capabilities.
- Adding `library_name: transformers` to the metadata, as evidence from `config.json` (`"architectures": ["Qwen2_5_VLForConditionalGeneration"]`) and the GitHub README (`pip install transformers==4.51.0`) indicates compatibility with the Transformers library. This enables the automated "how to use" widget on the model page (a minimal loading sketch follows this list).
- Enhancing the existing arXiv link by formatting it as `[AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition](https://arxiv.org/abs/2512.03794)`.
- Adding direct links to the official [project page](https://adaptvision.github.io/) and the [GitHub repository](https://github.com/AdaptVision/AdaptVision) for further details and code access.
- Including a concise description of the AdaptVision model, derived from the paper's abstract.
- Adding the BibTeX citation for proper academic referencing.
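
For reference, here is a minimal sketch of the kind of Transformers-based loading that this metadata enables. The repo id `AdaptVision/AdaptVision` is a hypothetical placeholder (the actual Hub checkpoint path is not stated in this PR), and the snippet assumes the stock Qwen2.5-VL classes that `config.json` declares:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Hypothetical repo id -- substitute the actual AdaptVision checkpoint on the Hub.
model_id = "AdaptVision/AdaptVision"

# config.json declares Qwen2_5_VLForConditionalGeneration, so the checkpoint
# should load with the stock Qwen2.5-VL classes (transformers 4.51.0 per the README).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint
    device_map="auto",   # automatic device placement; requires `accelerate`
)
processor = AutoProcessor.from_pretrained(model_id)

# One image plus one question, formatted with the processor's chat template.
image = Image.open(requests.get("https://example.com/demo.jpg", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```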

These updates aim to provide users with a more comprehensive and well-structured overview of the AdaptVision model.

Files changed (1): README.md (+25, -1)
README.md CHANGED

@@ -1,4 +1,28 @@
  ---
  license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
- arxiv.org/abs/2512.03794
+
+ # AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
+
+ AdaptVision is an efficient Vision-Language Model (VLM) paradigm designed to achieve adaptive visual token acquisition through a coarse-to-fine approach. Inspired by human active vision mechanisms, this model addresses the significant computational overhead in VLMs by autonomously determining the minimum number of visual tokens required for each sample. It selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary.
+
+ The model was presented in the paper:
+ [AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition](https://arxiv.org/abs/2512.03794)
+
+ For more details, please visit the [project page](https://adaptvision.github.io/).
+ The official code can be found on the [GitHub repository](https://github.com/AdaptVision/AdaptVision).
+
+ ## Citation
+
+ If you find this project useful in your research, please consider citing:
+
+ ```bibtex
+ @article{lin2025adapt,
+   title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
+   author={Zichuan Lin and Yicheng Liu and Yang Yang and Lvfang Tao and Deheng Ye},
+   journal={arXiv preprint arXiv:2512.03794},
+   year={2025}
+ }
+ ```