Add model card for ModernVBERT
This PR adds a comprehensive model card for the ModernVBERT model, linking it to the paper [ModernVBERT: Towards Smaller Visual Document Retrievers](https://huggingface.co/papers/2510.01149).
It includes the license, `library_name` (transformers), and `pipeline_tag` (visual-document-retrieval) in the metadata for better discoverability and integration on the Hub. The content provides a concise description of the model, an architecture overview, and links to relevant resources including the GitHub repository, Hugging Face organization, blog post, and a Google Colab tutorial for usage.
Please review and merge if everything looks good.
README.md (new file):
---
pipeline_tag: visual-document-retrieval
library_name: transformers
license: apache-2.0
---

# ModernVBERT: Towards Smaller Visual Document Retrievers 👁️

[Paper](https://huggingface.co/papers/2510.01149) | [Models](https://huggingface.co/ModernVBERT) | [Code](https://github.com/illuin-tech/modernvbert) | [Blog](https://huggingface.co/blog/paultltc/modernvbert)

This repository contains **ModernVBERT**, a compact 250M-parameter vision-language encoder designed for efficient visual document retrieval (VDR). As presented in the paper "[ModernVBERT: Towards Smaller Visual Document Retrievers](https://huggingface.co/papers/2510.01149)", the model is built from a principled recipe obtained by revisiting the entire VDR training pipeline. It outperforms models up to 10 times larger while enabling efficient inference on CPU hardware, significantly reducing latency and cost. The key factors studied for their impact on performance are the attention mask, image resolution, the modality-alignment data regime, and a late-interaction-centered contrastive objective.
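The late-interaction objective mentioned above scores a query against a document page by matching each query token embedding to its single most similar page-patch embedding, then summing those maxima (ColBERT-style MaxSim). A minimal NumPy sketch of that scoring rule follows; the function name and the array shapes are illustrative, not the model's actual API or dimensions:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized query token embeddings.
    doc_emb:   (num_doc_patches, dim) L2-normalized document patch embeddings.
    """
    # Cosine similarity between every query token and every document patch.
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_patches)
    # Each query token keeps only its best-matching patch; sum over tokens.
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)

def normed(n: int, d: int) -> np.ndarray:
    # Random embeddings, L2-normalized row-wise so dot products are cosines.
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q = normed(8, 128)                            # 8 query token embeddings
pages = [normed(196, 128) for _ in range(3)]  # 3 pages, 196 patches each
scores = [maxsim_score(q, p) for p in pages]
best = int(np.argmax(scores))                 # index of the top-ranked page
```

Because every embedding is unit-norm, each per-token maximum is at most 1, so a query scored against its own embeddings attains exactly one point per token.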

<div align="center">
  <img src="https://github.com/illuin-tech/modernvbert/raw/main/assets/imgs/architecture.png" alt="ModernVBERT Architecture" width="700">
</div>

## Usage

A detailed tutorial on using and fine-tuning ModernVBERT, including everything required to launch a model post-training, is available as a Google Colab notebook:

[Go to the tutorial](https://colab.research.google.com/drive/1bT5LWeO1gPL83GKUZsFeFEleHmEDEQRy)

## Citation

If you use ModernVBERT in your research, please cite the paper as follows:

```bibtex
@misc{teiletche2025modernvbertsmallervisualdocument,
      title={ModernVBERT: Towards Smaller Visual Document Retrievers},
      author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
      year={2025},
      eprint={2510.01149},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2510.01149},
}
```