---
title: README
emoji: 👀
colorFrom: indigo
colorTo: red
sdk: static
pinned: true
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/x99xqw__fl2UaPbiIdC_f.png" width="180" style="display: block; margin-left: auto; margin-right: auto;" />

<h1 align="center">ViDoRe: Visual Document Retrieval 👀</h1>
> [!IMPORTANT]
> **ViDoRe V3: our new benchmark release!**
>
> Developed by ILLUIN Technology with contributions from NVIDIA, it is the most diverse visual document retrieval benchmark to date for **enterprise applications**.
> It includes **10 datasets**, **26,000+ pages**, and **3,000+ queries** in **6 languages**.
> Check out [the paper](https://arxiv.org/pdf/2601.08620) or [the blogpost](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3)!
This organization contains the models, datasets, benchmarks, and code released as part of the ViDoRe project by ILLUIN Technology.
- **Leaderboard:**
  - [ViDoRe Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard)
- **Benchmarks:**
  - ViDoRe V1 ([paper](https://arxiv.org/abs/2407.01449), [blogpost](https://huggingface.co/blog/manu/colpali), [dataset collection](https://huggingface.co/collections/vidore/vidore-benchmark))
  - ViDoRe V2 ([paper](https://arxiv.org/abs/2505.17166), [blogpost](https://huggingface.co/blog/manu/vidore-v2), [dataset collection](https://huggingface.co/collections/vidore/vidore-benchmark-v2))
  - ViDoRe V3 ([paper](https://arxiv.org/abs/2601.08620), [blogpost](https://huggingface.co/blog/QuentinJG/introducing-vidore-v3), [dataset collection](https://hf.co/collections/vidore/vidore-benchmark-v3))
- **Models:**
  - ColPali ([latest: v1.3](https://huggingface.co/vidore/colpali-v1.3))
  - ColQwen2 ([latest: v1.0](https://huggingface.co/vidore/colqwen2-v1.0))
  - ColQwen2.5 ([latest: v0.2](https://huggingface.co/vidore/colqwen2.5-v0.2))
  - ColSmol ([256M](https://huggingface.co/vidore/colSmol-256M) & [500M](https://huggingface.co/vidore/colSmol-500M))
  - ModernVBERT ([latest: v1.0](https://huggingface.co/ModernVBERT))
---
# 👷‍♂️ ViDoRe V3: A comprehensive evaluation of retrieval for enterprise use cases

[Paper](https://arxiv.org/abs/2601.08620)

<img src="https://cdn-uploads.huggingface.co/production/uploads/66e16a677c2eb2da5109fb5c/-zqFfhdtsC1VzQH-rLkLa.png" width="1300" style="display: block; margin-left: auto; margin-right: auto;" />
ILLUIN Technology is proud to release the **ViDoRe V3 benchmark**, designed and developed with contributions from NVIDIA. ViDoRe V3 is our latest benchmark, engineered to set a new gold standard for multimodal, enterprise document retrieval evaluation. It addresses a critical challenge in production RAG systems: retrieving accurate information from complex, visually rich documents.

ViDoRe V3 improves on existing RAG benchmarks by prioritizing enterprise relevance and rigorous data quality. Instead of relying on clean academic texts, the benchmark draws from 10 challenging, real-world datasets spanning diverse industrial domains, with 8 publicly released and 2 kept private. In addition, while previous benchmarks often rely on synthetically generated data, ViDoRe V3 features human-created and human-verified annotations.

The benchmark contains 26,000+ pages and 3,099 queries translated into 6 languages. Each query is linked to retrieval ground truth created and verified by human annotators: relevant pages, precise bounding-box annotations for key elements, and a comprehensive reference answer.
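To illustrate how page-level relevance ground truth of this kind can be scored, here is a minimal nDCG@k sketch with binary relevance. The function name and example page IDs are hypothetical; the benchmark's official evaluation protocol is described in the paper.

```python
import math

def ndcg_at_k(ranked_pages, relevant_pages, k=5):
    """nDCG@k for one query, with binary page-level relevance labels."""
    gains = [1.0 if p in relevant_pages else 0.0 for p in ranked_pages[:k]]
    # Discounted cumulative gain of the system ranking.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: all relevant pages ranked first.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_pages), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical example: a retriever returns pages [3, 1, 7, 2, 9] for a
# query whose annotated relevant pages are {3, 7}.
score = ndcg_at_k([3, 1, 7, 2, 9], {3, 7}, k=5)
```

A perfect ranking (all relevant pages first) yields a score of 1.0.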
---
# 👀 ColPali: Efficient Document Retrieval with Vision Language Models

[Paper](https://arxiv.org/abs/2407.01449)

<img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/T3z7_Biq3oW6b8I9ZwpIa.png" width="800" style="display: block; margin-left: auto; margin-right: auto;" />
Documents are visually rich structures that convey information not only through text, but also through tables, figures, page layouts, and fonts.
While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval-Augmented Generation (RAG).
To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark, *ViDoRe*, composed of various page-level retrieval tasks spanning multiple domains, languages, and settings.
The inherent shortcomings of modern systems motivate a new retrieval model architecture, *ColPali*, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages.
Combined with a late-interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
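The late-interaction (MaxSim) matching used by ColPali can be sketched as follows: each query-token embedding is matched against every page-patch embedding, and the per-token maxima are summed into a page score. The shapes and toy data below are illustrative, not the actual model embeddings.

```python
import numpy as np

def late_interaction_score(query_emb, page_emb):
    """MaxSim late interaction: for each query-token embedding, take the
    maximum dot-product similarity over all page-patch embeddings, then sum."""
    # query_emb: (n_query_tokens, dim); page_emb: (n_patches, dim)
    sims = query_emb @ page_emb.T        # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum()) # best match per token, summed

# Toy example: 2 query tokens and a handful of candidate pages in 4 dims.
rng = np.random.default_rng(0)
query = rng.normal(size=(2, 4))
pages = [rng.normal(size=(3, 4)) for _ in range(5)]
scores = [late_interaction_score(query, p) for p in pages]
best_page = int(np.argmax(scores))  # index of the best-matching page
```

Because each page's embeddings are indexed independently and only the lightweight MaxSim operation runs at query time, this scoring is fast despite being token-level.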
---

## Contact

- Quentin Macé: `quentin.mace@illuin.tech`

---

## Citation
If you use any datasets or models from this organization in your research, please cite the corresponding papers as follows:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}

@misc{macé2025vidorebenchmarkv2raising,
  title={ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval},
  author={Quentin Macé and António Loison and Manuel Faysse},
  year={2025},
  eprint={2505.17166},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.17166},
}

@misc{loison2026vidorev3comprehensiveevaluation,
  title={ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios},
  author={António Loison and Quentin Macé and Antoine Edy and Victor Xing and Tom Balough and Gabriel Moreira and Bo Liu and Manuel Faysse and Céline Hudelot and Gautier Viaud},
  year={2026},
  eprint={2601.08620},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.08620},
}
```
---

## Acknowledgments

This work is partially supported by [ILLUIN Technology](https://www.illuin.tech/).