manu committed on
Commit 7f21553 · verified · 1 Parent(s): dfc3463

Update README.md

Files changed (1): README.md +67 -4

README.md CHANGED
@@ -1,10 +1,73 @@
 ---
 title: README
- emoji: 🐨
- colorFrom: purple
 colorTo: yellow
 sdk: static
- pinned: false
 ---
- Edit this `README.md` markdown file to author your organization card.
 ---
 title: README
+ emoji: 👀
+ colorFrom: indigo
 colorTo: yellow
 sdk: static
+ pinned: true
 ---
+ # ColPali: Efficient Document Retrieval with Vision Language Models 👀
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2407.01449-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/XXX)
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/T3z7_Biq3oW6b8I9ZwpIa.png" width="800">
+
+ This organization contains all artifacts released with our preprint [*Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings*](https://arxiv.org/abs/XXX),
+ including the [ConTEB](https://huggingface.co/collections/illuin-conteb/conteb-datasets-6839fffd25f1d3685f3ad604) benchmark.
+
+ ### Abstract
+
+ A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations.
+
+ In this work, we introduce *ConTEB* (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose *InSeNT* (In-sequence Negative Training), a novel contrastive post-training approach which, combined with *late chunking* pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on *ConTEB* without sacrificing base model performance.
+ We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes.
+ We open-source all artifacts here and at https://github.com/illuin-tech/contextual-embeddings.
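The *late chunking* pooling mentioned in the abstract can be illustrated with a minimal sketch (the helper name and toy data below are hypothetical, not the actual InSeNT training code): the full document is encoded once, and each chunk embedding is then mean-pooled from the contextualized token embeddings, so every chunk representation carries document-wide context.

```python
# Minimal sketch of late-chunking pooling, assuming a full-document
# encoder has already produced contextualized token embeddings.
import numpy as np

def late_chunk_pool(token_embeddings: np.ndarray,
                    chunk_spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk span.

    token_embeddings: (num_tokens, dim) output of a full-document encoder.
    chunk_spans: [(start, end), ...] token offsets delimiting each chunk.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in chunk_spans])

# Toy example: 6 "token embeddings" of dimension 2, split into two chunks.
tokens = np.array([[1., 0.], [3., 0.], [0., 2.],
                   [0., 4.], [5., 5.], [7., 7.]])
chunks = late_chunk_pool(tokens, [(0, 3), (3, 6)])
print(chunks.shape)  # (2, 2): one pooled embedding per chunk
```

Because pooling happens after encoding the whole document, each chunk vector is a function of all tokens in the document, unlike the independent chunk-by-chunk encoding criticized above.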
+
+ ## Models
+
+ - TODO
+
+ ## Benchmark
+
+ - [*Leaderboard*](TODO)
+
+ ## Datasets
+
+ We organized the datasets into collections that constitute our benchmark ViDoRe and its derivatives (OCR and Captioning). Below is a brief description of each of them.
+
+ - [*ConTEB Benchmark*](TODO)
+
+ ## Code
+
+ CHANGE
+
+ - [*ColPali Engine*](https://github.com/illuin-tech/colpali): The code used to train and run inference with the ColPali architecture.
+ - [*ViDoRe Benchmark*](https://github.com/illuin-tech/vidore-benchmark): A Python package/CLI tool to evaluate document retrieval systems on the ViDoRe benchmark.
+
+ ## Extra
+
+ - [*Blog*](https://huggingface.co/XXX): TODO
+ - [*Preprint*](https://huggingface.co/XXX): The paper with all the details!
+
+ ## Contact
+
+ - Manuel Faysse: manuel.faysse@illuin.tech
+ - Max Conti: max.conti@illuin.tech
+
+ ## Citation
+
+ If you use any datasets or models from this organization in your research, please cite the original work as follows:
+
+ ```latex
+ @misc{
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is partially supported by [ILLUIN Technology](https://www.illuin.tech/), and by a grant from ANRT France.
+ This work was performed using HPC resources from the Jean Zay supercomputer under grant XXX.
+ TODO.
73