Update README.md
README.md
CHANGED
|
@@ -1,10 +1,73 @@
|
|
| 1 |
---
|
| 2 |
title: README
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
colorTo: yellow
|
| 6 |
sdk: static
|
| 7 |
-
pinned:
|
| 8 |
---
|
| 9 |
|
| 10 |
-
Edit this `README.md` markdown file to author your organization card.
|
|
|
|
| 1 |
---
|
| 2 |
title: README
|
| 3 |
+
emoji: 👀
|
| 4 |
+
colorFrom: indigo
|
| 5 |
colorTo: yellow
|
| 6 |
sdk: static
|
| 7 |
+
pinned: true
|
| 8 |
---
|
| 9 |
+
# ColPali: Efficient Document Retrieval with Vision Language Models 👀
|
| 10 |
+
|
| 11 |
+
[](https://arxiv.org/abs/XXX)
|
| 12 |
+
|
| 13 |
+
<img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/T3z7_Biq3oW6b8I9ZwpIa.png" width="800">
|
| 14 |
+
|
| 15 |
+
This organization contains all artifacts released with our preprint [*Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings*](https://arxiv.org/abs/XXX),
|
| 16 |
+
including the [ConTEB](https://huggingface.co/collections/illuin-conteb/conteb-datasets-6839fffd25f1d3685f3ad604) benchmark.
|
| 17 |
+
|
| 18 |
+
### Abstract
|
| 19 |
+
|
| 20 |
+
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations.
|
| 21 |
+
|
| 22 |
+
In this work, we introduce *ConTEB* (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose *InSeNT* (In-sequence Negative Training), a novel contrastive post-training approach which, combined with *late chunking* pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on *ConTEB* without sacrificing base model performance.
|
| 23 |
+
We further find chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes.
|
| 24 |
+
We open-source all artifacts here and at https://github.com/illuin-tech/contextual-embeddings.
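
For readers unfamiliar with *late chunking*, the pooling step mentioned in the abstract can be sketched roughly as follows. This assumes the usual formulation of the technique: the whole document is encoded in a single forward pass, and the resulting contextualized token embeddings are then mean-pooled per chunk. It is an illustration only, not the released implementation; the function name `late_chunk_pool`, the span format, and the toy values are hypothetical, and the encoder itself is not shown.

```python
from typing import List, Sequence, Tuple


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool contextualized token embeddings over each chunk span.

    token_embeddings: one vector per token, assumed to come from encoding
        the full document at once (hypothetical encoder, not shown).
    chunk_spans: (start, end) token indices per chunk, end exclusive.
    """
    pooled = []
    for start, end in chunk_spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        span = token_embeddings[start:end]
        dim = len(span[0])
        # Average each dimension over the tokens that belong to this chunk.
        pooled.append([sum(vec[i] for vec in span) / len(span) for i in range(dim)])
    return pooled


# Toy stand-in for encoder output: 6 tokens, 2 dimensions.
tokens = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
chunks = late_chunk_pool(tokens, [(0, 2), (2, 6)])
print(chunks)
```

Because every token vector was computed with the full document in view, each pooled chunk vector can carry information from outside its own span, which is the property the abstract relies on.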
|
| 25 |
+
|
| 26 |
+
## Models
|
| 27 |
+
|
| 28 |
+
- TODO
|
| 29 |
+
|
| 30 |
+
## Benchmark
|
| 31 |
+
|
| 32 |
+
- [*Leaderboard*]TODO
|
| 33 |
+
-
|
| 34 |
+
## Datasets
|
| 35 |
+
|
| 36 |
+
We organized datasets into collections to constitute our benchmark ViDoRe and its derivatives (OCR and Captioning). Below is a brief description of each of them.
|
| 37 |
+
|
| 38 |
+
- [*ConTEB Benchmark*](TODO)
|
| 39 |
+
-
|
| 40 |
+
## Code
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
CHANGE
|
| 44 |
+
|
| 45 |
+
- [*ColPali Engine*](https://github.com/illuin-tech/colpali): The code used to train and run inference with the ColPali architecture.
|
| 46 |
+
- [*ViDoRe Benchmark*](https://github.com/illuin-tech/vidore-benchmark): A Python package/CLI tool to evaluate document retrieval systems on the ViDoRe benchmark.
|
| 47 |
+
|
| 48 |
+
## Extra
|
| 49 |
+
|
| 50 |
+
- [*Blog*](https://huggingface.co/XXX): TODO
|
| 51 |
+
- [*Preprint*](https://huggingface.co/XXX): The paper with all details!
|
| 52 |
+
|
| 53 |
+
## Contact
|
| 54 |
+
|
| 55 |
+
- Manuel Faysse: manuel.faysse@illuin.tech
|
| 56 |
+
- Max Conti: max.conti@illuin.tech
|
| 57 |
+
|
| 58 |
+
## Citation
|
| 59 |
+
|
| 60 |
+
If you use any datasets or models from this organization in your research, please cite the original dataset as follows:
|
| 61 |
+
|
| 62 |
+
```latex
|
| 63 |
+
@misc{
|
| 64 |
+
}
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
## Acknowledgments
|
| 68 |
+
|
| 69 |
+
This work is partially supported by [ILLUIN Technology](https://www.illuin.tech/), and by a grant from ANRT France.
|
| 70 |
+
This work was performed using HPC resources from the Jean Zay supercomputer with grant XXX.
|
| 71 |
+
TODO.
|
| 72 |
+
|
| 73 |
|
|
|