manu committed on
Commit 7f21553 · verified · 1 Parent(s): dfc3463

Update README.md

Files changed (1): README.md +67 -4

README.md CHANGED
@@ -1,10 +1,73 @@
 ---
 title: README
- emoji: 🐨
- colorFrom: purple
 colorTo: yellow
 sdk: static
- pinned: false
 ---
- Edit this `README.md` markdown file to author your organization card.
 ---
 title: README
+ emoji: 👀
+ colorFrom: indigo
 colorTo: yellow
 sdk: static
+ pinned: true
 ---
+ # ColPali: Efficient Document Retrieval with Vision Language Models 👀
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2407.01449-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/XXX)
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/60f2e021adf471cbdf8bb660/T3z7_Biq3oW6b8I9ZwpIa.png" width="800">
+
+ This organization contains all artifacts released with our preprint [*Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings*](https://arxiv.org/abs/XXX),
+ including the [ConTEB](https://huggingface.co/collections/illuin-conteb/conteb-datasets-6839fffd25f1d3685f3ad604) benchmark.
+
+ ### Abstract
+
+ A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations.
+
+ In this work, we introduce *ConTEB* (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose *InSeNT* (In-sequence Negative Training), a novel contrastive post-training approach which, combined with *late chunking* pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on *ConTEB* without sacrificing base model performance.
+ We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes.
+ We open-source all artifacts here and at https://github.com/illuin-tech/contextual-embeddings.
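The *late chunking* pooling mentioned in the abstract can be illustrated with a minimal sketch (the helper name and toy data below are hypothetical, not the actual InSeNT training code): the full document is encoded once, and each chunk embedding is then mean-pooled from the contextualized token embeddings, so every chunk representation carries document-wide context.

```python
# Minimal sketch of late-chunking pooling, assuming a full-document
# encoder has already produced contextualized token embeddings.
import numpy as np

def late_chunk_pool(token_embeddings: np.ndarray,
                    chunk_spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings over each chunk span.

    token_embeddings: (num_tokens, dim) output of a full-document encoder.
    chunk_spans: [(start, end), ...] token offsets delimiting each chunk.
    """
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in chunk_spans])

# Toy example: 6 "token embeddings" of dimension 2, split into two chunks.
tokens = np.array([[1., 0.], [3., 0.], [0., 2.],
                   [0., 4.], [5., 5.], [7., 7.]])
chunks = late_chunk_pool(tokens, [(0, 3), (3, 6)])
print(chunks.shape)  # (2, 2): one pooled embedding per chunk
```

Because pooling happens after encoding the whole document, each chunk vector is a function of all tokens in the document, unlike the independent chunk-by-chunk encoding criticized above.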
+
+ ## Models
+
+ - TODO
+
+ ## Benchmark
+
+ - [*Leaderboard*](TODO)
+
+ ## Datasets
+
+ We organized the datasets into collections that constitute our benchmark ViDoRe and its derivatives (OCR and Captioning). Below is a brief description of each of them.
+
+ - [*ConTEB Benchmark*](TODO)
+
+ ## Code
+
+ CHANGE
+
+ - [*ColPali Engine*](https://github.com/illuin-tech/colpali): The code used to train and run inference with the ColPali architecture.
+ - [*ViDoRe Benchmark*](https://github.com/illuin-tech/vidore-benchmark): A Python package/CLI tool to evaluate document retrieval systems on the ViDoRe benchmark.
+
+ ## Extra
+
+ - [*Blog*](https://huggingface.co/XXX): TODO
+ - [*Preprint*](https://huggingface.co/XXX): The paper with all the details!
+
+ ## Contact
+
+ - Manuel Faysse: manuel.faysse@illuin.tech
+ - Max Conti: max.conti@illuin.tech
+
+ ## Citation
+
+ If you use any datasets or models from this organization in your research, please cite the original work as follows:
+
+ ```latex
+ @misc{
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is partially supported by [ILLUIN Technology](https://www.illuin.tech/), and by a grant from ANRT France.
+ This work was performed using HPC resources from the Jean Zay supercomputer under grant XXX.
+ TODO.
73