This Organisation contains all artefacts released with the paper [*ColPali: Efficient Document Retrieval with Vision Language Models*]() [TODO add link], including the [ViDoRe](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) benchmark and our SOTA document retrieval model [*ColPali*](https://huggingface.co/vidore/colpali).

A repository with **evaluation** scripts can be found [on GitHub](https://github.com/tonywu71/vidore-benchmark).

A repository with **training** scripts can be found [on GitHub](https://github.com/ManuelFay/colpali).

### Abstract
Combined with a late interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
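For intuition, the late interaction ("MaxSim") matching mechanism mentioned above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the released implementation; the tensor names, shapes, and normalisation are our assumptions:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token embedding, keep the
    similarity to its best-matching document patch embedding, then sum
    over query tokens.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_patches, dim);
    both are assumed L2-normalised.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_patches)
    return sim.max(dim=1).values.sum()     # sum of per-query-token maxima
```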
## Organisation
### Datasets
We organized datasets into collections that constitute our benchmark ViDoRe and its derivatives (OCR and Captioning). Below is a brief description of each of them.
- [*ViDoRe Benchmark*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d): collection grouping all datasets that constitute the ViDoRe benchmark. It includes the test sets from academic datasets ([ArXiVQA](https://huggingface.co/datasets/vidore/arxivqa_test_subsampled), [DocVQA](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), [InfoVQA](https://huggingface.co/datasets/vidore/infovqa_test_subsampled), [TATDQA](https://huggingface.co/datasets/vidore/tatdqa_test), [TabFQuAD](https://huggingface.co/datasets/vidore/tabfquad_test_subsampled)) and from synthetically generated datasets spanning various themes and industrial applications ([Artificial Intelligence](https://huggingface.co/datasets/vidore/syntheticDocQA_artificial_intelligence_test), [Government Reports](https://huggingface.co/datasets/vidore/syntheticDocQA_government_reports_test), [Healthcare Industry](https://huggingface.co/datasets/vidore/syntheticDocQA_healthcare_industry_test), [Energy](https://huggingface.co/datasets/vidore/syntheticDocQA_energy_test) and [Shift Project](https://huggingface.co/datasets/vidore/shiftproject_test)). Further details can be found in the corresponding dataset cards.
- [*OCR Baseline*](https://huggingface.co/collections/vidore/vidore-chunk-ocr-baseline-666acce88c294ef415548a56): the original ViDoRe benchmark was passed through Unstructured to partition each page into chunks; visual chunks were then OCRized with Tesseract.
- [*Captioning Baseline*](https://huggingface.co/collections/vidore/vidore-captioning-baseline-6658a2a62d857c7a345195fd): the original ViDoRe benchmark was passed through Unstructured to partition each page into chunks; visual chunks were then captioned using Claude Sonnet.

**Intended use**

You can either load a specific dataset using the standard `load_dataset` function from the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Load a single dataset from the ViDoRe collection by its Hub ID
dataset = load_dataset("vidore/docvqa_test_subsampled")
```

To use the whole benchmark, you can list the datasets in the collection and load each one using the following snippet:
```python
import huggingface_hub
from datasets import load_dataset

# Fetch the collection using its Hugging Face slug
collection = huggingface_hub.get_collection("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")

# List the datasets in the collection and load each of them
datasets = []
for dataset_item in collection.items:
    print(f"Loading {dataset_item.item_id}")
    datasets.append(load_dataset(dataset_item.item_id))
```
### Models [add description of released model]
- [*ColPali*](https://huggingface.co/vidore/colpali): TODO
- [*BiPali*](https://huggingface.co/vidore/bipali): TODO
- [*BiSigLip*](https://huggingface.co/vidore/bisiglip): TODO
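
As an illustration, here is a minimal retrieval sketch with the released ColPali checkpoint, assuming the `colpali_engine` package from the training repository above. The class names, processor helpers, and `score_multi_vector` utility are assumptions; see that repository for the actual entry points:

```python
import torch
from PIL import Image

# Assumed import paths from the training repository's colpali_engine package
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali")

# Embed a page image and a query into multi-vector representations
image_batch = processor.process_images([Image.open("page.png")])
query_batch = processor.process_queries(["What is the total revenue?"])
with torch.no_grad():
    image_embeddings = model(**image_batch)
    query_embeddings = model(**query_batch)

# Late interaction (MaxSim) scoring between queries and pages
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```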
## Authorship + Citation

TODO: Contact

If you use any datasets or models from this organisation in your research, please cite our work as follows:

**BibTeX Citation**
```latex
[include BibTeX]
```