HugSib committed · Commit 470273b · verified · 1 Parent(s): 3eda464

Update README.md

Files changed (1): README.md (+46 −8)

README.md CHANGED

This Organisation contains all artefacts released with the paper [ColPali: Efficient Document Retrieval with Vision Language Models]() [TODO add link], including the [ViDoRe](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) benchmark and our SOTA document retrieval model [*ColPali*](https://huggingface.co/vidore/colpali).

A repository with **evaluation** scripts can be found on [GitHub](https://github.com/tonywu71/vidore-benchmark).

A repository with **training** scripts can be found on [GitHub](https://github.com/ManuelFay/colpali).

### Abstract
 
[...] The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, *ColPali*, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
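
As a quick illustration of that late interaction step, the sketch below scores a query against a page by comparing every query token embedding with every page patch embedding, keeping the best match per token, and summing (the ColBERT-style MaxSim operator). Dimensions and normalisation here are assumptions for the toy example, not the exact ColPali implementation:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: best-matching page patch per query token, summed.

    query_emb: (num_query_tokens, dim); page_emb: (num_patches, dim).
    Embeddings are assumed L2-normalised so dot products are cosine similarities.
    """
    sim = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()    # keep the best patch per token, then sum

# Toy example: random vectors stand in for real model outputs
query = torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
page = torch.nn.functional.normalize(torch.randn(1030, 128), dim=-1)
print(late_interaction_score(query, page))
```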

## Organisation

### Datasets

We organized the datasets into collections that constitute our ViDoRe benchmark and its derivatives (OCR and Captioning). Below is a brief description of each collection.
- [*ViDoRe Benchmark*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d): a collection regrouping all datasets constituting the ViDoRe benchmark. It includes the test sets from academic datasets ([ArXiVQA](https://huggingface.co/datasets/vidore/arxivqa_test_subsampled), [DocVQA](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), [InfoVQA](https://huggingface.co/datasets/vidore/infovqa_test_subsampled), [TATDQA](https://huggingface.co/datasets/vidore/tatdqa_test), [TabFQuAD](https://huggingface.co/datasets/vidore/tabfquad_test_subsampled)) and from synthetically generated datasets spanning various themes and industrial applications ([Artificial Intelligence](https://huggingface.co/datasets/vidore/syntheticDocQA_artificial_intelligence_test), [Government Reports](https://huggingface.co/datasets/vidore/syntheticDocQA_government_reports_test), [Healthcare Industry](https://huggingface.co/datasets/vidore/syntheticDocQA_healthcare_industry_test), [Energy](https://huggingface.co/datasets/vidore/syntheticDocQA_energy_test) and [Shift Project](https://huggingface.co/datasets/vidore/shiftproject_test)). Further details can be found in the corresponding dataset cards.
- [*OCR Baseline*](https://huggingface.co/collections/vidore/vidore-chunk-ocr-baseline-666acce88c294ef415548a56): the original ViDoRe benchmark was passed to Unstructured to partition each page into chunks. Visual chunks are OCRized with Tesseract (see the sketch after this list).
- [*Captioning Baseline*](https://huggingface.co/collections/vidore/vidore-captioning-baseline-6658a2a62d857c7a345195fd): the original ViDoRe benchmark was passed to Unstructured to partition each page into chunks. Visual chunks are captioned using Claude Sonnet.
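
For the OCR baseline above, the gist of the visual-chunk step can be sketched as follows. This is a rough illustration using `pytesseract`; the actual pipeline, Unstructured partitioning parameters, and any post-processing are not specified in this README, and the file name is hypothetical:

```python
import pytesseract
from PIL import Image

# Hypothetical visual chunk previously extracted from a page by Unstructured
chunk_image = Image.open("visual_chunk.png")

# OCR the chunk with Tesseract, as done for the OCR baseline
chunk_text = pytesseract.image_to_string(chunk_image)
print(chunk_text)
```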

**Intended use**

You can load a specific dataset using the standard `load_dataset` function from the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Load a single ViDoRe task by its dataset id, e.g. DocVQA
dataset = load_dataset("vidore/docvqa_test_subsampled")
```

To use the whole benchmark, you can list the datasets in the collection and load each of them with the following snippet:

```python
from datasets import load_dataset
import huggingface_hub

# Fetch the collection via its Hugging Face slug
collection = huggingface_hub.get_collection("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")

# List the datasets in the collection and load them one by one
datasets = []
for dataset_item in collection.items:
    print(f"Loading {dataset_item.item_id}")
    dataset = load_dataset(dataset_item.item_id)
    datasets.append(dataset)
```

### Models [add description of released model]

- [*ColPali*](https://huggingface.co/vidore/colpali): TODO
- [*BiPali*](https://huggingface.co/vidore/bipali): TODO
- [*BiSigLip*](https://huggingface.co/vidore/bisiglip): TODO
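
The model descriptions are still TODO. In the meantime, here is a loading sketch assuming the `colpali_engine` package from the training repository linked above; the import path, class names, and dtype are assumptions, so check that repository and the model cards for the actual API:

```python
import torch
from colpali_engine.models import ColPali, ColPaliProcessor  # assumed import path

# Hypothetical usage; see the training repository for the supported API
model = ColPali.from_pretrained("vidore/colpali", torch_dtype=torch.bfloat16)
processor = ColPaliProcessor.from_pretrained("vidore/colpali")
```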

## Authorship + Citation

TODO: Contact

If you use any datasets or models from this organisation in your research, please cite us.

**BibTeX Citation**
```latex
[include BibTeX]
```