This Organisation contains all artefacts released with the paper [*ColPali: Efficient Document Retrieval with Vision Language Models*]() [TODO add link], including the [ViDoRe](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) benchmark and our SOTA document retrieval model [*ColPali*](https://huggingface.co/vidore/colpali).

A repository with **evaluation** scripts can be found [on GitHub](https://github.com/tonywu71/vidore-benchmark).

A repository with **training** scripts can be found [on GitHub](https://github.com/ManuelFay/colpali).

### Abstract
Combined with a late interaction matching mechanism, *ColPali* largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable.
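For intuition, the late interaction ("MaxSim") matching mechanism mentioned above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the released implementation; the tensor names, shapes, and normalisation are our assumptions:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction: for each query token embedding, keep the
    similarity to its best-matching document patch embedding, then sum
    over query tokens.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_patches, dim);
    both are assumed L2-normalised.
    """
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_patches)
    return sim.max(dim=1).values.sum()     # sum of per-query-token maxima
```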
## Organisation
### Datasets
We organized datasets into collections that constitute our benchmark ViDoRe and its derivatives (OCR and Captioning). Below is a brief description of each of them.
- [*ViDoRe Benchmark*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d): collection grouping all datasets that constitute the ViDoRe benchmark. It includes the test sets from academic datasets ([ArXiVQA](https://huggingface.co/datasets/vidore/arxivqa_test_subsampled), [DocVQA](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), [InfoVQA](https://huggingface.co/datasets/vidore/infovqa_test_subsampled), [TATDQA](https://huggingface.co/datasets/vidore/tatdqa_test), [TabFQuAD](https://huggingface.co/datasets/vidore/tabfquad_test_subsampled)) and from synthetically generated datasets spanning various themes and industrial applications ([Artificial Intelligence](https://huggingface.co/datasets/vidore/syntheticDocQA_artificial_intelligence_test), [Government Reports](https://huggingface.co/datasets/vidore/syntheticDocQA_government_reports_test), [Healthcare Industry](https://huggingface.co/datasets/vidore/syntheticDocQA_healthcare_industry_test), [Energy](https://huggingface.co/datasets/vidore/syntheticDocQA_energy_test) and [Shift Project](https://huggingface.co/datasets/vidore/shiftproject_test)). Further details can be found in the corresponding dataset cards.
- [*OCR Baseline*](https://huggingface.co/collections/vidore/vidore-chunk-ocr-baseline-666acce88c294ef415548a56): the original ViDoRe benchmark was passed through Unstructured to partition each page into chunks; visual chunks were then OCRized with Tesseract.
- [*Captioning Baseline*](https://huggingface.co/collections/vidore/vidore-captioning-baseline-6658a2a62d857c7a345195fd): the original ViDoRe benchmark was passed through Unstructured to partition each page into chunks; visual chunks were then captioned using Claude Sonnet.

**Intended use**

You can either load a specific dataset using the standard `load_dataset` function from the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Load a single dataset from the ViDoRe collection by its Hub ID
dataset = load_dataset("vidore/docvqa_test_subsampled")
```

To use the whole benchmark, you can list the datasets in the collection and load each one using the following snippet:
```python
import huggingface_hub
from datasets import load_dataset

# Fetch the collection using its Hugging Face slug
collection = huggingface_hub.get_collection("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")

# List the datasets in the collection and load each of them
datasets = []
for dataset_item in collection.items:
    print(f"Loading {dataset_item.item_id}")
    datasets.append(load_dataset(dataset_item.item_id))
```
### Models [add description of released model]
- [*ColPali*](https://huggingface.co/vidore/colpali): TODO
- [*BiPali*](https://huggingface.co/vidore/bipali): TODO
- [*BiSigLip*](https://huggingface.co/vidore/bisiglip): TODO
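
As an illustration, here is a minimal retrieval sketch with the released ColPali checkpoint, assuming the `colpali_engine` package from the training repository above. The class names, processor helpers, and `score_multi_vector` utility are assumptions; see that repository for the actual entry points:

```python
import torch
from PIL import Image

# Assumed import paths from the training repository's colpali_engine package
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali")

# Embed a page image and a query into multi-vector representations
image_batch = processor.process_images([Image.open("page.png")])
query_batch = processor.process_queries(["What is the total revenue?"])
with torch.no_grad():
    image_embeddings = model(**image_batch)
    query_embeddings = model(**query_batch)

# Late interaction (MaxSim) scoring between queries and pages
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```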
## Authorship + Citation

TODO: Contact

If you use any datasets or models from this organisation in your research, please cite our work as follows:

**BibTeX Citation**
```latex
[include BibTeX]
```