## Organisation
### Models [add description of released model]

- [*ColPali*](https://huggingface.co/vidore/colpali): TODO
- [*BiPali*](https://huggingface.co/vidore/bipali): TODO
- [*BiSigLip*](https://huggingface.co/vidore/bisiglip): TODO
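ColPali matches queries against pages with a late-interaction (ColBERT-style MaxSim) mechanism over multi-vector embeddings. The following is a minimal sketch with random embeddings; the embedding dimension and token/patch counts are illustrative assumptions, not the models' actual shapes:

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: for each query-token vector, take the maximum
    dot-product similarity over all page-patch vectors, then sum over tokens."""
    sims = query_emb @ page_emb.T  # (n_query_tokens, n_page_patches)
    return float(sims.max(axis=1).sum())

# Toy multi-vector embeddings (shapes are illustrative, not ColPali's real ones).
rng = np.random.default_rng(0)
query = rng.normal(size=(16, 128))                        # 16 query-token vectors
pages = [rng.normal(size=(1024, 128)) for _ in range(3)]  # 3 candidate pages

scores = [late_interaction_score(query, p) for p in pages]
best_page = int(np.argmax(scores))
```

Because each query token independently picks its best-matching page patch, MaxSim rewards pages that cover all parts of the query rather than those that are close on average.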
### Datasets
We organized the datasets into collections that constitute our benchmark ViDoRe and its derivatives (OCR and captioning). Below is a brief description of each of them.

- [*ViDoRe Benchmark*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d): collection grouping all datasets constituting the ViDoRe benchmark. It includes the test sets from several academic datasets ([ArXiVQA](https://huggingface.co/datasets/vidore/arxivqa_test_subsampled), [DocVQA](https://huggingface.co/datasets/vidore/docvqa_test_subsampled), [InfoVQA](https://huggingface.co/datasets/vidore/infovqa_test_subsampled), [TATDQA](https://huggingface.co/datasets/vidore/tatdqa_test), [TabFQuAD](https://huggingface.co/datasets/vidore/tabfquad_test_subsampled)) as well as synthetically generated datasets spanning various themes and industrial applications ([Artificial Intelligence](https://huggingface.co/datasets/vidore/syntheticDocQA_artificial_intelligence_test), [Government Reports](https://huggingface.co/datasets/vidore/syntheticDocQA_government_reports_test), [Healthcare Industry](https://huggingface.co/datasets/vidore/syntheticDocQA_healthcare_industry_test), [Energy](https://huggingface.co/datasets/vidore/syntheticDocQA_energy_test) and [Shift Project](https://huggingface.co/datasets/vidore/shiftproject_test)). Further details can be found on the corresponding dataset cards.
- [*OCR Baseline*](https://huggingface.co/collections/vidore/vidore-chunk-ocr-baseline-666acce88c294ef415548a56): the same datasets as in ViDoRe, preprocessed for text retrieval. Each page of the original benchmark was partitioned into chunks with Unstructured; visual chunks were OCRed with Tesseract.
- [*Captioning Baseline*](https://huggingface.co/collections/vidore/vidore-captioning-baseline-6658a2a62d857c7a345195fd): the same datasets as in ViDoRe, preprocessed for text retrieval. Each page of the original benchmark was partitioned into chunks with Unstructured; visual chunks were captioned with Claude Sonnet.
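Since the OCR and captioning baselines reduce each page to text chunks, any standard text retriever can be run on top of them. As a rough illustration only, here is a toy TF-IDF ranker over plain-string chunks; it is a sketch, not the retriever actually used in the benchmark:

```python
import math
from collections import Counter

def tfidf_rank(query: str, chunks: list[str]) -> list[int]:
    """Rank text chunks against a query with a toy TF-IDF cosine score.
    Illustrative only; a real baseline would use BM25 or a neural retriever."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term

    def vec(tokens: list[str]) -> dict[str, float]:
        tf = Counter(tokens)
        return {t: tf[t] * math.log(1 + n / df[t]) for t in tf if t in df}

    def cos(a: dict[str, float], b: dict[str, float]) -> float:
        num = sum(a[t] * b.get(t, 0.0) for t in a)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values()))) or 1.0
        return num / den

    q = vec(query.lower().split())
    return sorted(range(n), key=lambda i: cos(q, vec(docs[i])), reverse=True)

chunks = [
    "solar energy production report",
    "hospital staffing figures",
    "government budget tables",
]
ranking = tfidf_rank("energy report", chunks)
```

The chunk texts and query above are made-up examples; in practice the chunks would come from the OCR or captioning collections.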
**Intended use**
To use the whole benchmark, you can list the datasets in the collection and load each of them, for example as follows (a sketch assuming the `huggingface_hub` and `datasets` libraries; the split name is assumed to be `test`):

```python
from datasets import load_dataset
from huggingface_hub import get_collection

# List every dataset in the ViDoRe benchmark collection.
collection = get_collection("vidore/vidore-benchmark-667173f98e70a1c0fa4db00d")

datasets = []
for item in collection.items:
    if item.item_type != "dataset":
        continue
    dataset = load_dataset(item.item_id, split="test")
    datasets.append(dataset)
```
## Authorship + Citation
TODO: Contact