---
license: apache-2.0
---


## Overview

The [MMDocIR/MMDocIR_Retrievers](https://huggingface.co/MMDocIR/MMDocIR_Retrievers) Hugging Face repository contains all retriever checkpoints needed for [MMDocIR](https://github.com/MMDocRAG/MMDocIR); see [Download Retriever Checkpoints](https://github.com/MMDocRAG/MMDocIR?tab=readme-ov-file#2-download-retriever-checkpoints) for usage.



## 🛠️Retriever Checkpoints

The available retrievers are as follows:

- **BGE**: [bge-large-en-v1.5](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/bge-large-en-v1.5), cloned from [BAAI](https://huggingface.co/BAAI)/[bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5).
- **ColBERT**: [colbertv2.0](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colbertv2.0), cloned from [colbert-ir](https://huggingface.co/colbert-ir)/[colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0).
- **E5**: [e5-large-v2](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/e5-large-v2), cloned from [intfloat](https://huggingface.co/intfloat)/[e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).
- **GTE**: [gte-large](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/gte-large), cloned from [thenlper](https://huggingface.co/thenlper)/[gte-large](https://huggingface.co/thenlper/gte-large).
- **Contriever**: [contriever-msmarco](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/contriever-msmarco), cloned from [facebook](https://huggingface.co/facebook)/[contriever-msmarco](https://huggingface.co/facebook/contriever-msmarco).
- **DPR**:
  - question encoder: [dpr-question_encoder-multiset-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dpr-question_encoder-multiset-base), cloned from [facebook](https://huggingface.co/facebook)/[dpr-question_encoder-multiset-base](https://huggingface.co/facebook/dpr-question_encoder-multiset-base).
  - passage encoder: [dpr-ctx_encoder-multiset-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dpr-ctx_encoder-multiset-base), cloned from [facebook](https://huggingface.co/facebook)/[dpr-ctx_encoder-multiset-base](https://huggingface.co/facebook/dpr-ctx_encoder-multiset-base).
- **ColPali**:
  - retriever adapter: [colpali-v1.1](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1), cloned from [vidore](https://huggingface.co/vidore)/[colpali-v1.1](https://huggingface.co/vidore/colpali-v1.1).
  - retriever base VLM: [colpaligemma-3b-mix-448-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpaligemma-3b-mix-448-base), cloned from [vidore](https://huggingface.co/vidore)/[colpaligemma-3b-mix-448-base](https://huggingface.co/vidore/colpaligemma-3b-mix-448-base).
- **ColQwen**:
  - retriever adapter: [colqwen2-v1.0](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colqwen2-v1.0), cloned from [vidore](https://huggingface.co/vidore)/[colqwen2-v1.0](https://huggingface.co/vidore/colqwen2-v1.0).
  - retriever base VLM: [colqwen2-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colqwen2-base), cloned from [vidore](https://huggingface.co/vidore)/[colqwen2-base](https://huggingface.co/vidore/colqwen2-base).
- **DSE-wikiss**: [dse-phi3-v1](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dse-phi3-v1), processed as follows:
  - cloned from [Tevatron](https://huggingface.co/Tevatron)/[dse-phi3-v1.0](https://huggingface.co/Tevatron/dse-phi3-v1.0);
  - fixed a batch-processing issue based on [this Phi-3-vision discussion](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/32/files);
  - changed `config.json` and `preprocessor_config.json` to point to the `.py` files inside the checkpoint.
- **DSE-docmatix**: [dse-phi3-docmatix-v2](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dse-phi3-docmatix-v2), cloned from [Tevatron](https://huggingface.co/Tevatron)/[dse-phi3-docmatix-v2](https://huggingface.co/Tevatron/dse-phi3-docmatix-v2).
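The checkpoints above live in per-retriever subfolders of the repository. A small helper like the following can enumerate them in download or evaluation scripts; the mapping and function names here are our own, for illustration only:

```python
# Map each retriever (short names are ours) to its subfolder in the
# MMDocIR/MMDocIR_Retrievers repository, as listed in the README above.
RETRIEVER_SUBFOLDERS = {
    "bge": "bge-large-en-v1.5",
    "colbert": "colbertv2.0",
    "e5": "e5-large-v2",
    "gte": "gte-large",
    "contriever": "contriever-msmarco",
    "dpr-question": "dpr-question_encoder-multiset-base",
    "dpr-passage": "dpr-ctx_encoder-multiset-base",
    "colpali-adapter": "colpali-v1.1",
    "colpali-base": "colpaligemma-3b-mix-448-base",
    "colqwen-adapter": "colqwen2-v1.0",
    "colqwen-base": "colqwen2-base",
    "dse-wikiss": "dse-phi3-v1",
    "dse-docmatix": "dse-phi3-docmatix-v2",
}

def checkpoint_path(name: str) -> str:
    """Return the repo-relative path of a retriever checkpoint."""
    return f"MMDocIR/MMDocIR_Retrievers/{RETRIEVER_SUBFOLDERS[name]}"

print(checkpoint_path("bge"))  # MMDocIR/MMDocIR_Retrievers/bge-large-en-v1.5
```

Such a table keeps download scripts in sync with the list above without hard-coding paths in multiple places.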




## Environment

```bash
python 3.9
torch==2.4.0+cu121
transformers==4.45.0
sentence-transformers==2.2.2  # for BGE, GTE, E5 retrievers
colbert-ai==0.2.21            # for the ColBERT retriever
flash-attn==2.7.4.post1       # for DSE retrievers to run with flash attention
```
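If convenient, the pins above can be collected into a `requirements.txt` sketch (assuming these are the PyPI package names; the CUDA 12.1 torch wheel needs PyTorch's extra index URL):

```text
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.4.0+cu121
transformers==4.45.0
sentence-transformers==2.2.2
colbert-ai==0.2.21
flash-attn==2.7.4.post1
```

Note that `flash-attn` typically requires torch to be installed first, so installing it in a separate, later `pip install` step is the usual workaround.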


## 💾Citation

If you use any datasets or models from this organization in your research, please cite our work as follows:

```bibtex
@misc{dong2025mmdocirbenchmarkingmultimodalretrieval,
      title={MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents},
      author={Kuicai Dong and Yujing Chang and Xin Deik Goh and Dexun Li and Ruiming Tang and Yong Liu},
      year={2025},
      eprint={2501.08828},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2501.08828},
}
```