Safetensors
daviddongdong committed on
Commit
0f0a5a9
·
verified ·
1 Parent(s): 4ab5e0d

Update README.md

Files changed (1)
  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ ---
4
+
5
+
6
+ ## Overview
7
+
8
+ The [MMDocIR/MMDocIR_Retrievers](https://huggingface.co/MMDocIR/MMDocIR_Retrievers) Hugging Face repository contains all retriever checkpoints needed for [MMDocIR](https://github.com/MMDocRAG/MMDocIR), specifically for the [Download Retriever Checkpoints](https://github.com/MMDocRAG/MMDocIR?tab=readme-ov-file#2-download-retriever-checkpoints) step.
9
+
10
+
11
+
12
+ ## 🛠️Retriever Checkpoints
13
+
14
+ The available retrievers are as follows:
15
+
16
+ - **BGE**: [bge-large-en-v1.5](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/bge-large-en-v1.5) which is cloned from [BAAI](https://huggingface.co/BAAI)/[bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5).
17
+ - **ColBERT**: [colbertv2.0](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colbertv2.0) which is cloned from [colbert-ir](https://huggingface.co/colbert-ir)/[colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0).
18
+ - **E5**: [e5-large-v2](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/e5-large-v2) which is cloned from [intfloat](https://huggingface.co/intfloat)/[e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).
19
+ - **GTE**: [gte-large](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/gte-large) which is cloned from [thenlper](https://huggingface.co/thenlper)/[gte-large](https://huggingface.co/thenlper/gte-large).
20
+ - **Contriever**: [contriever-msmarco](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/contriever-msmarco) which is cloned from [facebook](https://huggingface.co/facebook)/[contriever-msmarco](https://huggingface.co/facebook/contriever-msmarco).
21
+ - **DPR**:
22
+   - question encoder: [dpr-question_encoder-multiset-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dpr-question_encoder-multiset-base) which is cloned from [facebook](https://huggingface.co/facebook)/[dpr-question_encoder-multiset-base](https://huggingface.co/facebook/dpr-question_encoder-multiset-base).
23
+   - passage encoder: [dpr-ctx_encoder-multiset-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dpr-ctx_encoder-multiset-base) which is cloned from [facebook](https://huggingface.co/facebook)/[dpr-ctx_encoder-multiset-base](https://huggingface.co/facebook/dpr-ctx_encoder-multiset-base).
24
+ - **ColPali**:
25
+   - retriever adapter: [colpali-v1.1](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpali-v1.1) which is cloned from [vidore](https://huggingface.co/vidore)/[colpali-v1.1](https://huggingface.co/vidore/colpali-v1.1).
26
+   - retriever base VLM: [colpaligemma-3b-mix-448-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colpaligemma-3b-mix-448-base) which is cloned from [vidore](https://huggingface.co/vidore)/[colpaligemma-3b-mix-448-base](https://huggingface.co/vidore/colpaligemma-3b-mix-448-base).
27
+ - **ColQwen**:
28
+   - retriever adapter: [colqwen2-v1.0](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colqwen2-v1.0) which is cloned from [vidore](https://huggingface.co/vidore)/[colqwen2-v1.0](https://huggingface.co/vidore/colqwen2-v1.0).
29
+   - retriever base VLM: [colqwen2-base](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/colqwen2-base) which is cloned from [vidore](https://huggingface.co/vidore)/[colqwen2-base](https://huggingface.co/vidore/colqwen2-base).
30
+ - **DSE-wikiss**: [dse-phi3-v1](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dse-phi3-v1) which is processed as follows:
31
+   - clone from [Tevatron](https://huggingface.co/Tevatron)/[dse-phi3-v1.0](https://huggingface.co/Tevatron/dse-phi3-v1.0).
32
+   - fix the batch processing issue based on: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/32/files
33
+   - change `config.json` and `preprocessor_config.json` to point to the `.py` files in the checkpoint.
34
+
35
+ - **DSE-docmatix**: [dse-phi3-docmatix-v2](https://huggingface.co/MMDocIR/MMDocIR_Retrievers/tree/main/dse-phi3-docmatix-v2) which is cloned from [Tevatron](https://huggingface.co/Tevatron)/[dse-phi3-docmatix-v2](https://huggingface.co/Tevatron/dse-phi3-docmatix-v2).
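
The last DSE-wikiss step (pointing `config.json` and `preprocessor_config.json` at the `.py` files in the checkpoint) is not spelled out in this README. One plausible reading, offered here only as an assumption, is that `auto_map` entries of the form `some/repo--module.Class` are rewritten to the local form `module.Class`, so that custom code is loaded from the checkpoint folder. The sketch below illustrates that idea; the helper names and the example entries are hypothetical and are not taken from the MMDocIR checkpoints.

```python
import json
from pathlib import Path
from typing import Any, Dict


def _strip_repo_prefix(ref: Any) -> Any:
    """Turn "some/repo--module.Class" into "module.Class"; leave other values alone."""
    if isinstance(ref, str):
        return ref.split("--", 1)[-1]
    if isinstance(ref, list):
        return [_strip_repo_prefix(item) for item in ref]
    return ref


def localize_auto_map(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` whose `auto_map` entries reference local modules."""
    updated = dict(config)
    if "auto_map" in config:
        updated["auto_map"] = {
            key: _strip_repo_prefix(value) for key, value in config["auto_map"].items()
        }
    return updated


def rewrite_config_file(path: Path) -> None:
    """Apply `localize_auto_map` to a JSON config file in place (hypothetical helper)."""
    config = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(json.dumps(localize_auto_map(config), indent=2), encoding="utf-8")
```

Under this assumption, `rewrite_config_file` would be run once on each of the two JSON files inside the checkpoint folder; compare the result with the files shipped in `dse-phi3-v1` before relying on it.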
36
+
37
+
38
+
39
+
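
To fetch a single retriever rather than the whole repository, the folder names in the list above can be turned into download patterns. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `checkpoint_patterns` and `download_checkpoints` and the `./retrievers` target directory are illustrative choices, not part of MMDocIR.

```python
from typing import List, Sequence

REPO_ID = "MMDocIR/MMDocIR_Retrievers"


def checkpoint_patterns(folders: Sequence[str]) -> List[str]:
    """Build glob patterns that match every file under each checkpoint folder."""
    return [f"{folder.strip('/')}/*" for folder in folders]


def download_checkpoints(folders: Sequence[str], local_dir: str = "./retrievers") -> str:
    """Download only the requested checkpoint folders and return the local path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(folders),
        local_dir=local_dir,
    )


# Example (requires network access): DPR uses two encoders, so request both folders.
# download_checkpoints(
#     ["dpr-question_encoder-multiset-base", "dpr-ctx_encoder-multiset-base"]
# )
```

Retrievers that pair an adapter with a base VLM (ColPali, ColQwen) would need both of their folders passed in the same call.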
40
+ ## Environment
41
+
42
+ ```bash
43
+ python 3.9
44
+ torch==2.4.0+cu121
45
+ transformers==4.45.0
46
+ sentence-transformers==2.2.2 # for BGE, GTE, E5 retrievers
47
+ colbert-ai==0.2.21 # for ColBERT retriever
48
+ flash-attn==2.7.4.post1 # for DSE retrievers to run with flash attention
49
+ ```
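
A quick way to see whether an existing environment matches these pins is to compare them with the installed distribution versions. This is a minimal sketch using the standard-library `importlib.metadata`; the `PINS` dictionary copies the versions listed above, while `find_mismatches` is a hypothetical helper. It assumes the installed `torch` distribution reports its version with the `+cu121` local suffix.

```python
from importlib import metadata
from typing import Callable, Dict

# Versions copied from the environment list above.
PINS: Dict[str, str] = {
    "torch": "2.4.0+cu121",
    "transformers": "4.45.0",
    "sentence-transformers": "2.2.2",
    "colbert-ai": "0.2.21",
    "flash-attn": "2.7.4.post1",
}


def find_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches: Dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            mismatches[name] = found
    return mismatches


# Example: print every package that deviates from the pinned versions.
# for name, found in find_mismatches(PINS).items():
#     print(f"{name}: expected {PINS[name]}, found {found}")
```

Packages needed only by retrievers you do not plan to run (for example `flash-attn` for DSE) can be left out of the dictionary passed in.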
50
+
51
+
52
+ ## 💾Citation
53
+ If you use any datasets or models from this organization in your research, please cite our work as follows:
54
+
55
+ ```
56
+ @misc{dong2025mmdocirbenchmarkingmultimodalretrieval,
57
+ title={MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents},
58
+ author={Kuicai Dong and Yujing Chang and Xin Deik Goh and Dexun Li and Ruiming Tang and Yong Liu},
59
+ year={2025},
60
+ eprint={2501.08828},
61
+ archivePrefix={arXiv},
62
+ primaryClass={cs.IR},
63
+ url={https://arxiv.org/abs/2501.08828},
64
+ }
65
+ ```
66
+