robustvisrag committed on
Commit 3041568 · verified · 1 Parent(s): 5d0ef35

Remove README.md

Files changed (1)
  1. README.md +0 -179
README.md DELETED
@@ -1,179 +0,0 @@
- ---
- license: apache-2.0
- datasets:
- - openbmb/VisRAG-Ret-Train-In-domain-data
- - openbmb/VisRAG-Ret-Train-Synthetic-data
- language:
- - en
- base_model:
- - openbmb/MiniCPM-V-2
- tags:
- - VisRAG
- pipeline_tag: feature-extraction
- ---
- # VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
- <div style="display: flex; align-items: center;">
- <a href="https://huggingface.co/openbmb/VisRAG-Ret" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/VisRAG_Ret-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Ret">
- </a>
- <a href="https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/VisRAG_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Collection">
- </a>
- <a href="https://huggingface.co/spaces/tcy6/VisRAG_Pipeline" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/VisRAG_Pipeline-fcd022?style=for-the-badge&logo=huggingface&logoColor=000" alt="VisRAG Pipeline">
- </a>
- <a href="https://arxiv.org/abs/2410.10594" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/arXiv-2410.10594-ff0000.svg?style=for-the-badge" alt="arXiv">
- </a>
- <a href="https://colab.research.google.com/drive/11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/VisRAG_Pipeline-ffffff?style=for-the-badge&logo=googlecolab&logoColor=f9ab00" alt="Google Colab">
- </a>
- <a href="https://github.com/openbmb/VisRAG" style="margin-right: 10px;">
- <img src="https://img.shields.io/badge/VisRAG-000000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub">
- </a>
- </div>
-
- <p align="center">•
- <a href="#📖-introduction"> 📖 Introduction </a> •
- <a href="#🎉-news">🎉 News</a> •
- <a href="#✨-visrag-pipeline">✨ VisRAG Pipeline</a> •
- <a href="#⚡️-training">⚡️ Training</a>
- </p>
- <p align="center">•
- <a href="#📦-requirements">📦 Requirements</a> •
- <a href="#🔧-usage">🔧 Usage</a> •
- <a href="#📄-license">📄 License</a> •
- <a href="#📑-citation">📑 Citation</a> •
- <a href="#📧-contact">📧 Contact</a>
- </p>
-
- # 📖 Introduction
- **VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, the pipeline embeds each document directly as an image using a VLM, then retrieves the relevant documents to enhance the generation of a VLM. Compared to traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced during the parsing process.
- <p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>
-
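The retrieval step described in the introduction can be sketched with toy vectors. This is a minimal illustration, not the VisRAG implementation: the three-dimensional page and query embeddings are made up, and `cosine` and `top_k` are hypothetical helper names. In the real pipeline the vectors would come from a VLM embedding of page images.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, page_vecs, k=2):
    """Return the indices of the k pages most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Made-up embeddings standing in for VLM embeddings of three page images.
page_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
# Made-up embedding standing in for the embedded text query.
query_vec = [0.0, 0.9, 0.1]

retrieved = top_k(query_vec, page_vecs, k=2)
print(retrieved)  # [1, 2]
```

The retrieved page images, rather than parsed text, would then be passed to the generator VLM together with the query.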
- # 🎉 News
-
- * 20241104: Released our [VisRAG Pipeline](https://huggingface.co/spaces/tcy6/VisRAG_Pipeline) on Hugging Face Space.
- * 20241031: Released our [VisRAG Pipeline](https://colab.research.google.com/drive/11KV9adDNXPfHiuFAfXNOvtYJKcyR8JZH?usp=sharing) on Colab.
- * 20241015: Released our training and test data on Hugging Face. Both are in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection, which is linked at the top of this page.
- * 20241014: Released our [Paper](https://arxiv.org/abs/2410.10594) on arXiv. Released our [Model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face. Released our [Code](https://github.com/OpenBMB/VisRAG) on GitHub.
-
- # ✨ VisRAG Pipeline
-
- ## VisRAG-Ret
- **VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.
-
- ## VisRAG-Gen
- In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like!
-
- # ⚡️ Training
-
- ## VisRAG-Ret
- Our training dataset of 362,110 Query-Document (Q-D) pairs for **VisRAG-Ret** comprises the train sets of openly available academic datasets (34%) and a synthetic dataset (66%) made up of pages from web-crawled PDF documents, augmented with pseudo-queries generated by a VLM (GPT-4o). It can be found in the `VisRAG` Collection on Hugging Face, which is linked at the top of this page.
-
- ## VisRAG-Gen
- The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.
-
- # 📦 Requirements
- ```
- torch==2.1.2
- torchvision==0.16.2
- transformers==4.40.2
- sentencepiece==0.1.99
- decord==0.6.0
- Pillow==10.1.0
- ```
-
- # 🔧 Usage
-
- ## VisRAG-Ret
- ```python
- from transformers import AutoModel, AutoTokenizer
- import torch
- import torch.nn.functional as F
- from PIL import Image
- import requests
- from io import BytesIO
-
- def weighted_mean_pooling(hidden, attention_mask):
-     # Weight each non-padding token by its 1-based position among the
-     # non-padding tokens, so later tokens contribute more to the mean.
-     attention_mask_ = attention_mask * attention_mask.cumsum(dim=1)
-     s = torch.sum(hidden * attention_mask_.unsqueeze(-1).float(), dim=1)
-     d = attention_mask_.sum(dim=1, keepdim=True).float()
-     reps = s / d
-     return reps
-
- @torch.no_grad()
- def encode(text_or_image_list):
-     # Relies on the module-level `model` and `tokenizer` loaded below.
-     if isinstance(text_or_image_list[0], str):
-         # Text input (queries): no images.
-         inputs = {
-             "text": text_or_image_list,
-             "image": [None] * len(text_or_image_list),
-             "tokenizer": tokenizer,
-         }
-     else:
-         # Image input (document pages): empty text.
-         inputs = {
-             "text": [""] * len(text_or_image_list),
-             "image": text_or_image_list,
-             "tokenizer": tokenizer,
-         }
-     outputs = model(**inputs)
-     attention_mask = outputs.attention_mask
-     hidden = outputs.last_hidden_state
-
-     # Pool the token states, then L2-normalize so that dot products
-     # between embeddings are cosine similarities.
-     reps = weighted_mean_pooling(hidden, attention_mask)
-     embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
-     return embeddings
-
- model_name_or_path = "openbmb/VisRAG-Ret"
- tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
- model = AutoModel.from_pretrained(model_name_or_path, torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
- model.eval()
-
- # Queries are prefixed with the retrieval instruction before encoding.
- queries = ["What does a dog look like?"]
- INSTRUCTION = "Represent this query for retrieving relevant documents: "
- queries = [INSTRUCTION + query for query in queries]
-
- print("Downloading images...")
- passages = [
-     Image.open(BytesIO(requests.get(
-         'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/cat.jpeg'
-     ).content)).convert('RGB'),
-     Image.open(BytesIO(requests.get(
-         'https://github.com/OpenBMB/VisRAG/raw/refs/heads/master/scripts/demo/retriever/test_image/dog.jpg'
-     ).content)).convert('RGB')
- ]
- print("Images downloaded.")
-
- embeddings_query = encode(queries)
- embeddings_doc = encode(passages)
-
- # One row per query, one column per document image.
- scores = (embeddings_query @ embeddings_doc.T)
- print(scores.tolist())
- ```
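To see what `weighted_mean_pooling` computes, the following pure-Python sketch reproduces the same arithmetic on one hand-made sequence, without `torch`. The hidden states and mask are made-up values, and `weighted_mean_pool_1seq` is a hypothetical name used only for this illustration.

```python
from itertools import accumulate

def weighted_mean_pool_1seq(hidden, mask):
    """Position-weighted mean of token vectors for a single sequence.

    hidden: list of token vectors (lists of floats), one per token.
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    # Same weighting as mask * mask.cumsum(): 1, 2, 3, ... on real tokens,
    # 0 on padding tokens.
    weights = [m * c for m, c in zip(mask, accumulate(mask))]
    total = sum(weights)
    dim = len(hidden[0])
    return [
        sum(w * vec[j] for w, vec in zip(weights, hidden)) / total
        for j in range(dim)
    ]

# Three real tokens followed by one padding token (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = weighted_mean_pool_1seq(hidden, mask)
print(pooled)
```

Here the weights are 1, 2, 3, 0, so the pooled vector is (4/6, 5/6): the last real token counts three times as much as the first, and the padding token contributes nothing.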
155
-
156
- # 📄 License
157
-
158
- * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
159
- * The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
160
- * The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.
161
-
- # 📑 Citation
-
- ```
- @misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
-       title={VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents},
-       author={Shi Yu and Chaoyue Tang and Bokai Xu and Junbo Cui and Junhao Ran and Yukun Yan and Zhenghao Liu and Shuo Wang and Xu Han and Zhiyuan Liu and Maosong Sun},
-       year={2024},
-       eprint={2410.10594},
-       archivePrefix={arXiv},
-       primaryClass={cs.IR},
-       url={https://arxiv.org/abs/2410.10594},
- }
- ```
-
- # 📧 Contact
-
- - Shi Yu: yus21@mails.tsinghua.edu.cn
- - Chaoyue Tang: tcy006@gmail.com