</a>
</div>
<p align="center">
  <a href="#introduction">Introduction</a> •
  <a href="#news">News</a> •
  <a href="#visrag-pipeline">VisRAG Pipeline</a> •
  <a href="#training">Training</a>
</p>
<p align="center">
  <a href="#requirements">Requirements</a> •
  <a href="#usage">Usage</a> •
  <a href="#license">License</a> •
  <a href="#citation">Citation</a> •
  <a href="#contact">Contact</a>
</p>

# Introduction

**VisRAG** is a novel retrieval-augmented generation (RAG) pipeline based on vision-language models (VLMs). Instead of first parsing a document to obtain text, VisRAG embeds the document directly as an image with a VLM and then retrieves it to enhance the generation of a VLM. Compared to traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced by the parsing process.

<p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>

# News

* 20241015: Released our train and test data on Hugging Face; they can be found in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection, which is referenced at the beginning of this page.
* 20241014: Released our [paper](https://arxiv.org/abs/2410.10594) on arXiv and our [model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face.

# VisRAG Pipeline

## VisRAG-Ret

**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.

## VisRAG-Gen

In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators; in practice, you can use any VLM you like.
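Because generation uses off-the-shelf VLMs, plugging retrieved pages into a generator is essentially prompt assembly. A minimal sketch of the idea; the `generate` callable and the message format are hypothetical placeholders, not the API of any specific VLM:

```python
from typing import Any, Callable, List

def visrag_generate(question: str,
                    retrieved_pages: List[Any],
                    generate: Callable[[list], str]) -> str:
    """Feed the top-k retrieved page images plus the question to any VLM.

    `retrieved_pages` stands in for PIL images (or paths); `generate` is
    whatever chat/completion call the chosen VLM exposes.
    """
    # Interleave the page images before the question in one multimodal turn.
    content = list(retrieved_pages) + [
        "Answer the question based on the pages above.\nQuestion: " + question
    ]
    return generate([{"role": "user", "content": content}])

# Toy generator that just reports how many pages it received.
def echo_vlm(messages):
    content = messages[0]["content"]
    return f"saw {len(content) - 1} pages"

print(visrag_generate("What is VisRAG?", ["page1.png", "page2.png"], echo_vlm))
```

The only VisRAG-specific part is retrieving page *images* rather than parsed text; everything after retrieval is an ordinary VLM call.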

# Training

## VisRAG-Ret

Our training dataset of 362,110 query-document (Q-D) pairs for **VisRAG-Ret** comprises the train sets of openly available academic datasets (34%) and a synthetic dataset of pages from web-crawled PDF documents augmented with VLM-generated (GPT-4o) pseudo-queries (66%). It can be found in the `VisRAG` Collection on Hugging Face, which is referenced at the beginning of this page.

## VisRAG-Gen

The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.

# Requirements

```
torch==2.1.2
torchvision==0.16.2
decord==0.6.0
Pillow==10.1.0
```

# Usage

## VisRAG-Ret

```python
from transformers import AutoModel, AutoTokenizer
import torch

# ... (encoding of queries and documents elided in this view) ...

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
```
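The final lines of the snippet score every query against every document with a plain dot product over (presumably L2-normalized) embeddings. A dependency-free sketch of that scoring step, with made-up 3-dimensional vectors standing in for the model's outputs:

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def score_matrix(queries, docs):
    """List-based equivalent of `embeddings_query @ embeddings_doc.T`."""
    return [[sum(q_i * d_i for q_i, d_i in zip(q, d)) for d in docs]
            for q in queries]

# Toy stand-ins for VisRAG-Ret query/document embeddings.
queries = [normalize([1.0, 0.0, 1.0])]
docs = [normalize([1.0, 0.0, 1.0]),   # identical to the query
        normalize([0.0, 1.0, 0.0])]   # orthogonal to the query

scores = score_matrix(queries, docs)
print(scores)  # first document scores ~1.0, second ~0.0
```

With normalized embeddings, higher scores mean more relevant pages; retrieval then simply takes the top-k documents per query.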

# License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of **VisRAG-Ret** model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.

# Citation

```
@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
  ...
}
```

# Contact

- Shi Yu: yus21@mails.tsinghua.edu.cn
- Chaoyue Tang: tcy006@gmail.com