---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- openbmb/EVisRAG-Train
---

# VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

[![Github](https://img.shields.io/badge/VisRAG-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/OpenBMB/VisRAG)
[![arXiv](https://img.shields.io/badge/arXiv-2410.10594-ff0000.svg?style=for-the-badge)](https://arxiv.org/abs/2410.10594)
[![Hugging Face](https://img.shields.io/badge/EVisRAG-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/openbmb/EVisRAG-7B)

<p align="center">
<a href="#-introduction">📖 Introduction</a> •
<a href="#-news">🎉 News</a> •
<a href="#-evisrag-pipeline">✨ EVisRAG Pipeline</a> •
<a href="#%EF%B8%8F-setup">⚙️ Setup</a> •
<a href="#%EF%B8%8F-training">⚡️ Training</a>
</p>
<p align="center">
<a href="#-evaluation">📃 Evaluation</a> •
<a href="#-usage">🔧 Usage</a> •
<a href="#-license">📄 License</a> •
<a href="#-contact">📧 Contact</a> •
<a href="#-star-history">📈 Star History</a>
</p>

# 📖 Introduction

**EVisRAG (VisRAG 2.0)** is an evidence-guided visual retrieval-augmented generation framework that equips VLMs for multi-image questions: the model first observes each retrieved image in language to collect per-image evidence, then reasons over those cues to produce an answer. **EVisRAG** is trained with Reward-Scoped GRPO (RS-GRPO), which applies fine-grained token-level rewards to jointly optimize visual perception and reasoning.

<p align="center"><img width=800 src="assets/evisrag.png"/></p>

# 🎉 News

* 2025-10-01: Released **EVisRAG (VisRAG 2.0)**, an end-to-end vision-language model. Released our [Paper]() on arXiv, our [Model](https://huggingface.co/openbmb/EVisRAG-7B) on Hugging Face, and our [Code](https://github.com/OpenBMB/VisRAG) on GitHub.

# ✨ EVisRAG Pipeline

**EVisRAG** is an end-to-end framework that equips VLMs with precise visual perception during reasoning in multi-image scenarios. We trained and released VLRMs with EVisRAG built on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).
 
 

# ⚙️ Setup

```bash
git clone https://github.com/OpenBMB/VisRAG.git
conda create --name EVisRAG python=3.10
conda activate EVisRAG
cd VisRAG
pip install -r EVisRAG_requirements.txt
```

# ⚡️ Training

***Stage 1: SFT*** (based on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory))

```bash
git clone https://github.com/hiyouga/LLaMA-Factory.git
bash evisrag_scripts/full_sft.sh
```

***Stage 2: RS-GRPO*** (based on [EasyR1](https://github.com/hiyouga/EasyR1))

```bash
bash evisrag_scripts/run_rsgrpo.sh
```

Notes:

1. The training data is available on Hugging Face under [`openbmb/EVisRAG-Train`](https://huggingface.co/datasets/openbmb/EVisRAG-Train), as referenced at the top of this page.
2. We adopt a two-stage training strategy. In the first stage, clone `LLaMA-Factory` and update the model path in the `full_sft.sh` script. In the second stage, we run `RS-GRPO`, our customized algorithm built on `EasyR1` and designed specifically for EVisRAG; its implementation can be found in `src/RS-GRPO`.
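To make the "reward scoping" idea concrete, here is a minimal illustrative sketch (not the implementation in `src/RS-GRPO`): each reward component credits only the token span of the generation stage it evaluates, so, for example, a perception reward touches only evidence tokens while an answer reward touches only answer tokens. The span boundaries and reward names below are hypothetical.

```python
# Illustrative sketch of token-level reward scoping; NOT the code in
# src/RS-GRPO. Each reward component is spread only over the token
# range of the stage it evaluates; all other tokens receive zero.
def scoped_token_rewards(seq_len, spans, rewards):
    """spans: {stage: (start, end)} token ranges; rewards: {stage: scalar}."""
    token_rewards = [0.0] * seq_len
    for stage, (start, end) in spans.items():
        for i in range(start, end):
            token_rewards[i] += rewards[stage]
    return token_rewards

# Hypothetical example: an evidence reward on tokens 0-3 and an
# answer reward on tokens 7-9 of a 10-token completion.
r = scoped_token_rewards(
    seq_len=10,
    spans={"evidence": (0, 4), "answer": (7, 10)},
    rewards={"evidence": 0.5, "answer": 1.0},
)
print(r)  # tokens outside both spans stay at 0.0
```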

# 📃 Evaluation

```bash
bash evisrag_scripts/predict.sh
bash evisrag_scripts/eval.sh
```

Notes:

1. The test data is available on Hugging Face under `EVisRAG-Test-xxx`, as referenced at the top of this page.
2. To run evaluation, first execute the `predict.sh` script; the model outputs will be saved in the `preds` directory. Then use the `eval.sh` script to evaluate the predictions. The metrics `EM`, `Accuracy`, and `F1` are reported directly.
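For reference, `EM` and token-level `F1` are commonly computed as below. This is a minimal sketch of the standard definitions, not necessarily the exact normalization used by `eval.sh`.

```python
from collections import Counter

def exact_match(pred, gold):
    # 1 if prediction and gold agree after trimming and lowercasing.
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    # Harmonic mean of token precision and recall over lowercased tokens.
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Yes", "yes"))                       # 1
print(round(token_f1("the red bus", "a red bus"), 2))  # 0.67
```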

# 🔧 Usage

Model on Hugging Face: https://huggingface.co/openbmb/EVisRAG-7B

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

def evidence_prompt_grpo(query):
    return f"""You are an AI Visual QA assistant. I will provide you with a question and several images. Please follow the four steps below:

Step 1: Observe the Images
First, analyze the question and consider what types of images may contain relevant information. Then, examine each image one by one, paying special attention to aspects related to the question. Identify whether each image contains any potentially relevant information.
Wrap your observations within <observe></observe> tags.

Step 2: Record Evidences from Images
After reviewing all images, record the evidence you find for each image within <evidence></evidence> tags.
If you are certain that an image contains no relevant information, record it as: [i]: no relevant information(where i denotes the index of the image).
If an image contains relevant evidence, record it as: [j]: [the evidence you find for the question](where j is the index of the image).

Step 3: Reason Based on the Question and Evidences
Based on the recorded evidences, reason about the answer to the question.
Include your step-by-step reasoning within <think></think> tags.

Step 4: Answer the Question
Provide your final answer based only on the evidences you found in the images.
Wrap your answer within <answer></answer> tags.
Avoid adding unnecessary contents in your final answer, like if the question is a yes/no question, simply answer "yes" or "no".
If none of the images contain sufficient information to answer the question, respond with <answer>insufficient to answer</answer>.

Formatting Requirements:
Use the exact tags <observe>, <evidence>, <think>, and <answer> for structured output.
It is possible that none, one, or several images contain relevant evidence.
If you find no evidence or few evidences, and insufficient to help you answer the question, follow the instruction above for insufficient information.

Question and images are provided below. Please follow the steps as instructed.
Question: {query}
"""

model_path = "xxx"  # path to the EVisRAG weights, e.g. the Hugging Face repo above
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, padding_side="left")

imgs, query = ["imgpath1", "imgpath2", ..., "imgpathX"], "What xxx?"
input_prompt = evidence_prompt_grpo(query)

# Build a multimodal chat message: the prompt text followed by each image.
content = [{"type": "text", "text": input_prompt}]
for img_path in imgs:
    content.append({"type": "image", "image": img_path})
msg = [{
    "role": "user",
    "content": content,
}]

llm = LLM(
    model=model_path,
    tensor_parallel_size=1,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 5, "video": 0},
)

sampling_params = SamplingParams(
    temperature=0.1,
    repetition_penalty=1.05,
    max_tokens=2048,
)

prompt = processor.apply_chat_template(
    msg,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, _ = process_vision_info(msg)

msg_input = [{
    "prompt": prompt,
    "multi_modal_data": {"image": image_inputs},
}]

output_texts = llm.generate(
    msg_input,
    sampling_params=sampling_params,
)

print(output_texts[0].outputs[0].text)
```
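Because the model emits its response in the structured tags defined by the prompt, the final answer can be pulled out with a small parser. Below is a minimal sketch; the helper name and regex are ours, not part of the released code.

```python
import re

def parse_structured_output(text):
    # Return {tag: content} for the four EVisRAG output tags,
    # with None for any tag missing from the response.
    sections = {}
    for tag in ("observe", "evidence", "think", "answer"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        sections[tag] = m.group(1).strip() if m else None
    return sections

# Example response following the prompt's tag format.
response = (
    "<observe>Image 1 shows a red double-decker bus.</observe>"
    "<evidence>[1]: the bus is red</evidence>"
    "<think>The question asks for the bus color; evidence [1] says red.</think>"
    "<answer>red</answer>"
)
print(parse_structured_output(response)["answer"])  # red
```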

# 📄 License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) license.
* Usage of the **EVisRAG** model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).

# 📧 Contact

## EVisRAG
- Yubo Sun: syb2000417@stu.pku.edu.cn
- Chunyi Peng: hm.cypeng@gmail.com