AaronCIH committed on
Commit
a8c50ce
·
verified ·
1 Parent(s): 60bd5cb

Upload folder using huggingface_hub

scripts/VisRAG_Dataset.md ADDED
@@ -0,0 +1,360 @@
+ # datasets
+
+ ## How to load the training datasets:
+ ```
+ from datasets import load_dataset
+ db = load_dataset(...)["train"]
+ for x in db:
+     # x is a dict, e.g.
+     # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile
+     #  image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
+     ...
+ ```
+ ## How to load the test datasets:
+ ```
+ from datasets import load_dataset
+ dbcorpus = load_dataset(..., "corpus")["train"]
+ dbqrels = load_dataset(..., "qrels")["train"]
+ dbqueries = load_dataset(..., "queries")["train"]
+ ```
+ ## If the corpus is an image collection
+ ```
+ for x in dbcorpus:
+     # x is a dict, e.g.
+     # {"corpus-id": "<image id>", "image": <PIL.PngImagePlugin.PngImageFile
+     #  image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
+     ...
+ for x in dbqrels:
+     # x is a dict, e.g.
+     # {"query-id": "<query id>", "corpus-id": "<image id>"}
+     ...
+ for x in dbqueries:
+     # x is a dict, e.g.
+     # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
+     ...
+ ```
+
+ ## If the corpus is an OCR dataset
+ ```
+ for x in dbcorpus:
+     # x is a dict, e.g.
+     # {"corpus-id": "6519.png", "text": "string to describe a photo"}
+     ...
+ for x in dbqrels:
+     # x is a dict, e.g.
+     # {"query-id": "<query id>", "corpus-id": "<image id>"}
+     ...
+ for x in dbqueries:
+     # x is a dict, e.g.
+     # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
+     ...
+ ```
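The three parts are tied together by their id fields: each qrels row links one `query-id` to one relevant `corpus-id`. The following is a minimal sketch of building that lookup, assuming the rows are plain dicts with the keys shown above. The helper name `build_qrels` and the sample rows are made up for illustration, so nothing is downloaded.

```python
from collections import defaultdict


def build_qrels(qrels_rows):
    """Map each query-id to the set of corpus-ids judged relevant for it."""
    relevant = defaultdict(set)
    for row in qrels_rows:
        relevant[row["query-id"]].add(row["corpus-id"])
    return dict(relevant)


# Hypothetical rows standing in for dbqrels / dbqueries.
sample_qrels = [
    {"query-id": "q1", "corpus-id": "6519.png"},
    {"query-id": "q1", "corpus-id": "6520.png"},
    {"query-id": "q2", "corpus-id": "7001.png"},
]
sample_queries = [
    {"query-id": "q1", "query": "What is the peak value?", "answer": "42"},
    {"query-id": "q2", "query": "Which year is shown?", "answer": "2019"},
]

qrels = build_qrels(sample_qrels)
for q in sample_queries:
    # Print each query id next to the corpus ids that answer it.
    print(q["query-id"], sorted(qrels.get(q["query-id"], [])))
```

With the real data, `build_qrels(dbqrels)` should work the same way, since iterating a loaded split yields one dict per row.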
52
+
53
+
54
+
55
+ # Train datasets:
56
+ ## In-domain dataset (122k; arxiv, plotqa, ...)
57
+ ```
58
+ load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
59
+ ```
60
+ ## Synthetic dataset (239k)
61
+ ```
62
+ load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
63
+ ```
64
+ # Test datasets (each test dataset has three parts: corpus, qrels, queries; available in an image version and an OCR version)
65
+ # Image version
66
+ ## Clean PlotQA
67
+ ```
68
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
69
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
70
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
71
+ ```
72
+ ## Clean SlideVQA
73
+ ```
74
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
75
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
76
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
77
+ ```
78
+ ## Clean InfoVQA
79
+ ```
80
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
81
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
82
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
83
+ ```
84
+ ## Clean ArxivQA
85
+ ```
86
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
87
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
88
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
89
+ ```
90
+ ## Clean ChartQA
91
+ ```
92
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
93
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
94
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
95
+ ```
96
+ ## Clean MP-DocVQA
97
+ ```
98
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
99
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
100
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
101
+ ```
102
+ ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
103
+ ```
104
+ load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
105
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
106
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
107
+ ```
108
+ ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
109
+ ```
110
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
111
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
112
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
113
+ ```
114
+ ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
115
+ ```
116
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
117
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
118
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
119
+ ```
120
+ ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
121
+ ```
122
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
123
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
124
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
125
+ ```
126
+ ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
127
+ ```
128
+ load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
129
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
130
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
131
+ ```
132
+ ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
133
+ ```
134
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
135
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
136
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
137
+
138
+ ```
139
+ ## rvl cdip (3k), clean
140
+ ```
141
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
142
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
143
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
144
+ ```
145
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
146
+ ```
147
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
148
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
149
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
150
+ ```
151
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
152
+ ```
153
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
154
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
155
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
156
+ ```
157
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
158
+ ```
159
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
160
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
161
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
162
+ ```
163
+
164
+ # OCR version (PPOCR-v5)
165
+ ## Clean PlotQA
166
+ ```
167
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
168
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
169
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
170
+ ```
171
+ ## Clean SlideVQA
172
+ ```
173
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
174
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
175
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
176
+ ```
177
+ ## Clean InfoVQA
178
+ ```
179
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
180
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
181
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
182
+ ```
183
+ ## Clean ArxivQA
184
+ ```
185
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
186
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
187
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
188
+ ```
189
+ ## Clean ChartQA
190
+ ```
191
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
192
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
193
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
194
+ ```
195
+ ## Clean MP-DocVQA
196
+ ```
197
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
198
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
199
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
200
+ ```
201
+ ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
202
+ ```
203
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
204
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
205
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
206
+ ```
207
+ ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
208
+ ```
209
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
210
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
211
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
212
+ ```
213
+ ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
214
+ ```
215
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
216
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
217
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
218
+ ```
219
+ ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
220
+ ```
221
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
222
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
223
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
224
+ ```
225
+ ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
226
+ ```
227
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
228
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
229
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
230
+ ```
231
+ ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
232
+ ```
233
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
234
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
235
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
236
+
237
+ ```
238
+ ## rvl cdip (3k), clean
239
+ ```
240
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
241
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
242
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
243
+ ```
244
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
245
+ ```
246
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
247
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
248
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
249
+ ```
250
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
251
+ ```
252
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
253
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
254
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
255
+ ```
256
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
257
+ ```
258
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
259
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
260
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
261
+ ```
262
+
263
+ # OCR version (PPOCR-v3)
264
+ ## Clean PlotQA
265
+ ```
266
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
267
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
268
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
269
+ ```
270
+ ## Clean SlideVQA
271
+ ```
272
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
273
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
274
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
275
+ ```
276
+ ## Clean InfoVQA
277
+ ```
278
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
279
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
280
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
281
+ ```
282
+ ## Clean ArxivQA
283
+ ```
284
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
285
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
286
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
287
+ ```
288
+ ## Clean ChartQA
289
+ ```
290
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
291
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
292
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
293
+ ```
294
+ ## Clean MP-DocVQA
295
+ ```
296
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
297
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
298
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
299
+ ```
300
+ ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
301
+ ```
302
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
303
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
304
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
305
+ ```
306
+ ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
307
+ ```
308
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
309
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
310
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
311
+ ```
312
+ ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
313
+ ```
314
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
315
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
316
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
317
+ ```
318
+ ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
319
+ ```
320
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
321
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
322
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
323
+ ```
324
+ ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
325
+ ```
326
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
327
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
328
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
329
+ ```
330
+ ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
331
+ ```
332
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
333
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
334
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
335
+
336
+ ```
337
+ ## rvl cdip (3k), clean
338
+ ```
339
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
340
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
341
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
342
+ ```
343
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
344
+ ```
345
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
346
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
347
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
348
+ ```
349
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
350
+ ```
351
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
352
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
353
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
354
+ ```
355
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
356
+ ```
357
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
358
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
359
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
360
+ ```
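The repo ids for the six VisRAG benchmarks above follow a regular naming pattern, so the corpus repo can be derived instead of copied by hand. The sketch below encodes only the pattern visible in this list. The function names `corpus_repo` and `shared_repo` are hypothetical, and the rvl cdip and REALWORLD sets (exo7 to exo10) do not follow this pattern, so they are deliberately not covered.

```python
BENCHMARKS = ("ArxivQA", "ChartQA", "MP-DocVQA", "InfoVQA", "PlotQA", "SlideVQA")


def corpus_repo(name, modality="image", degraded=False, ocr="v5"):
    """Return the Hub repo id that holds the corpus for one benchmark variant."""
    if name not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {name}")
    if modality not in ("image", "text"):
        raise ValueError(f"unknown modality: {modality}")
    if modality == "image" and not degraded:
        # Clean images come straight from the original openbmb repo.
        return f"openbmb/VisRAG-Ret-Test-{name}"
    repo = f"rweics5cs7/exo3-original-{name}"
    if modality == "text":
        repo += "-text"      # OCR output instead of images
    if degraded:
        repo += "-deg"       # synthetically degraded variant
    if modality == "text" and ocr == "v3":
        repo += "-v3"        # PPOCR-v3; the PPOCR-v5 repos carry no suffix
    return repo


def shared_repo(name):
    """qrels and queries are always read from the clean openbmb repo."""
    return f"openbmb/VisRAG-Ret-Test-{name}"


print(corpus_repo("PlotQA", modality="text", degraded=True, ocr="v3"))
```

The returned id would then be passed as the first argument to `load_dataset(..., "corpus", cache_dir=...)`, with `shared_repo(name)` used for the "qrels" and "queries" parts.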
scripts/VisRAG_code.md ADDED
@@ -0,0 +1,185 @@
+
+ # Getting it running
+
+ ## Setting up the VisRAG environment
+ Install VisRAG.
+ ```
+ git clone https://github.com/OpenBMB/VisRAG.git
+ conda create --name VisRAG python==3.10.8
+ conda activate VisRAG
+ conda install nvidia/label/cuda-11.8.0::cuda-toolkit
+ cd VisRAG
+ pip install -r requirements.txt
+ pip install -e .
+ cd timm_modified
+ pip install -e .
+ cd ..
+ ```
+ 0. Rename or delete the original "/VisRAG/src" and "/VisRAG/scripts", then replace them with my versions: "/mnt/191/a/lyw/visrag2/VisRAG/src" and "/mnt/191/a/lyw/visrag2/VisRAG/scripts"
+ 1. pip install peft==0.10.0 # LoRA (as I recall, newer versions fail to run; 0.10.0 works)
+ 2. pip install scikit-image==0.25.2 # needed for degradation
+ 3. pip install opencv-python==4.11.0.86 # needed for degradation
+
+ 4. Update the paths in the training and eval scripts
+ ```
+ a. In "/VisRAG/scripts/train_retriever/train.sh", replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/lib" and "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with the paths of your own conda environment (two places in total).
+ b. In eval.sh, eval2.sh, evalreal.sh, evalreal2.sh, and evalreal3.sh under "/VisRAG/scripts/eval_retriever", replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with your own python path (two places in total).
+ ```
+ 5. In "/VisRAG/src/openmatch/arguments.py", change "*cache_dir" to your own path (search for /mnt/191/ with Ctrl+F; there are two occurrences).
+
+ 6. Download our model (VisIR) (LLM + vpm1 + vpm2 + zero_linear)
+ ```
+ a. VisIR (clean, untrained, zero_linear all zeros): "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" (the whole folder is needed)
+ b. VisIR (stage1): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-065530-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
+ c. VisIR (stage1+2): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-194437-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
+ ```
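Steps 4 and 5 rely on finding every hard-coded path by hand. As a convenience, the following sketch lists the lines in `.sh` and `.py` files that still contain one of the machine-specific prefixes. The function name `find_hardcoded_paths` is hypothetical and not part of the repository; it only reports matches and changes nothing.

```python
from pathlib import Path


def find_hardcoded_paths(root, needles, suffixes=(".sh", ".py")):
    """Return (file, line_number, line) for every line containing one of the needles."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(needle in line for needle in needles):
                hits.append((str(path), number, line.strip()))
    return hits
```

A possible call, assuming the working directory contains the cloned `VisRAG` folder, is `find_hardcoded_paths("VisRAG", ["/mnt/VENV/user-conda/r12943109", "/mnt/191/"])`. Each returned tuple points at a line that still needs editing.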
+
+ ## Eval commands
+ The eval scripts differ as follows:
+ ```
+ There are five eval scripts, each for a different test set:
+ 1. eval.sh: evaluates the clean ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
+ 2. eval2.sh: evaluates the degraded ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
+ 3. evalreal.sh: evaluates the clean rvl-cdip
+ 4. evalreal2.sh: evaluates the realworld rvl-cdip
+ 5. evalreal3.sh: evaluates the realworld MP-DocVQA
+ In practice they differ only in CHECKPOINT_DIR (where the eval results are saved), CORPUS_PATH, QUERY_PATH, and QRELS_PATH.
+ ```
+ How to run eval
+ ```
+ PYTHONNOUSERSITE=1 bash <eval script path> 512 2048 <batch size per GPU> <number of GPUs> wmean causal <test dataset names> <output subfolder name (for identification; otherwise only the date is used)> <model dir path> <GOAT LoRA safetensors path; leave empty if there is none; it must point to the .safetensors file, not a directory>
+ ```
+ Using eval.sh and eval2.sh
+ ```
+ <test dataset names> accepts up to six datasets, "ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA", separated by commas
+
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
+
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
+
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" "/goat/lora/path/if-you-have-one.safetensors"
+ ```
+ Using evalreal.sh, evalreal2.sh, and evalreal3.sh
+ ```
+ <test dataset names> does not matter here, but at least one must be given. The script does not use it to select data; it is only used for naming. "ChartQA" is recommended; if you pass six names, the same evaluation is repeated six times.
+
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal2.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal3.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
+ ```
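Because every eval script takes the same positional arguments, the command line can be assembled programmatically, which helps avoid argument-order mistakes. The sketch below follows only the order shown in the usage lines above; `build_eval_command` is a hypothetical helper, and the `.safetensors` check simply mirrors the note that the LoRA argument must be a file rather than a directory.

```python
import shlex


def build_eval_command(script, per_gpu_batch, num_gpus, datasets,
                       run_name, model_dir, lora_path=None):
    """Assemble the positional arguments in the order the eval scripts expect."""
    args = [
        "bash", script,
        "512", "2048",                 # fixed values used in every example above
        str(per_gpu_batch), str(num_gpus),
        "wmean", "causal",             # fixed values used in every example above
        ",".join(datasets),            # comma-separated dataset names
        run_name, model_dir,
    ]
    if lora_path:
        # The optional last argument must be the .safetensors file itself.
        if not lora_path.endswith(".safetensors"):
            raise ValueError("lora_path must point to a .safetensors file")
        args.append(lora_path)
    return "PYTHONNOUSERSITE=1 " + shlex.join(args)


print(build_eval_command(
    "scripts/eval_retriever/eval.sh", 16, 2,
    ["ArxivQA", "ChartQA"], "clean_visir",
    "/mnt/191/a/lyw/VisRAG/openbmb/VisIR",
))
```

`shlex.join` quotes any argument that contains spaces or non-ASCII characters, so a run name such as the one used in the examples above is passed to the shell safely.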
+ Eval helper script (if you forgot to record the results, they can be recomputed as long as the .trec files exist)
+ ```
+ Use "/src/把visrag_eval_trec_算出三個分.py".
+ Suppose the outputs are:
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ArxivQA
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ChartQA
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/MP-DocVQA
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/InfoVQA
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/PlotQA
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/SlideVQA
+
+ Only these three variables in the script need changing:
+ ROOT=/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/
+ CACHE_DIR = "/path/to/save/your/datasets/"
+ 要看什麼 = 'recall_10' or 要看什麼 = 'recip_rank'
+
+ => the results can then be reproduced
+ ```
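The helper script itself is not shown here, so the following is only a sketch of how the two named metrics could be recomputed from a run file. It assumes the .trec files use the standard TREC run layout (`qid Q0 docid rank score tag`) and that qrels are available as a mapping from query id to a set of relevant corpus ids. All function names are made up for illustration.

```python
from collections import defaultdict


def parse_trec(lines):
    """Parse 'qid Q0 docid rank score tag' lines into {qid: [docid, ...]} sorted by score."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        qid, _, docid, _, score, _ = parts[:6]
        scored[qid].append((float(score), docid))
    return {
        qid: [doc for _, doc in sorted(pairs, key=lambda p: -p[0])]
        for qid, pairs in scored.items()
    }


def recall_at_k(run, qrels, k=10):
    """Mean fraction of relevant documents found in the top k, over the queries in qrels."""
    values = []
    for qid, relevant in qrels.items():
        top = set(run.get(qid, [])[:k])
        values.append(len(top & relevant) / len(relevant))
    return sum(values) / len(values)


def mean_reciprocal_rank(run, qrels):
    """Mean of 1/rank of the first relevant document (0 when none is retrieved)."""
    values = []
    for qid, relevant in qrels.items():
        rr = 0.0
        for rank, doc in enumerate(run.get(qid, []), start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        values.append(rr)
    return sum(values) / len(values)


# Tiny made-up run: q1 finds its relevant page at rank 2, q2 at rank 1.
sample_run = parse_trec([
    "q1 Q0 a.png 1 0.9 demo",
    "q1 Q0 b.png 2 0.8 demo",
    "q2 Q0 c.png 1 0.7 demo",
    "q2 Q0 d.png 2 0.6 demo",
])
sample_qrels = {"q1": {"b.png"}, "q2": {"c.png"}}
print(recall_at_k(sample_run, sample_qrels, k=10))
print(mean_reciprocal_rank(sample_run, sample_qrels))
```

Averaging conventions differ between tools (for example, whether queries missing from the run count as zero), so numbers from this sketch may not match the original script exactly.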
+
+ ## Train commands
+ Unlike eval, training uses a single script: "/VisRAG/scripts/train_retriever/train.sh"
+ ```
+ The only thing in train.sh that needs manual editing is --save_steps 500: set how many steps should pass between saved checkpoints.
+ After training, the results are saved under /VisRAG/checkpoints/...
+ ```
+ General usage of the train command
+ ```
+ (Arguments that never need changing are left unlabeled.)
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
+ <batch size per GPU> <number of GPUs> 0.02 1 true false config/deepspeed.json \
+ 1e-5 false wmean causal 1 true <minibatch> false \
+ <clean model dir> <dataset name> \
+ <use degradation> <train on the small subset only> <stage1 or stage2> <LoRA used>
+
+ 2048 is the number of tokens used per passage
+ 0.02 is the training temperature, following VisRAG
+ 1e-5 is the learning rate, following VisRAG
+ <dataset name>: one of two, "openbmb/VisRAG-Ret-Train-In-domain-data" or "openbmb/VisRAG-Ret-Train-Synthetic-data"
+ <minibatch>: can be 1, 2, 3, 4, ...; the batch-size samples are not run through the model at once but are accumulated in chunks of <minibatch>. Larger values use more VRAM.
+ <use degradation>: true or false; just use true. It controls whether the data is degraded during fine-tuning.
+ <train on the small subset only>: true or false; just use false. With true, only the first 30k samples are trained on; the 30k can be changed in "/VisRAG/src/openmatch/dataset/train_dataset.py".
+ <stage1 or stage2>: "stage1" or "stage2"; marks which stage is being trained. Make sure it is spelled exactly.
+ <LoRA used>: "None" or any non-"None" string; marks whether LoRA is used.
+ ```
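To make the role of the 0.02 temperature concrete, here is a small sketch of a softmax contrastive loss over in-batch candidates. It assumes that is how the temperature is used, which the `softm_temp-0.02` and `inbatch` parts of the checkpoint names suggest but this note does not confirm. The function name is hypothetical and the sketch is plain Python rather than the repository's training code.

```python
import math


def in_batch_contrastive_loss(scores, temperature=0.02):
    """Mean cross-entropy where the positive passage for query i is column i.

    scores[i][j] is the similarity between query i and passage j.
    """
    total = 0.0
    for i, row in enumerate(scores):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(scores)


# A well-separated batch gives a loss near zero; an uninformative one gives log(2).
print(in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]]))
print(in_batch_contrastive_loss([[0.5, 0.5], [0.5, 0.5]]))
```

A small temperature such as 0.02 multiplies score differences by 50 before the softmax, so even modest gaps between the positive and the negatives dominate the loss.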
+ stage1 training
+ ```
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
+ 16 2 0.02 1 true false config/deepspeed.json \
+ 1e-5 false wmean causal 1 true 1 false \
+ '/mnt/191/a/lyw/VisRAG/openbmb/VisIR' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
+ true false 'stage1' 'None'
+ ```
+ stage2 training
+ ```
+ e.g.
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
+ 16 2 0.02 1 true false config/deepspeed.json \
+ 1e-5 false wmean causal 1 true 1 false \
+ '/path/to/stage1/trained/model/dir' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
+ true false 'stage2' 'use_lora'
+ ```
+
+
+ # Code overview
+
+ ## GOAT config:
+ ```
+ Note that the GOAT config parameters, such as the number of experts, are set in the code.
+ See build() in "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py".
+ It contains the line goat_config = GoatV3Config(), so for now you have to keep your own record of how the config was changed.
+ Alternatively, change the default parameters of GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py".
+ ```
+
151
+ ## 訓練時的程式碼:
152
+ ```
153
+ 1. "/VisRAG/src/openmatch/driver/train.py": the main training script.
154
+ 2. "/VisRAG/src/openmatch/arguments.py": lists the available arguments.
155
+ 3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": the model definition.
156
+ 4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation.
157
+ 5. "/VisRAG/src/openmatch/dataset/train_dataset.py", line 213 (def get_process_fn): defines how the Hugging Face training dataset is processed.
158
+ 6. "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py", build(): defines how the model to be trained is obtained.
159
+ Note: the retrieval model is first wrapped in DRModel, so a call to model(...) in the code is actually handled by DRModel's forward().
160
+ See DRModel's forward() and encode().
161
+ 7. "/VisRAG/src/openmatch/trainer/dense_trainer.py", _save(...): defines how the model is saved.
162
+ 8. "/VisRAG/src/openmatch/trainer/dense_trainer.py", training_step(...): defines the training step.
163
+ Because it works in mini-batches, it first runs inference over the whole batch once to build a cache, then runs inference again to compute the loss.
164
+ ```
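The two-pass behaviour described for training_step(...) resembles gradient caching. The toy sketch below illustrates the idea only and is not the code in dense_trainer.py: the scalar "encoder" `w * x`, the squared-error loss, and all function names are assumptions chosen so the gradients can be written by hand in plain Python.

```python
def encode(w, xs):
    """Toy encoder: one scalar weight applied to each input."""
    return [w * x for x in xs]


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def two_pass_gradient(w, xs, target, chunk_size):
    # Pass 1: encode every chunk and keep only the outputs (the "cache").
    cache = []
    for chunk in chunks(xs, chunk_size):
        cache.extend(encode(w, chunk))

    # The loss needs all representations at once: L = 0.5 * (sum(r) - target)^2.
    residual = sum(cache) - target
    loss = 0.5 * residual ** 2
    # dL/dr_i equals the residual for every i in this toy loss.
    rep_grads = [residual] * len(cache)

    # Pass 2: revisit each chunk and push the cached gradient through the
    # encoder. A real model would repeat the forward pass with gradients
    # enabled here; for r_i = w * x_i the derivative dr_i/dw = x_i is known.
    grad_w = 0.0
    offset = 0
    for chunk in chunks(xs, chunk_size):
        for j, x in enumerate(chunk):
            grad_w += rep_grads[offset + j] * x
        offset += len(chunk)
    return loss, grad_w


if __name__ == "__main__":
    print(two_pass_gradient(2.0, [1.0, 2.0, 3.0, 4.0], target=5.0, chunk_size=2))
```

The point of the split is that the result does not depend on `chunk_size`: the loss sees the full batch, while only one chunk at a time has to go through the encoder in the second pass.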
165
+
166
+ ## Evaluation code:
167
+ ```
168
+ 1. "/VisRAG/src/openmatch/driver/eval.py": the main inference script.
169
+ 2. "/VisRAG/src/openmatch/arguments.py": lists the available arguments.
170
+ train_args is not used during eval, so "if train_args:" can be used to tell training apart from evaluation.
171
+ 3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": the model definition.
172
+ 4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation.
173
+ 5. "/VisRAG/src/openmatch/dataset/inference_dataset.py": data processing for inference.
174
+ Note: a "special handling" variable was added here; it rotates all images in the real-world datasets upright.
175
+ 6. "/VisRAG/src/openmatch/retriever/dense_retriever.py": defines how embeddings are stored first and scores are computed at the end.
176
+ ```
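The "store embeddings first, score at the end" flow in item 6 can be sketched as follows. This is an illustration under stated assumptions rather than the logic in dense_retriever.py: embeddings are plain Python lists, scoring is a dot product, and `rank_corpus` is a hypothetical name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_corpus(query_embeddings, corpus_embeddings, top_k=2):
    """Score every stored corpus embedding against each query and keep the top_k.

    Both arguments map an id to an embedding that was computed and stored earlier.
    """
    results = {}
    for query_id, q in query_embeddings.items():
        scored = [(corpus_id, dot(q, d)) for corpus_id, d in corpus_embeddings.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[query_id] = scored[:top_k]
    return results


if __name__ == "__main__":
    corpus = {"a.png": [1.0, 0.0], "b.png": [0.0, 1.0], "c.png": [0.5, 0.5]}
    queries = {"q1": [1.0, 0.0]}
    print(rank_corpus(queries, corpus))
```

Separating encoding from scoring means the corpus only has to be embedded once, no matter how many queries are evaluated against it.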
177
+
178
+ ## Model architecture
179
+ ```
180
+ For the clean model without GOAT, see "/mnt/191/a/lyw/visrag2/VisIR(模型架構).txt"
181
+ For the model with GOAT, see "/mnt/191/a/lyw/visrag2/VisIR+GOAT(模型架構).txt"
182
+ To adjust the GOAT architecture, change either
183
+ GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py", or
184
+ the loading logic directly in build(...) in "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py"
185
+ ```
scripts/dataset.py ADDED
@@ -0,0 +1,254 @@
1
+ # datasets
2
+ """
3
+ # How to load the train datasets:
4
+ from datasets import load_dataset
5
+ db = load_dataset(...)["train"]
6
+ for x in db:
7
+ # x is a dict, e.g.
8
+ # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile\
9
+ # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
10
+ ...
11
+ ## How to load the test datasets:
12
+ from datasets import load_dataset
13
+ dbcorpus = load_dataset(..., "corpus")["train"]
14
+ dbqrels = load_dataset(..., "qrels")["train"]
15
+ dbqueries = load_dataset(..., "queries")["train"]
16
+ ## For image datasets
17
+ for x in dbcorpus:
18
+ # x is a dict, e.g.
19
+ # {"corpus-id": "image id", "image": <PIL.PngImagePlugin.PngImageFile\
20
+ # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
21
+ ...
22
+ for x in dbqrels:
23
+ # x is a dict, e.g.
24
+ # {"query-id": "query id", "corpus-id": "image id"}
25
+ ...
26
+ for x in dbqueries:
27
+ # x is a dict, e.g.
28
+ # {"query-id": "query id", "query": "question text", "answer": "answer to the question"}
29
+ ...
30
+ ## For OCR datasets
31
+ for x in dbcorpus:
32
+ # x is a dict, e.g.
33
+ # {"corpus-id": "6519.png", "text": "string to describe a photo"}
34
+ ...
35
+ for x in dbqrels:
36
+ # x is a dict, e.g.
37
+ # {"query-id": "query id", "corpus-id": "image id"}
38
+ ...
39
+ for x in dbqueries:
40
+ # x is a dict, e.g.
41
+ # {"query-id": "query id", "query": "question text", "answer": "answer to the question"}
42
+ ...
43
+ """
44
+
45
+ from datasets import load_dataset
46
+
47
+ save_root = r"/group-volume/Human-Action-Analysis/users/hsiang.chen/Robust/datasets/"
48
+ # Train datasets:
49
+ ## In-domain dataset (122k samples) from arxiv, plotqa, ...
50
+ load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir=save_root)["train"]
51
+ ## Synthetic dataset (239k samples)
52
+ load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir=save_root)["train"]
53
+
54
+ # Test datasets: each one has 3 subsets (corpus, qrels, queries) and comes in an image version and an OCR version
55
+ # Image version
56
+ ## Clean PlotQA
57
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir=save_root)["train"]
58
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
59
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
60
+ ## Clean SlideVQA
61
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir=save_root)["train"]
62
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
63
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
64
+ ## Clean InfoVQA
65
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir=save_root)["train"]
66
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
67
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
68
+ ## Clean ArxivQA
69
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir=save_root)["train"]
70
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
71
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
72
+ ## Clean ChartQA
73
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir=save_root)["train"]
74
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
75
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
76
+ ## Clean MP-DocVQA
77
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir=save_root)["train"]
78
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
79
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
80
+ ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
81
+ load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir=save_root)["train"]
82
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
83
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
84
+ ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
85
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir=save_root)["train"]
86
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
87
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
88
+ ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
89
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir=save_root)["train"]
90
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
91
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
92
+ ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
93
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir=save_root)["train"]
94
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
95
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
96
+ ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
97
+ load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir=save_root)["train"]
98
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
99
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
100
+ ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
101
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir=save_root)["train"]
102
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
103
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
104
+
105
+ ## RVL-CDIP (3k), clean
106
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
107
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
108
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir=save_root)["train"]
109
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
110
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir=save_root)["train"]
111
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir=save_root)["train"]
112
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir=save_root)["train"]
113
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
114
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
115
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
116
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir=save_root)["train"]
117
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
118
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
119
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
120
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir=save_root)["train"]
121
+
122
+ # # OCR version (PPOCR-v5)
123
+ # ## Clean PlotQA
124
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir=save_root)["train"]
125
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
126
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
127
+ # ## Clean SlideVQA
128
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir=save_root)["train"]
129
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
130
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
131
+ # ## Clean InfoVQA
132
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir=save_root)["train"]
133
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
134
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
135
+ # ## Clean ArxivQA
136
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir=save_root)["train"]
137
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
138
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
139
+ # ## Clean ChartQA
140
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir=save_root)["train"]
141
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
142
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
143
+ # ## Clean MP-DocVQA
144
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir=save_root)["train"]
145
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
146
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
147
+ # ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
148
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir=save_root)["train"]
149
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
150
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
151
+ # ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
152
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir=save_root)["train"]
153
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
154
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
155
+ # ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
156
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir=save_root)["train"]
157
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
158
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
159
+ # ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
160
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir=save_root)["train"]
161
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
162
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
163
+ # ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
164
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir=save_root)["train"]
165
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
166
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
167
+ # ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
168
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir=save_root)["train"]
169
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
170
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
171
+
172
+ # ## RVL-CDIP (3k), clean
173
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
174
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
175
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
176
+ # ## rvl cdip (REALWORLD) (3k) degraded realworld
177
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir=save_root)["train"]
178
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir=save_root)["train"]
179
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir=save_root)["train"]
180
+ # ## MP-DocVQA (REALWORLD) (741) degraded realworld
181
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
182
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
183
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
184
+ # ## ArxivQA (REALWORLD) (3000) degraded realworld
185
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
186
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
187
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
188
+
189
+ # OCR version (PPOCR-v3)
190
+ ## Clean PlotQA
191
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir=save_root)["train"]
192
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
193
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
194
+ ## Clean SlideVQA
195
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir=save_root)["train"]
196
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
197
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
198
+ ## Clean InfoVQA
199
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir=save_root)["train"]
200
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
201
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
202
+ ## Clean ArxivQA
203
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir=save_root)["train"]
204
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
205
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
206
+ ## Clean ChartQA
207
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir=save_root)["train"]
208
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
209
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
210
+ ## Clean MP-DocVQA
211
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir=save_root)["train"]
212
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
213
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
214
+ ## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
215
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
216
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
217
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
218
+ ## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
219
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
220
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
221
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
222
+ ## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
223
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
224
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
225
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
226
+ ## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
227
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
228
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
229
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
230
+ ## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
231
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
232
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
233
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
234
+ ## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
235
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
236
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
237
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
238
+
239
+ ## RVL-CDIP (3k), clean
240
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
241
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
242
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
243
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
244
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir=save_root)["train"]
245
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir=save_root)["train"]
246
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir=save_root)["train"]
247
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
248
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
249
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
250
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
251
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
252
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
253
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
254
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
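The image-version benchmark calls in the script follow one pattern: the corpus comes from either the clean or the degraded repository, while qrels and queries always come from the clean one. As a hedged sketch, that pattern can be written as a small generator of (repo id, subset name) pairs. The repository names are taken from the script; `image_test_sets` and the constant names are hypothetical.

```python
CLEAN_REPO = "openbmb/VisRAG-Ret-Test-{name}"
DEGRADED_REPO = "rweics5cs7/exo3-original-{name}-deg"
NAMES = ["PlotQA", "SlideVQA", "InfoVQA", "ArxivQA", "ChartQA", "MP-DocVQA"]


def image_test_sets(degraded: bool):
    """Return (repo_id, subset_name) pairs for the six benchmark test sets."""
    pairs = []
    for name in NAMES:
        clean = CLEAN_REPO.format(name=name)
        corpus_repo = DEGRADED_REPO.format(name=name) if degraded else clean
        pairs.append((corpus_repo, "corpus"))
        # qrels and queries always come from the clean repository.
        pairs.append((clean, "qrels"))
        pairs.append((clean, "queries"))
    return pairs


if __name__ == "__main__":
    for repo_id, subset in image_test_sets(degraded=True):
        print(repo_id, subset)
```

Each pair could then be passed to `load_dataset(repo_id, subset, cache_dir=save_root)["train"]` exactly as the script does, which also avoids copy-paste slips in individual lines.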