AaronCIH commited on
Commit
4f523e3
·
verified ·
1 Parent(s): ea87341

Upload folder using huggingface_hub

Files changed (3)
  1. Project/Doc/code.md +195 -0
  2. Project/Doc/data.md +390 -0
  3. Project/Doc/dataset.py +265 -0
Project/Doc/code.md ADDED
@@ -0,0 +1,195 @@
1
+
2
+ # Getting it running
3
+ conda activate /group-volume/Human-Action-Analysis/users/hsiang.chen/envs/VisRAG
4
+ pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
5
+ pip install transformers==4.40.2
6
+ conda install -c nvidia cuda-toolkit=11.8 cuda-nvcc=11.8
7
+ If installing PyMuPDF fails:
8
+ python -m pip install -U pip setuptools wheel packaging build
9
+ pip cache purge
10
+ pip install --only-binary=:all: PyMuPDF
11
+ Install the rest of requirements.txt
12
+
13
+
14
+ ## Installing the VisRAG environment
15
+ Install VisRAG.
16
+ ```
17
+ git clone https://github.com/OpenBMB/VisRAG.git
18
+ conda create --name VisRAG python==3.10.8
19
+ conda activate VisRAG
20
+ conda install nvidia/label/cuda-11.8.0::cuda-toolkit
21
+ cd VisRAG
22
+ pip install -r requirements.txt
23
+ pip install -e .
24
+ cd timm_modified
25
+ pip install -e .
26
+ cd ..
27
+ ```
28
+ 0. Rename or delete the original "/VisRAG/src" and "/VisRAG/scripts", and replace them with my versions: "/mnt/191/a/lyw/visrag2/VisRAG/src" and "/mnt/191/a/lyw/visrag2/VisRAG/scripts"
29
+ 1. pip install peft==0.10.0 # LoRA (as I recall, versions that are too new fail to run; 0.10.0 works)
30
+ 2. pip install scikit-image==0.25.2 # needed for degradation
31
+ 3. pip install opencv-python==4.11.0.86 # needed for degradation
32
+
33
+ 4. Update the paths in the training and eval scripts
34
+ ```
35
+ a. In "/VisRAG/scripts/train_retriever/train.sh", replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/lib" and "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with the paths of your own conda environment (two places in total).
36
+ b. In eval.sh, eval2.sh, evalreal.sh, evalreal2.sh, and evalreal3.sh under "/VisRAG/scripts/eval_retriever", replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with your own python path (two places in total).
37
+ ```
38
+ 5. In "/VisRAG/src/openmatch/arguments.py", change "*cache_dir" to your own path (search for /mnt/191/ with Ctrl+F; there are two occurrences)
39
+
40
+
41
+ 6. Download our model (VisIR) (LLM + vpm1 + vpm2 + zero_linear)
42
+ ```
43
+ a. VisIR (clean, untrained, zero_linear all zeros): "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" (the whole folder is needed)
44
+ b. VisIR (stage1): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-065530-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
45
+ c. VisIR (stage1+2): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-194437-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
46
+ ```
47
+
48
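+ ## Eval commands
49
+ The eval scripts differ as follows:
50
+ ```
51
+ There are five eval scripts, one for each of five test settings
52
+ 1. eval.sh: evaluates the clean ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
53
+ 2. eval2.sh: evaluates the degraded ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
54
+ 3. evalreal.sh: evaluates the clean rvl-cdip
55
+ 4. evalreal2.sh: evaluates the real-world rvl-cdip
56
+ 5. evalreal3.sh: evaluates the real-world MP-DocVQA
57
+ In practice they differ only in CHECKPOINT_DIR (where the eval results are saved), CORPUS_PATH, QUERY_PATH, and QRELS_PATH
58
+ ```
59
+ How to run eval
60
+ ```
61
+ PYTHONNOUSERSITE=1 bash <path to the eval script> 512 2048 <batch size per GPU> <number of GPUs> wmean causal <dataset name(s) to test> <output subfolder name (for identification; otherwise folders are named only by date)> <model dir path> <path to the GOAT LoRA safetensors; leave empty if there is none; note that it must point to the .safetensors file, not a directory>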
62
+ ```
63
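+ Usage of eval.sh and eval2.sh
64
+ ```
65
+ <dataset name(s) to test> accepts up to six datasets, "ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA", separated by commas
66
+ (the output subfolder name 乾淨visir測試 used in the examples means "clean VisIR test")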
67
+ e.g.
68
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
69
+
70
+ e.g.
71
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
72
+
73
+ e.g.
74
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" "/goat/lora/path/if-you-have-one.safetensors"
75
+ ```
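The positional arguments above are easy to get out of order, so the following is a minimal sketch of a helper that assembles them. `build_eval_command` and its parameter names are hypothetical (they are not part of the repository); the fixed values 512, 2048, wmean, and causal are simply copied from the usage line above.

```python
import shlex


def build_eval_command(script, per_gpu_batch, num_gpus, datasets, run_name,
                       model_dir, goat_lora=None):
    """Assemble the positional arguments in the order shown in the usage line.

    Hypothetical helper for illustration only; it does not run anything.
    """
    if goat_lora is not None and not goat_lora.endswith(".safetensors"):
        # The notes say this must be the .safetensors file, not a directory.
        raise ValueError("goat_lora must point to a .safetensors file")
    args = [
        "bash", script, "512", "2048", str(per_gpu_batch), str(num_gpus),
        "wmean", "causal", ",".join(datasets), run_name, model_dir,
    ]
    if goat_lora is not None:
        args.append(goat_lora)
    return args


cmd = build_eval_command(
    "scripts/eval_retriever/eval.sh", 16, 2,
    ["ArxivQA", "ChartQA"], "clean-visir-test",
    "/mnt/191/a/lyw/VisRAG/openbmb/VisIR",
)
# Print a copy-pasteable shell line, including the environment variable.
print("PYTHONNOUSERSITE=1 " + shlex.join(cmd))
```

The returned list could also be passed to `subprocess.run` with `PYTHONNOUSERSITE=1` set in the environment, if launching from Python is preferred.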
76
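+ Usage of evalreal.sh, evalreal2.sh, and evalreal3.sh
77
+ ```
78
+ <dataset name(s) to test> does not affect which data is evaluated, but at least one name must be given. The script does not use it as an input; it is only used for naming the output. "ChartQA" is recommended; if you pass six names, the same evaluation is repeated six times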
79
+
80
+ e.g.
81
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
82
+ e.g.
83
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal2.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
84
+ e.g.
85
+ PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal3.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
86
+ ```
87
+ Eval helper script (if you forgot to record the results, they can be recomputed as long as the .trec files exist)
88
+ ```
89
+ Use "/src/把visrag_eval_trec_算出三個分.py".
90
+ Suppose the outputs are:
91
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ArxivQA
92
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ChartQA
93
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/MP-DocVQA
94
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/InfoVQA
95
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/PlotQA
96
+ /VisRAG/evaloutput/乾淨visir測試/eval-...-abc/SlideVQA
97
+
98
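+ Only these three settings in the script need to be changed:
99
+ ROOT=/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/
100
+ CACHE_DIR = "/path/to/save/your/datasets/"
101
+ 要看什麼 = 'recall_10' or 要看什麼 = 'recip_rank' (要看什麼 is the variable that selects the metric to report)
102
+
103
+ => this reproduces the results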
104
+ ```
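To show what recomputing scores from a .trec file involves, here is a minimal sketch that parses run lines in the standard TREC format (`query-id Q0 doc-id rank score tag`) and computes recall@10 and mean reciprocal rank. It is a stand-in written for illustration, not the helper script named above; the function names and the tiny in-memory run are assumptions.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Return {query-id: [doc-id, ...]} ranked by descending score."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = parts[:6]
        scored[qid].append((float(score), docid))
    return {
        qid: [doc for _, doc in sorted(docs, key=lambda p: -p[0])]
        for qid, docs in scored.items()
    }


def recall_at_k(run, qrels, k=10):
    """Mean fraction of relevant documents found in the top k."""
    values = []
    for qid, relevant in qrels.items():
        top = set(run.get(qid, [])[:k])
        values.append(len(relevant & top) / len(relevant))
    return sum(values) / len(values)


def mean_reciprocal_rank(run, qrels):
    """Mean of 1 / rank of the first relevant document (0 if none found)."""
    values = []
    for qid, relevant in qrels.items():
        rr = 0.0
        for rank, doc in enumerate(run.get(qid, []), start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        values.append(rr)
    return sum(values) / len(values)


# Made-up run and qrels, only to exercise the functions.
run_lines = [
    "q1 Q0 d1 1 0.9 demo",
    "q1 Q0 d2 2 0.5 demo",
    "q2 Q0 d3 1 0.8 demo",
    "q2 Q0 d4 2 0.7 demo",
]
qrels = {"q1": {"d1"}, "q2": {"d4"}}
run = parse_trec_run(run_lines)
print(recall_at_k(run, qrels, k=10), mean_reciprocal_rank(run, qrels))
```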
105
+
106
+ ## Train commands
107
+ Unlike eval, training uses a single script: "/VisRAG/scripts/train_retriever/train.sh"
108
+ ```
109
+ The only thing to edit by hand in train.sh is --save_steps 500: set how many steps should pass between checkpoint saves
110
+ After training, the checkpoints are saved under /VisRAG/checkpoints/...
111
+ ```
112
+ General usage of the train command
113
+ ```
114
+ "Arguments that never need to change are left unlabeled"
115
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
116
+ <batch size per GPU> <number of GPUs> 0.02 1 true false config/deepspeed.json \
117
+ 1e-5 false wmean causal 1 true <minibatch> false \
118
+ <clean model dir> <dataset name> \
119
+ <use degradation?> <train on the small subset only?> <stage1 or stage2> <LoRA in use?>
120
+
121
+ 2048 is the number of tokens used per passage
122
+ 0.02 is the training temperature, following VisRAG
123
+ 1e-5 is the learning rate, following VisRAG
124
+ <dataset name>: one of two, "openbmb/VisRAG-Ret-Train-In-domain-data" or "openbmb/VisRAG-Ret-Train-Synthetic-data"
125
+ <minibatch>: can be 1,2,3,4,...; the batchsize samples are actually processed by accumulating chunks of minibatch samples. Larger values use more VRAM
126
+ <use degradation?>: true or false; just set true. Controls whether the data is degraded during fine-tuning
127
+ <train on the small subset only?>: true or false; just set false. With true, only the first 30k samples are used for training; the 30k can be changed in "/VisRAG/src/openmatch/dataset/train_dataset.py"
128
+ <stage1 or stage2>: "stage1" or "stage2"; marks which stage is being trained. Be careful not to mistype it
129
+ <LoRA in use?>: "None" or any non-None string; marks whether LoRA is used
130
+ ```
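Since the stage name must not be mistyped and most positions are fixed, a small wrapper can validate the variable arguments before launching. The sketch below is hypothetical (`build_train_command` is not in the repository); the fixed values and their order are copied from the usage block above.

```python
VALID_DATASETS = {
    "openbmb/VisRAG-Ret-Train-In-domain-data",
    "openbmb/VisRAG-Ret-Train-Synthetic-data",
}


def build_train_command(per_gpu_batch, num_gpus, minibatch, model_dir, dataset,
                        use_degradation=True, small_subset=False,
                        stage="stage1", lora="None"):
    """Return the train.sh argument list; hypothetical helper for illustration."""
    if stage not in ("stage1", "stage2"):
        raise ValueError("stage must be 'stage1' or 'stage2'")
    if dataset not in VALID_DATASETS:
        raise ValueError("unknown training dataset: " + dataset)
    if minibatch < 1:
        raise ValueError("minibatch must be a positive integer")

    def flag(value):
        # train.sh takes lowercase true/false strings.
        return "true" if value else "false"

    return [
        "bash", "scripts/train_retriever/train.sh", "2048",
        str(per_gpu_batch), str(num_gpus), "0.02", "1", "true", "false",
        "config/deepspeed.json",
        "1e-5", "false", "wmean", "causal", "1", "true", str(minibatch), "false",
        model_dir, dataset,
        flag(use_degradation), flag(small_subset), stage, lora,
    ]


train_cmd = build_train_command(
    16, 2, 1,
    "/mnt/191/a/lyw/VisRAG/openbmb/VisIR",
    "openbmb/VisRAG-Ret-Train-In-domain-data",
)
print("PYTHONNOUSERSITE=1 " + " ".join(train_cmd))
```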
131
+ stage1 training
132
+ ```
133
+ e.g.
134
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
135
+ 16 2 0.02 1 true false config/deepspeed.json \
136
+ 1e-5 false wmean causal 1 true 1 false \
137
+ '/mnt/191/a/lyw/VisRAG/openbmb/VisIR' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
138
+ true false 'stage1' 'None'
139
+ ```
140
+ stage2 training
141
+ ```
142
+ e.g.
143
+ PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
144
+ 16 2 0.02 1 true false config/deepspeed.json \
145
+ 1e-5 false wmean causal 1 true 1 false \
146
+ '/path/to/stage1/trained/model/dir' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
147
+ true false 'stage2' 'use-lora'
148
+ ```
149
+
150
+
151
+ # Code
152
+
153
+ ## GOAT config:
154
+ ```
155
+ Note that the GOAT config parameters, such as the number of experts, are set in the code
156
+ See "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py": build()
157
+ It contains the line goat_config = GoatV3Config(), so for now you have to keep your own record of how the config was adjusted
158
+ Alternatively, change the default parameters of GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py"
159
+ ```
160
+
161
+ ## Training code:
162
+ ```
163
+ 1. "/VisRAG/src/openmatch/driver/train.py": the main training script
164
+ 2. "/VisRAG/src/openmatch/arguments.py": look here to check which arguments exist
165
+ 3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": my model definition
166
+ 4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation
167
+ 5. "/VisRAG/src/openmatch/dataset/train_dataset.py":line213(def get_process_fn): defines how I process the Hugging Face training dataset
168
+ 6. "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py": build(): defines how I obtain the model to train
169
+ Note: the retrieval model is first wrapped in DRModel, so a call to model(...) in the code is actually handled by DRModel's forward
170
+ See DRModel's forward() & encode()
171
+ 7. "/VisRAG/src/openmatch/trainer/dense_trainer.py": _save(...): defines how I save the model
172
+ 8. "/VisRAG/src/openmatch/trainer/dense_trainer.py": training_step(...): defines the training step
173
+ Because minibatches are used, it first runs inference over the whole batch once to build the cache, then runs inference again to compute the loss
174
+ ```
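The two-pass procedure mentioned in item 8 can be illustrated with a toy example. This is only a sketch of the general gradient-caching idea, not the code in dense_trainer.py: it assumes a hypothetical scalar encoder `e_i = w * x_i` and a toy loss `L = (sum_i e_i)^2 / 2`, chosen only because it depends on the whole batch, and it computes derivatives by hand instead of using an autograd library.

```python
def chunks(items, size):
    """Split a batch into minibatches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def two_pass_gradient(w, xs, minibatch):
    """Compute dL/dw for the toy loss while encoding one minibatch at a time."""
    # Pass 1: encode every minibatch without keeping anything for backprop.
    embeddings = []
    for chunk in chunks(xs, minibatch):
        embeddings.extend(w * x for x in chunk)

    # Full-batch step: gradient of the loss with respect to each embedding.
    # For L = (sum_j e_j)^2 / 2, dL/de_i = sum_j e_j. This list is the cache.
    total = sum(embeddings)
    cache = [total for _ in embeddings]

    # Pass 2: re-encode one minibatch at a time and accumulate dL/dw
    # using the cached embedding gradients (de_i/dw = x_i).
    grad_w = 0.0
    offset = 0
    for chunk in chunks(xs, minibatch):
        for k, x in enumerate(chunk):
            grad_w += cache[offset + k] * x
        offset += len(chunk)
    return grad_w


xs = [1.0, 2.0, 3.0, 4.0]
w = 0.5
# Closed form for comparison: L = w^2 * (sum x)^2 / 2, so dL/dw = w * (sum x)^2.
direct = w * sum(xs) ** 2
print(two_pass_gradient(w, xs, minibatch=2), direct)
```

The result does not depend on the minibatch size; only the number of samples encoded at once changes, which is why the minibatch setting trades VRAM for speed.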
175
+
176
+ ## Eval code:
177
+ ```
178
+ 1. "/VisRAG/src/openmatch/driver/eval.py": the main inference script
179
+ 2. "/VisRAG/src/openmatch/arguments.py": look here to check which arguments exist
180
+ train_args is not used during eval; use if train_args: to tell whether the code is running in training
181
+ 3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": my model definition
182
+ 4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation
183
+ 5. "/VisRAG/src/openmatch/dataset/inference_dataset.py": data processing for inference
184
+ Note: I added a "special handling" variable here, which rotates all real-world dataset images upright
185
+ 6. "/VisRAG/src/openmatch/retriever/dense_retriever.py": defines how the embeddings are stored first and the scores are computed at the end
186
+ ```
187
+
188
+ ## Model architecture diagrams
189
+ ```
190
+ For the clean model without GOAT, see "/mnt/191/a/lyw/visrag2/VisIR(模型架構).txt"
191
+ For the model with GOAT added, see "/mnt/191/a/lyw/visrag2/VisIR+GOAT(模型架構).txt"
192
+ To adjust the GOAT architecture, change
193
+ GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py", or
194
+ edit the loading logic directly in build(...) in "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py"
195
+ ```
Project/Doc/data.md ADDED
@@ -0,0 +1,390 @@
1
+ # datasets
2
+
3
+ ## How to load datasets (train):
4
+ ```
5
+ from datasets import load_dataset
6
+ db = load_dataset(...)["train"]
7
+ for x in db:
8
+     # x is a dict, e.g.
9
+     # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile\
10
+     # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
11
+     ...
12
+ ```
13
+ ## How to load datasets (test):
14
+ ```
15
+ from datasets import load_dataset
16
+ dbcorpus = load_dataset(..., "corpus")["train"]
17
+ dbqrels = load_dataset(..., "qrels")["train"]
18
+ dbqueries = load_dataset(..., "queries")["train"]
19
+ ```
20
+ ## For image corpora
21
+ ```
22
+ for x in dbcorpus:
23
+     # x is a dict, e.g.
24
+     # {"corpus-id": "<image id>", "image": <PIL image object>}
25
+     # ex. {'corpus-id': '2010.05458_3.jpg', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1122x551 at 0x7F57A3667790>}
26
+     ...
27
+ for x in dbqrels:
28
+     # x is a dict, e.g.
29
+     # {"query-id": "<query id>", "corpus-id": "<image id>", "score": <relevance score>}
30
+     # ex. {'query-id': '1508.06771_0.jpg-1', 'corpus-id': '1508.06771_0.jpg', 'score': 1}
31
+     ...
32
+ for x in dbqueries:
33
+     # x is a dict, e.g.
34
+     # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
35
+     # ex. {'query-id': '1508.06771_0.jpg-1',
36
+     #      'query': "Which statement best describes the relationship between the components labeled 'Myosin' and 'Actin filament'?", 'answer': 'C',
37
+     #      'options': ['A) Myosin binds directly to crosslinkers.', 'B) Actin filaments are independent of myosin.', 'C) Myosin heads are bound to actin filaments.', 'D) Crosslinkers prevent the interaction between myosin and actin filaments.'], 'is_numerical': 0}
38
+     ...
39
+ ```
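The three parts are linked by ids: each qrels row ties a query-id to a corpus-id. A minimal sketch of joining them is shown below, using small in-memory rows that mimic the schema above instead of downloading anything; `build_relevance_index` is a hypothetical helper name.

```python
from collections import defaultdict


def build_relevance_index(qrels_rows):
    """Map each query-id to the set of corpus-ids marked relevant for it."""
    index = defaultdict(set)
    for row in qrels_rows:
        # Treat a missing score as relevant; skip rows scored 0 or below.
        if row.get("score", 1) > 0:
            index[row["query-id"]].add(row["corpus-id"])
    return dict(index)


# Rows shaped like the examples above (values shortened for illustration).
qrels_rows = [
    {"query-id": "1508.06771_0.jpg-1", "corpus-id": "1508.06771_0.jpg", "score": 1},
]
queries_rows = [
    {"query-id": "1508.06771_0.jpg-1", "query": "Which statement ...?", "answer": "C"},
]

relevance = build_relevance_index(qrels_rows)
for q in queries_rows:
    print(q["query-id"], "->", sorted(relevance.get(q["query-id"], set())))
```

With the real data, the same function can be fed `dbqrels` directly, since iterating a loaded dataset yields dicts of this shape.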
40
+
41
+ ## For OCR datasets
42
+ ```
43
+ for x in dbcorpus:
44
+     # x is a dict, e.g.
45
+     # {"corpus-id": "6519.png", "text": "string to describe a photo"}
46
+     ...
47
+ for x in dbqrels:
48
+     # x is a dict, e.g.
49
+     # {"query-id": "<query id>", "corpus-id": "<image id>"}
50
+     ...
51
+ for x in dbqueries:
52
+     # x is a dict, e.g.
53
+     # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
54
+     ...
55
+ ```
56
+
57
+ # Dataset
58
+ =========================================================================
59
+ Training:
60
+ openbmb/VisRAG-Ret-Train-In-domain-data # 122,752
61
+ openbmb/VisRAG-Ret-Train-Synthetic-data # 239,358
62
+ Testing:
63
+ --Clean:
64
+ openbmb/VisRAG-Ret-Test-PlotQA (corpus, qrels, queries)
65
+ openbmb/VisRAG-Ret-Test-SlideVQA (corpus, qrels, queries)
66
+ openbmb/VisRAG-Ret-Test-InfoVQA (corpus, qrels, queries)
67
+ openbmb/VisRAG-Ret-Test-ArxivQA (corpus, qrels, queries)
68
+ openbmb/VisRAG-Ret-Test-ChartQA (corpus, qrels, queries)
69
+ openbmb/VisRAG-Ret-Test-MP-DocVQA (corpus, qrels, queries)
70
+ --Degradation Image (using clean qrels & queries)
71
+ rweics5cs7/exo3-original-PlotQA-deg (corpus)
72
+ rweics5cs7/exo3-original-SlideVQA-deg (corpus)
73
+ rweics5cs7/exo3-original-InfoVQA-deg (corpus)
74
+ rweics5cs7/exo3-original-ArxivQA-deg (corpus)
75
+ rweics5cs7/exo3-original-ChartQA-deg (corpus)
76
+ rweics5cs7/exo3-original-MP-DocVQA-deg (corpus)
77
+ --Real-World
78
+ rweics5cs7/exo7-realworld-db-combined (corpus, qrels, queries) rvl cdip (3k) clean
79
+ rweics5cs7/exo7-realworld-db-combined-deg (corpus, qrels, queries) rvl cdip (REALWORLD) (3k) degraded
80
+ rweics5cs7/exo9-realworld-db-combined (corpus, qrels, queries) MP-DocVQA (REALWORLD) (741) degraded
81
+ rweics5cs7/exo10-realworld-db-combined (corpus, qrels, queries) ArxivQA (REALWORLD) (3000) degraded
82
+ =========================================================================
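Because the degraded and OCR variants of the six benchmark datasets replace only the corpus and reuse the clean qrels and queries, the repository ids follow a regular pattern. The sketch below encodes that pattern as it appears in this document; `resolve_repos` and the variant labels are hypothetical names, and the real-world sets (exo7 to exo10) are not covered because they ship their own qrels and queries.

```python
CLEAN_PREFIX = "openbmb/VisRAG-Ret-Test-"
VARIANT_PREFIX = "rweics5cs7/exo3-original-"
DATASETS = {"PlotQA", "SlideVQA", "InfoVQA", "ArxivQA", "ChartQA", "MP-DocVQA"}

# Corpus suffix per variant, following the naming used in this document.
CORPUS_SUFFIX = {
    "clean": None,                 # images, clean
    "deg": "-deg",                 # images, synthetic degradation
    "ocr": "-text",                # OCR text (PPOCR-v5), clean
    "ocr-deg": "-text-deg",        # OCR text (PPOCR-v5), degraded
    "ocr-v3": "-text-v3",          # OCR text (PPOCR-v3), clean
    "ocr-deg-v3": "-text-deg-v3",  # OCR text (PPOCR-v3), degraded
}


def resolve_repos(name, variant="clean"):
    """Return the repo id to use for corpus, qrels, and queries."""
    if name not in DATASETS:
        raise ValueError("unknown dataset: " + name)
    if variant not in CORPUS_SUFFIX:
        raise ValueError("unknown variant: " + variant)
    clean_repo = CLEAN_PREFIX + name
    suffix = CORPUS_SUFFIX[variant]
    corpus_repo = clean_repo if suffix is None else VARIANT_PREFIX + name + suffix
    # qrels and queries always come from the clean repository.
    return {"corpus": corpus_repo, "qrels": clean_repo, "queries": clean_repo}


repos = resolve_repos("PlotQA", "deg")
print(repos)
# Each part would then be loaded as, for example:
#   load_dataset(repos["corpus"], "corpus", cache_dir=...)["train"]
```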
83
+
84
+
85
+ # Train datasets:
86
+ ## The 122k in-domain dataset (arxiv, plotqa, ...)
87
+ ```
88
+ load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
89
+ ```
90
+ ## The 239k synthetic dataset
91
+ ```
92
+ load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
93
+ ```
94
+ # Test datasets: (each test dataset has 3 parts: corpus, qrels, queries) (available in an image version and an OCR version)
95
+ # Image version
96
+ ## Clean PlotQA
97
+ ```
98
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
99
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
100
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
101
+ ```
102
+ ## Clean SlideVQA
103
+ ```
104
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
105
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
106
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
107
+ ```
108
+ ## Clean InfoVQA
109
+ ```
110
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
111
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
112
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
113
+ ```
114
+ ## Clean ArxivQA
115
+ ```
116
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
117
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
118
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
119
+ ```
120
+ ## Clean ChartQA
121
+ ```
122
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
123
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
124
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
125
+ ```
126
+ ## Clean MP-DocVQA
127
+ ```
128
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
129
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
130
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
131
+ ```
132
+ ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
133
+ ```
134
+ load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
135
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
136
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
137
+ ```
138
+ ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
139
+ ```
140
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
141
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
142
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
143
+ ```
144
+ ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
145
+ ```
146
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
147
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
148
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
149
+ ```
150
+ ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
151
+ ```
152
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
153
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
154
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
155
+ ```
156
+ ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
157
+ ```
158
+ load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
159
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
160
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
161
+ ```
162
+ ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
163
+ ```
164
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
165
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
166
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
167
+
168
+ ```
169
+ ## rvl cdip (3k) clean
170
+ ```
171
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
172
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
173
+ load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
174
+ ```
175
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
176
+ ```
177
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
178
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
179
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
180
+ ```
181
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
182
+ ```
183
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
184
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
185
+ load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
186
+ ```
187
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
188
+ ```
189
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
190
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
191
+ load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
192
+ ```
193
+
194
+ # OCR version (PPOCR-v5)
195
+ ## Clean PlotQA
196
+ ```
197
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
198
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
199
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
200
+ ```
201
+ ## Clean SlideVQA
202
+ ```
203
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
204
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
205
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
206
+ ```
207
+ ## Clean InfoVQA
208
+ ```
209
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
210
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
211
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
212
+ ```
213
+ ## Clean ArxivQA
214
+ ```
215
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
216
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
217
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
218
+ ```
219
+ ## Clean ChartQA
220
+ ```
221
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
222
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
223
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
224
+ ```
225
+ ## Clean MP-DocVQA
226
+ ```
227
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
228
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
229
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
230
+ ```
231
+ ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
232
+ ```
233
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
234
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
235
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
236
+ ```
237
+ ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
238
+ ```
239
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
240
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
241
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
242
+ ```
243
+ ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
244
+ ```
245
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
246
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
247
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
248
+ ```
249
+ ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
250
+ ```
251
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
252
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
253
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
254
+ ```
255
+ ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
256
+ ```
257
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
258
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
259
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
260
+ ```
261
+ ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
262
+ ```
263
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
264
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
265
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
266
+
267
+ ```
268
+ ## rvl cdip (3k) clean
269
+ ```
270
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
271
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
272
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
273
+ ```
274
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
275
+ ```
276
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
277
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
278
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
279
+ ```
280
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
281
+ ```
282
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
283
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
284
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
285
+ ```
286
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
287
+ ```
288
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
289
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
290
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
291
+ ```
292
+
293
+ # OCR version (PPOCR-v3)
294
+ ## Clean PlotQA
295
+ ```
296
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
297
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
298
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
299
+ ```
300
+ ## Clean SlideVQA
301
+ ```
302
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
303
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
304
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
305
+ ```
306
+ ## Clean InfoVQA
307
+ ```
308
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
309
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
310
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
311
+ ```
312
+ ## Clean ArxivQA
313
+ ```
314
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
315
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
316
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
317
+ ```
318
+ ## Clean ChartQA
319
+ ```
320
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
321
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
322
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
323
+ ```
324
+ ## Clean MP-DocVQA
325
+ ```
326
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
327
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
328
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
329
+ ```
330
+ ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
331
+ ```
332
+ load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
333
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
334
+ load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
335
+ ```
336
+ ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
337
+ ```
338
+ load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
339
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
340
+ load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
341
+ ```
342
+ ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
343
+ ```
344
+ load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
345
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
346
+ load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
347
+ ```
348
+ ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
349
+ ```
350
+ load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
351
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
352
+ load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
353
+ ```
354
+ ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
355
+ ```
356
+ load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
357
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
358
+ load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
359
+ ```
360
+ ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
361
+ ```
362
+ load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
363
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
364
+ load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
365
+
366
+ ```
367
+ ## rvl cdip (3k), clean
368
+ ```
369
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
370
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
371
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
372
+ ```
373
+ ## rvl cdip (REALWORLD) (3k) degraded realworld
374
+ ```
375
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
376
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
377
+ load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
378
+ ```
379
+ ## MP-DocVQA (REALWORLD) (741) degraded realworld
380
+ ```
381
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
382
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
383
+ load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
384
+ ```
385
+ ## ArxivQA (REALWORLD) (3000) degraded realworld
386
+ ```
387
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
388
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
389
+ load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
390
+ ```
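Each PPOCR-v3 test set listed above uses the same three `load_dataset` calls: a `corpus` repository that depends on the variant, plus the shared `qrels` and `queries`. The sketch below wraps that pattern in a helper. The names `build_specs` and `load_splits` and the `loader` parameter are hypothetical, and the repository naming scheme is only inferred from the `exo3-original-*` entries in this list, so it does not cover the `exo8`/`exo9`/`exo10` real-world sets.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def build_specs(name: str, degraded: bool = False) -> List[Tuple[str, str]]:
    """Return (repo_id, config) pairs for one PPOCR-v3 text test set.

    Assumption: the naming scheme is inferred from the entries listed above,
    e.g. "ChartQA" -> "rweics5cs7/exo3-original-ChartQA-text-v3".
    """
    suffix = "text-deg-v3" if degraded else "text-v3"
    corpus_repo = f"rweics5cs7/exo3-original-{name}-{suffix}"
    # qrels and queries are shared between the clean and degraded variants.
    shared_repo = f"openbmb/VisRAG-Ret-Test-{name}"
    return [
        (corpus_repo, "corpus"),
        (shared_repo, "qrels"),
        (shared_repo, "queries"),
    ]


def load_splits(
    name: str,
    degraded: bool = False,
    cache_dir: Optional[str] = None,
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load corpus, qrels and queries; every config exposes a "train" split."""
    if loader is None:
        # Imported lazily so build_specs can be used without `datasets`.
        from datasets import load_dataset as loader
    return {
        config: loader(repo, config, cache_dir=cache_dir)["train"]
        for repo, config in build_specs(name, degraded)
    }
```

For example, `load_splits("ChartQA", degraded=True, cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")` would issue the same three calls as the degraded ChartQA section above.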
Project/Doc/dataset.py ADDED
@@ -0,0 +1,265 @@
1
+ # datasets
2
+ """
3
+ # How to load the training datasets:
4
+ from datasets import load_dataset
5
+ db = load_dataset(...)["train"]
6
+ for x in db:
7
+ # x is a dict, e.g.
8
+ # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile\
9
+ # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
10
+ ...
11
+ ## How to load the test datasets:
12
+ from datasets import load_dataset
13
+ dbcorpus = load_dataset(..., "corpus")["train"]
14
+ dbqrels = load_dataset(..., "qrels")["train"]
15
+ dbqueries = load_dataset(..., "queries")["train"]
16
+ ## For the image version
17
+ for x in dbcorpus:
18
+ # x is a dict, e.g.
19
+ # {"corpus-id": "<image id>", "image": <PIL.PngImagePlugin.PngImageFile\
20
+ # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
21
+ ...
22
+ for x in dbqrels:
23
+ # x is a dict, e.g.
24
+ # {"query-id": "<query id>", "corpus-id": "<image id>"}
25
+ ...
26
+ for x in dbqueries:
27
+ # x is a dict, e.g.
28
+ # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
29
+ ...
30
+ ## For the OCR (text) version
31
+ for x in dbcorpus:
32
+ # x is a dict, e.g.
33
+ # {"corpus-id": "6519.png", "text": "string to describe a photo"}
34
+ ...
35
+ for x in dbqrels:
36
+ # x is a dict, e.g.
37
+ # {"query-id": "<query id>", "corpus-id": "<image id>"}
38
+ ...
39
+ for x in dbqueries:
40
+ # x is a dict, e.g.
41
+ # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
42
+ ...
43
+ """
44
+ """
45
+ cd /group-volume/Behaviour-Analysis/users/hsiang.chen/Robust/scripts/ && conda activate /group-volume/Human-Action-Analysis/users/hsiang.chen/envs/autodir/ && python dataset.py
46
+ """
47
+
48
+ from datasets import load_dataset
49
+
50
+ save_root = r"/home/work/shared-fi-datasets-01/users/hsiang.chen/Project/Robust/Dataset"
51
+ # # Train datasets:
52
+ # ## The 122k in-domain dataset (arxiv, plotqa, ...)
53
+ # load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir=save_root)["train"]
54
+ # ## 合成的239k的資料集
55
+ # load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir=save_root)["train"]
56
+
57
+ # # Test datasets: each test dataset has 3 splits (corpus, qrels, queries) and comes in an image version and an OCR version
58
+ # # Image version
59
+ # ## Clean PlotQA
60
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir=save_root)["train"]
61
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
62
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
63
+ # ## Clean SlideVQA
64
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir=save_root)["train"]
65
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
66
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
67
+ # ## Clean InfoVQA
68
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir=save_root)["train"]
69
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
70
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
71
+ # ## Clean ArxivQA
72
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir=save_root)["train"]
73
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
74
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
75
+ # ## Clean ChartQA
76
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir=save_root)["train"]
77
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
78
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
79
+ # ## Clean MP-DocVQA
80
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir=save_root)["train"]
81
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
82
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
83
+ # ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
84
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir=save_root)["train"]
85
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
86
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
87
+ # ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
88
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir=save_root)["train"]
89
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
90
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
91
+ # ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
92
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir=save_root)["train"]
93
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
94
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
95
+ # ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
96
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir=save_root)["train"]
97
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
98
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
99
+ # ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
100
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir=save_root)["train"]
101
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
102
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
103
+ # ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
104
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir=save_root)["train"]
105
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
106
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
107
+
108
+ ## rvl cdip (3k), clean (fixed version)
109
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "corpus", cache_dir=save_root)["train"]
110
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "qrels", cache_dir=save_root)["train"]
111
+ load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "queries", cache_dir=save_root)["train"]
112
+ # ## rvl cdip (3k), clean
113
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
114
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
115
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir=save_root)["train"]
116
+ # ## rvl cdip (REALWORLD) (3k) degraded realworld
117
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir=save_root)["train"]
118
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir=save_root)["train"]
119
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir=save_root)["train"]
120
+ # ## rvl cdip (REALWORLD) (3k) degraded realworld (fixed version - 1k query)
121
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "corpus", cache_dir=save_root)["train"]
122
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "qrels", cache_dir=save_root)["train"]
123
+ # load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "queries", cache_dir=save_root)["train"]
124
+ # ## MP-DocVQA (REALWORLD) (741) degraded realworld
125
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
126
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
127
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir=save_root)["train"]
128
+ # ## ArxivQA (REALWORLD) (3000) degraded realworld
129
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
130
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
131
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir=save_root)["train"]
132
+
133
+ # # OCR version (PPOCR-v5)
134
+ # ## Clean PlotQA
135
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir=save_root)["train"]
136
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
137
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
138
+ # ## Clean SlideVQA
139
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir=save_root)["train"]
140
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
141
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
142
+ # ## Clean InfoVQA
143
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir=save_root)["train"]
144
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
145
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
146
+ # ## Clean ArxivQA
147
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir=save_root)["train"]
148
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
149
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
150
+ # ## Clean ChartQA
151
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir=save_root)["train"]
152
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
153
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
154
+ # ## Clean MP-DocVQA
155
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir=save_root)["train"]
156
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
157
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
158
+ # ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
159
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir=save_root)["train"]
160
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
161
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
162
+ # ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
163
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir=save_root)["train"]
164
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
165
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
166
+ # ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
167
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir=save_root)["train"]
168
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
169
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
170
+ # ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
171
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir=save_root)["train"]
172
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
173
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
174
+ # ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
175
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir=save_root)["train"]
176
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
177
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
178
+ # ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
179
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir=save_root)["train"]
180
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
181
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
182
+
183
+ # ## rvl cdip (3k), clean
184
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
185
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
186
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
187
+ # ## rvl cdip (REALWORLD) (3k) degraded realworld
188
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir=save_root)["train"]
189
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir=save_root)["train"]
190
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir=save_root)["train"]
191
+ # ## MP-DocVQA (REALWORLD) (741) degraded realworld
192
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
193
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
194
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
195
+ # ## ArxivQA (REALWORLD) (3000) degraded realworld
196
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
197
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
198
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
199
+
200
+ # # OCR version (PPOCR-v3)
201
+ # ## Clean PlotQA
202
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir=save_root)["train"]
203
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
204
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
205
+ # ## Clean SlideVQA
206
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir=save_root)["train"]
207
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
208
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
209
+ # ## Clean InfoVQA
210
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir=save_root)["train"]
211
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
212
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
213
+ # ## Clean ArxivQA
214
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir=save_root)["train"]
215
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
216
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
217
+ # ## Clean ChartQA
218
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir=save_root)["train"]
219
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
220
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
221
+ # ## Clean MP-DocVQA
222
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir=save_root)["train"]
223
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
224
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
225
+ # ## PlotQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
226
+ # load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
227
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
228
+ # load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
229
+ # ## SlideVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
230
+ # load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
231
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
232
+ # load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
233
+ # ## InfoVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
234
+ # load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
235
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
236
+ # load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
237
+ # ## ArxivQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
238
+ # load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
239
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
240
+ # load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
241
+ # ## ChartQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
242
+ # load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
243
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
244
+ # load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
245
+ # ## MP-DocVQA (degraded (synthetic)); shares "qrels" and "queries" with the clean version
246
+ # load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
247
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
248
+ # load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
249
+
250
+ # ## rvl cdip (3k), clean
251
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
252
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
253
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
254
+ # ## rvl cdip (REALWORLD) (3k) degraded realworld
255
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir=save_root)["train"]
256
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir=save_root)["train"]
257
+ # load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir=save_root)["train"]
258
+ # ## MP-DocVQA (REALWORLD) (741) degraded realworld
259
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
260
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
261
+ # load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
262
+ # ## ArxivQA (REALWORLD) (3000) degraded realworld
263
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
264
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
265
+ # load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
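The three splits described in the docstring at the top of this file are linked by ids: each `qrels` row pairs a `query-id` with a `corpus-id`. Below is a small sketch of how one might join them before evaluation, assuming only the field names shown in that docstring. The helper names `build_ground_truth` and `iter_query_targets` and the toy rows are hypothetical stand-ins for the real datasets.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Mapping, Set, Tuple


def build_ground_truth(qrels: Iterable[Mapping[str, str]]) -> Dict[str, Set[str]]:
    """Map each query-id to the set of corpus-ids marked relevant in qrels."""
    ground_truth: Dict[str, Set[str]] = defaultdict(set)
    for row in qrels:
        ground_truth[row["query-id"]].add(row["corpus-id"])
    return dict(ground_truth)


def iter_query_targets(
    queries: Iterable[Mapping[str, str]],
    ground_truth: Mapping[str, Set[str]],
) -> Iterator[Tuple[str, Set[str]]]:
    """Yield (question text, relevant corpus-ids) for every query row."""
    for row in queries:
        # A query without any qrels entry gets an empty target set.
        yield row["query"], ground_truth.get(row["query-id"], set())


# Toy rows with the same keys as the real "qrels" and "queries" splits.
toy_qrels = [
    {"query-id": "q1", "corpus-id": "6519.png"},
    {"query-id": "q1", "corpus-id": "7000.png"},
    {"query-id": "q2", "corpus-id": "6519.png"},
]
toy_queries = [
    {"query-id": "q1", "query": "first question", "answer": "a"},
    {"query-id": "q3", "query": "unlabelled question", "answer": "b"},
]

toy_ground_truth = build_ground_truth(toy_qrels)
toy_targets = list(iter_query_targets(toy_queries, toy_ground_truth))
```

Because the helpers only iterate over rows and read keys, the same functions should also accept the `dbqrels` and `dbqueries` objects returned by `load_dataset(...)["train"]`.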