Upload folder using huggingface_hub
- Project/Doc/code.md +195 -0
- Project/Doc/data.md +390 -0
- Project/Doc/dataset.py +265 -0
Project/Doc/code.md
ADDED
@@ -0,0 +1,195 @@
| 1 |
+
|
| 2 |
+
# Getting it running
|
| 3 |
+
conda activate /group-volume/Human-Action-Analysis/users/hsiang.chen/envs/VisRAG
|
| 4 |
+
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
|
| 5 |
+
pip install transformers==4.40.2
|
| 6 |
+
conda install -c nvidia cuda-toolkit=11.8 cuda-nvcc=11.8
|
| 7 |
+
If the PyMuPDF install fails:
|
| 8 |
+
python -m pip install -U pip setuptools wheel packaging build
|
| 9 |
+
pip cache purge
|
| 10 |
+
pip install --only-binary=:all: PyMuPDF
|
| 11 |
+
Then install the rest of requirements.txt.
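
After the installs above, it can help to confirm that the pinned versions are the ones actually importable in the active environment. The following is a minimal sketch, not part of the original setup: the function names `base_version` and `check_pins` are hypothetical, and the pins are copied from the pip commands above. It strips local build suffixes such as `+cu118` before comparing.

```python
from importlib import metadata

# Pins copied from the pip commands above.
EXPECTED = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "torchaudio": "2.1.2",
    "transformers": "4.40.2",
}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu118'."""
    return version.split("+", 1)[0]


def check_pins(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every missing or mismatched package."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = base_version(get_version(name))
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[name] = (wanted, found)
    return problems
```

Calling `check_pins(EXPECTED)` inside the activated conda environment returns an empty dict when every pin matches.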
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
## Installing the VisRAG environment
|
| 15 |
+
Install VisRAG.
|
| 16 |
+
```
|
| 17 |
+
git clone https://github.com/OpenBMB/VisRAG.git
|
| 18 |
+
conda create --name VisRAG python==3.10.8
|
| 19 |
+
conda activate VisRAG
|
| 20 |
+
conda install nvidia/label/cuda-11.8.0::cuda-toolkit
|
| 21 |
+
cd VisRAG
|
| 22 |
+
pip install -r requirements.txt
|
| 23 |
+
pip install -e .
|
| 24 |
+
cd timm_modified
|
| 25 |
+
pip install -e .
|
| 26 |
+
cd ..
|
| 27 |
+
```
|
| 28 |
+
0. Rename or delete the original "/VisRAG/src" and "/VisRAG/scripts", and replace them with my versions: "/mnt/191/a/lyw/visrag2/VisRAG/src" and "/mnt/191/a/lyw/visrag2/VisRAG/scripts"
|
| 29 |
+
1. pip install peft==0.10.0 # LoRA (as I recall, newer versions fail to run; 0.10.0 works)
|
| 30 |
+
2. pip install scikit-image==0.25.2 # needed by the degradation code
|
| 31 |
+
3. pip install opencv-python==4.11.0.86 # needed by the degradation code
|
| 32 |
+
|
| 33 |
+
4. Update the paths in the training and eval scripts
|
| 34 |
+
```
|
| 35 |
+
a. In "/VisRAG/scripts/train_retriever/train.sh", replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/lib" and "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with the paths of your own conda environment (two places in total).
|
| 36 |
+
b. In "/VisRAG/scripts/eval_retriever", in eval.sh, eval2.sh, evalreal.sh, evalreal2.sh, and evalreal3.sh, replace "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe/bin/python" with your own python path (two places in total).
|
| 37 |
+
```
|
| 38 |
+
5. In "/VisRAG/src/openmatch/arguments.py", change each "*cache_dir" to your own path (search for /mnt/191/ with Ctrl+F; there are two occurrences)
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
6. Download our model (VisIR) (LLM+vpm1+vpm2+zero_linear)
|
| 42 |
+
```
|
| 43 |
+
a. VisIR (clean, untrained, zero_linear all zeros): "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" (the whole folder is needed)
|
| 44 |
+
b. VisIR (stage1): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-065530-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
|
| 45 |
+
c. VisIR (stage1+2): "/mnt/191/a/lyw/VisRAG/checkpoints/train-2025-06-19-194437-model-data-lr-1e-5-softm_temp-0.02-bsz16-ngpus4-nnodes1-inbatch--nepoch-1-pooling-wmean-attention-causal-qinstruct-true-cinstruct-false-gradcache-true-passage-stopgrad-false-npassage-1" (the whole folder is needed)
|
| 46 |
+
```
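
Steps 4a and 4b above come down to replacing one hard-coded conda prefix with your own in a handful of shell scripts. The sketch below shows one way to do that programmatically. The helper names `rewrite_prefix` and `patch_file` are hypothetical, the old prefix is the one quoted in step 4, and the new prefix is whatever your own environment uses.

```python
from pathlib import Path

# Hard-coded prefix quoted in step 4 above.
OLD_PREFIX = "/mnt/VENV/user-conda/r12943109/miniconda3/envs/loramoe"


def rewrite_prefix(text: str, old: str, new: str) -> tuple[str, int]:
    """Replace every occurrence of `old` with `new`; also return the replacement count."""
    return text.replace(old, new), text.count(old)


def patch_file(path: Path, old: str, new: str) -> int:
    """Rewrite one script in place and return the number of replacements made."""
    patched, count = rewrite_prefix(path.read_text(encoding="utf-8"), old, new)
    if count:
        path.write_text(patched, encoding="utf-8")
    return count
```

Each script is expected to report two replacements, matching the "two places" noted in steps 4a and 4b; any other count suggests the script differs from what is described here.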
|
| 47 |
+
|
| 48 |
+
## Eval commands
|
| 49 |
+
The eval scripts differ as follows
|
| 50 |
+
```
|
| 51 |
+
There are five eval scripts, one for each of five test datasets
|
| 52 |
+
1. eval.sh: evaluates the clean ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
|
| 53 |
+
2. eval2.sh: evaluates the degraded ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA
|
| 54 |
+
3. evalreal.sh: evaluates the clean rvl-cdip
|
| 55 |
+
4. evalreal2.sh: evaluates the real-world rvl-cdip
|
| 56 |
+
5. evalreal3.sh: evaluates the real-world MP-DocVQA
|
| 57 |
+
In practice they differ only in CHECKPOINT_DIR (where the test results are saved), CORPUS_PATH, QUERY_PATH, and QRELS_PATH
|
| 58 |
+
```
|
| 59 |
+
How to use eval
|
| 60 |
+
```
|
| 61 |
+
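PYTHONNOUSERSITE=1 bash <path to the eval script> 512 2048 <batch size per GPU> <number of GPUs> wmean causal <test dataset names> <subfolder name for the saved results (for identification; otherwise only the date is used)> <model dir path> <path to the GOAT LoRA safetensors; leave empty if there is none, and note that it must point to the .safetensors file, not a directory>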
|
| 62 |
+
```
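
Because every argument is positional, it is easy to put one in the wrong slot. As a hedged illustration, the helper below assembles the command from named parameters in the order of the usage line above. `build_eval_command` is a hypothetical name, not something shipped with the scripts; the fixed values 512, 2048, wmean, and causal are copied from the usage line.

```python
import shlex


def build_eval_command(script, per_gpu_batch, n_gpus, datasets, run_name,
                       model_dir, lora_path=None):
    """Assemble the positional arguments in the order of the usage line."""
    args = [
        "bash", script,
        "512", "2048",                    # fixed values from the usage line
        str(per_gpu_batch), str(n_gpus),
        "wmean", "causal",                # fixed values from the usage line
        ",".join(datasets),               # comma-separated dataset names
        run_name,                         # subfolder name for the results
        model_dir,
    ]
    if lora_path is not None:
        if not lora_path.endswith(".safetensors"):
            raise ValueError("lora_path must be a .safetensors file, not a directory")
        args.append(lora_path)
    return "PYTHONNOUSERSITE=1 " + " ".join(shlex.quote(a) for a in args)
```

The returned string can be pasted into a shell or printed for logging; `shlex.quote` protects any argument that contains spaces or non-ASCII characters.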
|
| 63 |
+
Usage of eval.sh and eval2.sh
|
| 64 |
+
```
|
| 65 |
+
The dataset-name argument (the seventh positional argument) accepts up to six datasets, "ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA", separated by commas
|
| 66 |
+
|
| 67 |
+
e.g.
|
| 68 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
|
| 69 |
+
|
| 70 |
+
e.g.
|
| 71 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
|
| 72 |
+
|
| 73 |
+
e.g.
|
| 74 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/eval2.sh 512 2048 16 2 wmean causal ArxivQA,ChartQA,MP-DocVQA,InfoVQA,PlotQA,SlideVQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR" "/goat/lora/path/如果有的話.safetensors"
|
| 75 |
+
```
|
| 76 |
+
Usage of evalreal.sh, evalreal2.sh, and evalreal3.sh
|
| 77 |
+
```
|
| 78 |
+
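The dataset-name argument (the seventh positional argument) does not matter here, but at least one name must be given: the script does not use it as input, only for naming. "ChartQA" is the recommended value; if you pass six names, the same evaluation is repeated six times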
|
| 79 |
+
|
| 80 |
+
e.g.
|
| 81 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
|
| 82 |
+
e.g.
|
| 83 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal2.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
|
| 84 |
+
e.g.
|
| 85 |
+
PYTHONNOUSERSITE=1 bash scripts/eval_retriever/evalreal3.sh 512 2048 16 2 wmean causal ChartQA 乾淨visir測試 "/mnt/191/a/lyw/VisRAG/openbmb/VisIR"
|
| 86 |
+
```
|
| 87 |
+
Eval helper script (if you forgot to record the results, they can be recomputed as long as the .trec files exist)
|
| 88 |
+
```
|
| 89 |
+
You can use "/src/把visrag_eval_trec_算出三個分.py",
|
| 90 |
+
Suppose you have:
|
| 91 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ArxivQA
|
| 92 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/ChartQA
|
| 93 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/MP-DocVQA
|
| 94 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/InfoVQA
|
| 95 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/PlotQA
|
| 96 |
+
/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/SlideVQA
|
| 97 |
+
|
| 98 |
+
Only these three values in the script need to be changed
|
| 99 |
+
ROOT=/VisRAG/evaloutput/乾淨visir測試/eval-...-abc/
|
| 100 |
+
CACHE_DIR = "/path/to/save/your/datasets/"
|
| 101 |
+
要看什麼 = 'recall_10' or 要看什麼 = 'recip_rank' (the variable name means "what to look at", i.e. the metric to report)
|
| 102 |
+
|
| 103 |
+
=> and the results can be reproduced
|
| 104 |
+
```
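
To make concrete what recomputing from a `.trec` file involves, here is a minimal sketch that is independent of the helper script above and is not its implementation. It assumes the standard TREC run layout (`qid Q0 docid rank score tag`) and a qrels mapping from each query id to its set of relevant corpus ids, and it computes the two metrics named above, `recall_10` and `recip_rank`. All function names are hypothetical.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse 'qid Q0 docid rank score tag' lines into {qid: [docid, ...]}, best score first."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        qid, _, docid, _, score, _ = parts[:6]
        scored[qid].append((float(score), docid))
    return {
        qid: [docid for _, docid in sorted(pairs, key=lambda p: -p[0])]
        for qid, pairs in scored.items()
    }


def recall_at_k(ranking, relevant, k=10):
    """Fraction of the relevant documents found in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for position, docid in enumerate(ranking, start=1):
        if docid in relevant:
            return 1.0 / position
    return 0.0


def evaluate(run, qrels, k=10):
    """Average both metrics over the queries that have at least one relevant document."""
    qids = [qid for qid, relevant in qrels.items() if relevant]
    if not qids:
        return {f"recall_{k}": 0.0, "recip_rank": 0.0}
    recall = sum(recall_at_k(run.get(q, []), qrels[q], k) for q in qids) / len(qids)
    mrr = sum(reciprocal_rank(run.get(q, []), qrels[q]) for q in qids) / len(qids)
    return {f"recall_{k}": recall, "recip_rank": mrr}
```

Numbers produced this way may differ slightly from the helper script if it handles ties or missing queries differently.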
|
| 105 |
+
|
| 106 |
+
## Train commands
|
| 107 |
+
Unlike eval, training uses a single script: "/VisRAG/scripts/train_retriever/train.sh"
|
| 108 |
+
```
|
| 109 |
+
The only thing in train.sh that needs manual editing is --save_steps 500: set how many steps should pass between saved checkpoints
|
| 110 |
+
After training, the checkpoints are saved under /VisRAG/checkpoints/...
|
| 111 |
+
```
|
| 112 |
+
General usage of the train command
|
| 113 |
+
```
|
| 114 |
+
"Arguments that never need changing are left unlabeled"
|
| 115 |
+
PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
|
| 116 |
+
<batch size per GPU> <number of GPUs> 0.02 1 true false config/deepspeed.json \
|
| 117 |
+
1e-5 false wmean causal 1 true <minibatch> false \
|
| 118 |
+
<clean model dir> <dataset name> \
|
| 119 |
+
<use degradation?> <train on the small subset only?> <stage1 or stage2> <LoRA used?>
|
| 120 |
+
|
| 121 |
+
2048 is the number of tokens used for a passage
|
| 122 |
+
0.02 is the training temperature, following VisRAG
|
| 123 |
+
1e-5 is the learning rate, following VisRAG
|
| 124 |
+
<dataset name>: one of two, "openbmb/VisRAG-Ret-Train-In-domain-data" or "openbmb/VisRAG-Ret-Train-Synthetic-data"
|
| 125 |
+
<minibatch>: can be 1, 2, 3, 4, ...; the per-GPU batch of samples is run through the model in chunks of this size and accumulated. Larger values use more VRAM
|
| 126 |
+
<use degradation?>: true or false; just use true. It controls whether the data is degraded during fine-tuning
|
| 127 |
+
<train on the small subset only?>: true or false; just use false. With true, only the first 30k samples are trained on; the 30k can be changed in "/VisRAG/src/openmatch/dataset/train_dataset.py"
|
| 128 |
+
<stage1 or stage2>: "stage1" or "stage2"; marks which stage is being trained. Take care not to mistype it
|
| 129 |
+
<LoRA used?>: "None" or any non-None string; marks whether LoRA is used
|
| 130 |
+
```
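
Since a mistyped stage name or dataset name is only caught late, a small wrapper that validates the variable arguments before building the command can save a failed launch. The following is a sketch under the assumption that the positional order is exactly the one in the usage template above. `build_train_command` and `lora_tag` are hypothetical names; the fixed values are copied from the template.

```python
import shlex

DATASETS = (
    "openbmb/VisRAG-Ret-Train-In-domain-data",
    "openbmb/VisRAG-Ret-Train-Synthetic-data",
)


def build_train_command(per_gpu_batch, n_gpus, minibatch, model_dir, dataset,
                        stage, use_degradation=True, small_subset=False,
                        lora_tag="None"):
    """Validate the variable arguments, then assemble them in template order."""
    if stage not in ("stage1", "stage2"):
        raise ValueError(f"stage must be 'stage1' or 'stage2', got {stage!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}")

    def flag(value):
        return "true" if value else "false"

    args = [
        "bash", "scripts/train_retriever/train.sh", "2048",
        str(per_gpu_batch), str(n_gpus),
        "0.02", "1", "true", "false", "config/deepspeed.json",
        "1e-5", "false", "wmean", "causal", "1", "true",
        str(minibatch), "false",
        model_dir, dataset,
        flag(use_degradation), flag(small_subset), stage, lora_tag,
    ]
    return "PYTHONNOUSERSITE=1 " + " ".join(shlex.quote(a) for a in args)
```

For a stage 2 run, pass the stage 1 checkpoint directory as `model_dir`, `stage="stage2"`, and any string other than "None" as `lora_tag`.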
|
| 131 |
+
Stage 1 training
|
| 132 |
+
```
|
| 133 |
+
e.g.
|
| 134 |
+
PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
|
| 135 |
+
16 2 0.02 1 true false config/deepspeed.json \
|
| 136 |
+
1e-5 false wmean causal 1 true 1 false \
|
| 137 |
+
'/mnt/191/a/lyw/VisRAG/openbmb/VisIR' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
|
| 138 |
+
true false 'stage1' 'None'
|
| 139 |
+
```
|
| 140 |
+
Stage 2 training
|
| 141 |
+
```
|
| 142 |
+
e.g.
|
| 143 |
+
PYTHONNOUSERSITE=1 bash scripts/train_retriever/train.sh 2048 \
|
| 144 |
+
16 2 0.02 1 true false config/deepspeed.json \
|
| 145 |
+
1e-5 false wmean causal 1 true 1 false \
|
| 146 |
+
'/path/to/stage1/trained/model/dir' 'openbmb/VisRAG-Ret-Train-In-domain-data' \
|
| 147 |
+
true false 'stage2' 'with-lora'
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
|
| 151 |
+
# Code
|
| 152 |
+
|
| 153 |
+
## GOAT config:
|
| 154 |
+
```
|
| 155 |
+
Note that my GOAT config parameters, such as the number of experts, are set in the code
|
| 156 |
+
See build() in "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py"
|
| 157 |
+
It contains the line goat_config = GoatV3Config(), so for now you have to record for yourself how the config was adjusted
|
| 158 |
+
Alternatively, change the default parameters of GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py"
|
| 159 |
+
```
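
Because the GOAT settings live in code, one low-effort way to record them is to dump the config object to JSON next to each checkpoint. The sketch below illustrates the idea only: `ExampleGoatConfig` and its fields are hypothetical stand-ins (the real `GoatV3Config` fields are not listed in this document), and `config_to_json` is a hypothetical helper that assumes the config is either a dataclass or a plain object with attributes.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ExampleGoatConfig:
    """Hypothetical stand-in; not the real GoatV3Config."""
    num_experts: int = 4   # hypothetical field
    lora_rank: int = 8     # hypothetical field


def config_to_json(config) -> str:
    """Serialize a dataclass instance (or any object with __dict__) to readable JSON."""
    data = asdict(config) if is_dataclass(config) else dict(vars(config))
    return json.dumps(data, indent=2, sort_keys=True)
```

Writing the returned string to a file inside the checkpoint folder keeps the settings and the weights together.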
|
| 160 |
+
|
| 161 |
+
## Training code:
|
| 162 |
+
```
|
| 163 |
+
1. "/VisRAG/src/openmatch/driver/train.py": the main training script
|
| 164 |
+
2. "/VisRAG/src/openmatch/arguments.py": look here to check which arguments exist
|
| 165 |
+
3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": my model definition
|
| 166 |
+
4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation
|
| 167 |
+
5. "/VisRAG/src/openmatch/dataset/train_dataset.py":line213(def get_process_fn): defines how I process the Hugging Face training dataset
|
| 168 |
+
6. "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py": build(): defines how I obtain the model to train
|
| 169 |
+
Note that the retrieval model is first wrapped in DRModel, so wherever the code calls model(...), the call is actually handled by DRModel's forward
|
| 170 |
+
See DRModel's forward() & encode()
|
| 171 |
+
7. "/VisRAG/src/openmatch/trainer/dense_trainer.py": _save(...): defines how I save the model
|
| 172 |
+
8. "/VisRAG/src/openmatch/trainer/dense_trainer.py": training_step(...): defines the training step
|
| 173 |
+
Because training uses minibatches, it first runs inference on everything once to obtain a cache, then runs inference again to compute the loss
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
## Eval code:
|
| 177 |
+
```
|
| 178 |
+
1. "/VisRAG/src/openmatch/driver/eval.py": the main inference script
|
| 179 |
+
2. "/VisRAG/src/openmatch/arguments.py": look here to check which arguments exist
|
| 180 |
+
train_args is not used during eval, so if train_args: can be used to tell whether the code is training
|
| 181 |
+
3. "/VisRAG/src/openmatch/modeling/modeling_visir/modeling_vis_ir.py": my model definition
|
| 182 |
+
4. "/VisRAG/src/openmatch/modeling/goat_injector.py": the GOAT MoE LoRA implementation
|
| 183 |
+
5. "/VisRAG/src/openmatch/dataset/inference_dataset.py": handles the data for inference
|
| 184 |
+
Note: I added a "special handling" variable here, which rotates all images of the real-world datasets upright
|
| 185 |
+
6. "/VisRAG/src/openmatch/retriever/dense_retriever.py": defines how the embeddings are stored first and the scores computed at the end
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
## Model architecture diagrams
|
| 189 |
+
```
|
| 190 |
+
For the clean model without GOAT, see "/mnt/191/a/lyw/visrag2/VisIR(模型架構).txt"
|
| 191 |
+
For the model with GOAT, see "/mnt/191/a/lyw/visrag2/VisIR+GOAT(模型架構).txt"
|
| 192 |
+
To adjust the GOAT architecture, you can change
|
| 193 |
+
GoatV3Config in "/VisRAG/src/openmatch/modeling/goat_injector.py", or
|
| 194 |
+
the loading logic directly in the build(...) code of "/VisRAG/src/openmatch/modeling/dense_retrieval_model.py"
|
| 195 |
+
```
|
Project/Doc/data.md
ADDED
@@ -0,0 +1,390 @@
| 1 |
+
# datasets
|
| 2 |
+
|
| 3 |
+
## How to load the datasets (train):
|
| 4 |
+
```
|
| 5 |
+
from datasets import load_dataset
|
| 6 |
+
db = load_dataset(...)["train"]
|
| 7 |
+
for x in db:
|
| 8 |
+
    # x is a dict, e.g.
|
| 9 |
+
    # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile
|
| 10 |
+
    #  image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
|
| 11 |
+
    ...
|
| 12 |
+
```
|
| 13 |
+
## How to load the datasets (test):
|
| 14 |
+
```
|
| 15 |
+
from datasets import load_dataset
|
| 16 |
+
dbcorpus = load_dataset(..., "corpus")["train"]
|
| 17 |
+
dbqrels = load_dataset(..., "qrels")["train"]
|
| 18 |
+
dbqueries = load_dataset(..., "queries")["train"]
|
| 19 |
+
```
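
The three calls above follow one pattern, so they can be wrapped in a single helper. This is a sketch rather than part of the original notes: `load_retrieval_splits` is a hypothetical name, and the `loader` parameter exists only so the function can be exercised without network access. By default it uses `datasets.load_dataset` with the subset name as the second positional argument and `cache_dir` as a keyword, as in the calls above.

```python
def load_retrieval_splits(repo, cache_dir=None, loader=None):
    """Return {"corpus": ..., "qrels": ..., "queries": ...} for one test repository."""
    if loader is None:
        # Imported lazily so the helper can be tested without the library installed.
        from datasets import load_dataset as loader
    return {
        name: loader(repo, name, cache_dir=cache_dir)["train"]
        for name in ("corpus", "qrels", "queries")
    }
```

For example, `load_retrieval_splits("openbmb/VisRAG-Ret-Test-ChartQA", cache_dir="/path/to/cache")` gives all three subsets of the clean ChartQA test set; the cache path here is a placeholder.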
|
| 20 |
+
## For the image datasets
|
| 21 |
+
```
|
| 22 |
+
for x in dbcorpus:
|
| 23 |
+
    # x is a dict, e.g.
|
| 24 |
+
    # {"corpus-id": "<image id>", "image": <PIL image>}
|
| 25 |
+
    # ex. {'corpus-id': '2010.05458_3.jpg', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1122x551 at 0x7F57A3667790>}
|
| 26 |
+
    ...
|
| 27 |
+
for x in dbqrels:
|
| 28 |
+
    # x is a dict, e.g.
|
| 29 |
+
    # {"query-id": "<query id>", "corpus-id": "<image id>"}
|
| 30 |
+
    # ex. {'query-id': '1508.06771_0.jpg-1', 'corpus-id': '1508.06771_0.jpg', 'score': 1}
|
| 31 |
+
    ...
|
| 32 |
+
for x in dbqueries:
|
| 33 |
+
    # x is a dict, e.g.
|
| 34 |
+
    # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
|
| 35 |
+
    # ex. {'query-id': '1508.06771_0.jpg-1',
|
| 36 |
+
    #  'query': "Which statement best describes the relationship between the components labeled 'Myosin' and 'Actin filament'?", 'answer': 'C',
|
| 37 |
+
    #  'options': ['A) Myosin binds directly to crosslinkers.', 'B) Actin filaments are independent of myosin.', 'C) Myosin heads are bound to actin filaments.', 'D) Crosslinkers prevent the interaction between myosin and actin filaments.'], 'is_numerical': 0}
|
| 38 |
+
    ...
|
| 39 |
+
```
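
The three subsets are linked by ids: each qrels row ties one `query-id` to one `corpus-id`. As a hedged illustration of how they join, the sketch below builds a lookup from query id to relevant corpus ids and attaches it to the queries. It works on plain dicts shaped like the example rows above; the function names are hypothetical, and treating a missing `score` as relevant is an assumption.

```python
from collections import defaultdict


def build_relevance_index(qrels_rows):
    """Map each query-id to the set of corpus-ids judged relevant (score > 0)."""
    index = defaultdict(set)
    for row in qrels_rows:
        if row.get("score", 1) > 0:  # assumption: a missing score means relevant
            index[row["query-id"]].add(row["corpus-id"])
    return dict(index)


def attach_relevant_ids(query_rows, index):
    """Return (query text, sorted relevant corpus-ids) pairs, one per query."""
    return [
        (row["query"], sorted(index.get(row["query-id"], ())))
        for row in query_rows
    ]
```

The same index can be passed to a metric function to score retrieval results per query.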
|
| 40 |
+
|
| 41 |
+
## For the OCR datasets
|
| 42 |
+
```
|
| 43 |
+
for x in dbcorpus:
|
| 44 |
+
    # x is a dict, e.g.
|
| 45 |
+
    # {"corpus-id": "6519.png", "text": "string to describe a photo"}
|
| 46 |
+
    ...
|
| 47 |
+
for x in dbqrels:
|
| 48 |
+
    # x is a dict, e.g.
|
| 49 |
+
    # {"query-id": "<query id>", "corpus-id": "<image id>"}
|
| 50 |
+
    ...
|
| 51 |
+
for x in dbqueries:
|
| 52 |
+
    # x is a dict, e.g.
|
| 53 |
+
    # {"query-id": "<query id>", "query": "<question>", "answer": "<answer to the question>"}
|
| 54 |
+
    ...
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
# Dataset
|
| 58 |
+
=========================================================================
|
| 59 |
+
Training:
|
| 60 |
+
openbmb/VisRAG-Ret-Train-In-domain-data # 122,752
|
| 61 |
+
openbmb/VisRAG-Ret-Train-Synthetic-data # 239,358
|
| 62 |
+
Testing:
|
| 63 |
+
--Clean:
|
| 64 |
+
openbmb/VisRAG-Ret-Test-PlotQA (corpus, qrels, queries)
|
| 65 |
+
openbmb/VisRAG-Ret-Test-SlideVQA (corpus, qrels, queries)
|
| 66 |
+
openbmb/VisRAG-Ret-Test-InfoVQA (corpus, qrels, queries)
|
| 67 |
+
openbmb/VisRAG-Ret-Test-ArxivQA (corpus, qrels, queries)
|
| 68 |
+
openbmb/VisRAG-Ret-Test-ChartQA (corpus, qrels, queries)
|
| 69 |
+
openbmb/VisRAG-Ret-Test-MP-DocVQA (corpus, qrels, queries)
|
| 70 |
+
--Degradation Image (using clean qrels & queries)
|
| 71 |
+
rweics5cs7/exo3-original-PlotQA-deg (corpus)
|
| 72 |
+
rweics5cs7/exo3-original-SlideVQA-deg (corpus)
|
| 73 |
+
rweics5cs7/exo3-original-InfoVQA-deg (corpus)
|
| 74 |
+
rweics5cs7/exo3-original-ArxivQA-deg (corpus)
|
| 75 |
+
rweics5cs7/exo3-original-ChartQA-deg (corpus)
|
| 76 |
+
rweics5cs7/exo3-original-MP-DocVQA-deg (corpus)
|
| 77 |
+
--Real-World
|
| 78 |
+
rweics5cs7/exo7-realworld-db-combined (corpus, qrels, queries) rvl cdip (3k) clean
|
| 79 |
+
rweics5cs7/exo7-realworld-db-combined-deg (corpus, qrels, queries) rvl cdip (REALWORLD) (3k) degraded
|
| 80 |
+
rweics5cs7/exo9-realworld-db-combined (corpus, qrels, queries) MP-DocVQA (REALWORLD) (741) degraded
|
| 81 |
+
rweics5cs7/exo10-realworld-db-combined (corpus, qrels, queries) ArxivQA (REALWORLD) (3000) degraded
|
| 82 |
+
=========================================================================
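
For the six benchmark test sets, the listing above follows a regular naming pattern: the clean repository supplies all three subsets, and the degraded variant swaps in only the corpus. The sketch below encodes that pattern. `repos_for` is a hypothetical helper, and it covers only the clean and synthetic-degradation image sets, not the real-world or OCR repositories.

```python
BENCHMARKS = ("PlotQA", "SlideVQA", "InfoVQA", "ArxivQA", "ChartQA", "MP-DocVQA")


def repos_for(benchmark, degraded=False):
    """Return which repository supplies each subset for one benchmark."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark {benchmark!r}")
    clean = f"openbmb/VisRAG-Ret-Test-{benchmark}"
    # Degraded sets replace the corpus only; qrels and queries stay clean.
    corpus = f"rweics5cs7/exo3-original-{benchmark}-deg" if degraded else clean
    return {"corpus": corpus, "qrels": clean, "queries": clean}
```

Each value can be passed as the first argument of `load_dataset`, with the dictionary key as the subset name.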
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
# Train datasets:
|
| 86 |
+
## The 122k in-domain dataset (arxiv, plotqa, ...)
|
| 87 |
+
```
|
| 88 |
+
load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 89 |
+
```
|
| 90 |
+
## The 239k synthetic dataset
|
| 91 |
+
```
|
| 92 |
+
load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 93 |
+
```
|
| 94 |
+
# Test datasets: (each test dataset has 3 subsets: corpus, qrels, queries) (available in an image version and an OCR version)
|
| 95 |
+
# Image version
|
| 96 |
+
## Clean PlotQA
|
| 97 |
+
```
|
| 98 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 99 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 100 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 101 |
+
```
|
| 102 |
+
## Clean SlideVQA
|
| 103 |
+
```
|
| 104 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 105 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 106 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 107 |
+
```
|
| 108 |
+
## Clean InfoVQA
|
| 109 |
+
```
|
| 110 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 111 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 112 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 113 |
+
```
|
| 114 |
+
## Clean ArxivQA
|
| 115 |
+
```
|
| 116 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 117 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 118 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 119 |
+
```
|
| 120 |
+
## Clean ChartQA
|
| 121 |
+
```
|
| 122 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 123 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 124 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 125 |
+
```
|
| 126 |
+
## Clean MP-DocVQA
|
| 127 |
+
```
|
| 128 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 129 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 130 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 131 |
+
```
|
| 132 |
+
## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 133 |
+
```
|
| 134 |
+
load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 135 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 136 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 137 |
+
```
|
| 138 |
+
## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 139 |
+
```
|
| 140 |
+
load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 141 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 142 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 143 |
+
```
|
| 144 |
+
## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 145 |
+
```
|
| 146 |
+
load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 147 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 148 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 149 |
+
```
|
| 150 |
+
## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 151 |
+
```
|
| 152 |
+
load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 153 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 154 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 155 |
+
```
|
| 156 |
+
## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 157 |
+
```
|
| 158 |
+
load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 159 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 160 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 161 |
+
```
|
| 162 |
+
## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 163 |
+
```
|
| 164 |
+
load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 165 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 166 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 167 |
+
|
| 168 |
+
```
|
| 169 |
+
## rvl cdip (3k) clean
|
| 170 |
+
```
|
| 171 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 172 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 173 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 174 |
+
```
|
| 175 |
+
## rvl cdip (REALWORLD) (3k) degraded realworld
|
| 176 |
+
```
|
| 177 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 178 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 179 |
+
load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 180 |
+
```
|
| 181 |
+
## MP-DocVQA (REALWORLD) (741) degraded realworld
|
| 182 |
+
```
|
| 183 |
+
load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 184 |
+
load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 185 |
+
load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 186 |
+
```
|
| 187 |
+
## ArxivQA (REALWORLD) (3000) degraded realworld
|
| 188 |
+
```
|
| 189 |
+
load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 190 |
+
load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 191 |
+
load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 192 |
+
```
|
| 193 |
+
|
| 194 |
+
# OCR version (PPOCR-v5)
|
| 195 |
+
## Clean PlotQA
|
| 196 |
+
```
|
| 197 |
+
load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 198 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 199 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 200 |
+
```
|
| 201 |
+
## Clean SlideVQA
|
| 202 |
+
```
|
| 203 |
+
load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 204 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 205 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 206 |
+
```
|
| 207 |
+
## Clean InfoVQA
|
| 208 |
+
```
|
| 209 |
+
load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 210 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 211 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 212 |
+
```
|
| 213 |
+
## Clean ArxivQA
|
| 214 |
+
```
|
| 215 |
+
load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 216 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 217 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 218 |
+
```
|
| 219 |
+
## Clean ChartQA
|
| 220 |
+
```
|
| 221 |
+
load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 222 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 223 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 224 |
+
```
|
| 225 |
+
## Clean MP-DocVQA
|
| 226 |
+
```
|
| 227 |
+
load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 228 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 229 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 230 |
+
```
|
| 231 |
+
## PlotQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 232 |
+
```
|
| 233 |
+
load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 234 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 235 |
+
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 236 |
+
```
|
| 237 |
+
## SlideVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 238 |
+
```
|
| 239 |
+
load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 240 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 241 |
+
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 242 |
+
```
|
| 243 |
+
## InfoVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 244 |
+
```
|
| 245 |
+
load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 246 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 247 |
+
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 248 |
+
```
|
| 249 |
+
## ArxivQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 250 |
+
```
|
| 251 |
+
load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 252 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 253 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 254 |
+
```
|
| 255 |
+
## ChartQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 256 |
+
```
|
| 257 |
+
load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 258 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 259 |
+
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 260 |
+
```
|
| 261 |
+
## MP-DocVQA (degraded, synthetic); shares "qrels" and "queries" with the clean version
|
| 262 |
+
```
|
| 263 |
+
load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 264 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 265 |
+
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
|
| 266 |
+
|
| 267 |
+
```
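The synthetic-degraded OCR sets follow one naming pattern: the corpus lives under `rweics5cs7/exo3-original-<name>-text-deg`, while qrels and queries come from the clean `openbmb/VisRAG-Ret-Test-<name>` repo. A minimal sketch of that mapping, assuming the pattern holds for every set listed here; `ocr_test_repos` and its parameters are hypothetical names, not part of any library:

```python
from typing import Dict

# Prefixes copied from the repo IDs listed in this document.
CLEAN_PREFIX = "openbmb/VisRAG-Ret-Test-"
OCR_PREFIX = "rweics5cs7/exo3-original-"


def ocr_test_repos(name: str, degraded: bool = False, ocr_suffix: str = "") -> Dict[str, str]:
    """Hypothetical helper: map each part (corpus/qrels/queries) to its repo ID."""
    corpus = f"{OCR_PREFIX}{name}-text"
    if degraded:
        corpus += "-deg"
    # Assumption: "" matches the un-suffixed OCR lists, "-v3" the PPOCR-v3 lists.
    corpus += ocr_suffix
    shared = f"{CLEAN_PREFIX}{name}"  # qrels and queries are shared with the clean set
    return {"corpus": corpus, "qrels": shared, "queries": shared}


repos = ocr_test_repos("InfoVQA", degraded=True)
print(repos["corpus"])
```

Each value can then be passed to `load_dataset(repo, part, cache_dir=...)["train"]` in the same way as the calls listed above.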
## rvl cdip (3k), clean
```
load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## rvl cdip (REALWORLD) (3k), real-world degraded
```
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## MP-DocVQA (REALWORLD) (741), real-world degraded
```
load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## ArxivQA (REALWORLD) (3000), real-world degraded
```
load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
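Every OCR test set comes as three parts, and the qrels rows are what tie a query to its relevant corpus entries. The sketch below shows one way to join them, assuming the row fields described in the `dataset.py` docstring (`"corpus-id"`, `"text"`, `"query-id"`, `"query"`, `"answer"`). `join_queries_to_corpus` is a hypothetical helper and the toy rows are made up; with real data the three arguments would be the loaded `"train"` splits.

```python
from collections import defaultdict


def join_queries_to_corpus(corpus, qrels, queries):
    """Hypothetical helper: attach the relevant OCR texts to each query via qrels."""
    text_by_id = {row["corpus-id"]: row["text"] for row in corpus}

    # Group the relevant corpus ids under each query id.
    relevant = defaultdict(list)
    for row in qrels:
        relevant[row["query-id"]].append(row["corpus-id"])

    joined = []
    for q in queries:
        ids = relevant.get(q["query-id"], [])
        joined.append({
            "query": q["query"],
            "answer": q["answer"],
            # Skip ids that the corpus does not contain instead of raising.
            "texts": [text_by_id[i] for i in ids if i in text_by_id],
        })
    return joined


# Toy rows for illustration only.
corpus = [
    {"corpus-id": "a.png", "text": "revenue chart 2020"},
    {"corpus-id": "b.png", "text": "population table"},
]
qrels = [{"query-id": "q1", "corpus-id": "b.png"}]
queries = [{"query-id": "q1", "query": "What is the population?", "answer": "42"}]

joined = join_queries_to_corpus(corpus, qrels, queries)
print(joined[0]["texts"])
```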

# OCR version (PPOCR-v3)
## Clean PlotQA
```
load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## Clean SlideVQA
```
load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## Clean InfoVQA
```
load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## Clean ArxivQA
```
load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## Clean ChartQA
```
load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## Clean MP-DocVQA
```
load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## PlotQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## SlideVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## InfoVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## ArxivQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## ChartQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## MP-DocVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
```
load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## rvl cdip (3k), clean
```
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## rvl cdip (REALWORLD) (3k), real-world degraded
```
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## MP-DocVQA (REALWORLD) (741), real-world degraded
```
load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
## ArxivQA (REALWORLD) (3000), real-world degraded
```
load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir="/mnt/191/a/lyw/VisRAG/Alldatasets")["train"]
```
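Because the synthetic-degraded corpora reuse the clean qrels, every `"corpus-id"` referenced by those qrels should also exist in the degraded (or PPOCR-v3) corpus. A small sanity check along these lines may help after swapping corpora. It is only a sketch: `missing_corpus_ids` is a hypothetical helper, and the toy rows stand in for the loaded `"train"` splits.

```python
def missing_corpus_ids(qrels, corpus):
    """Hypothetical check: corpus ids that qrels reference but the corpus lacks."""
    present = {row["corpus-id"] for row in corpus}
    referenced = {row["corpus-id"] for row in qrels}
    return sorted(referenced - present)


# Toy rows for illustration only.
clean_qrels = [
    {"query-id": "q1", "corpus-id": "1.png"},
    {"query-id": "q2", "corpus-id": "2.png"},
]
degraded_corpus = [{"corpus-id": "1.png", "text": "blurry text"}]

missing = missing_corpus_ids(clean_qrels, degraded_corpus)
print(missing)
```

An empty result means the shared qrels can be used with that corpus as-is.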
|
Project/Doc/dataset.py
ADDED
|
@@ -0,0 +1,265 @@
# datasets
"""
# How to load the train datasets:
from datasets import load_dataset
db = load_dataset(...)["train"]
for x in db:
    # x is a dict, e.g.
    # {"corpus-id": "6519.png", "image": <PIL.PngImagePlugin.PngImageFile\
    # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
    ...
## How to load the test datasets:
from datasets import load_dataset
dbcorpus = load_dataset(..., "corpus")["train"]
dbqrels = load_dataset(..., "qrels")["train"]
dbqueries = load_dataset(..., "queries")["train"]
## If it is an image set
for x in dbcorpus:
    # x is a dict, e.g.
    # {"corpus-id": "image id", "image": <PIL.PngImagePlugin.PngImageFile\
    # image mode=RGBA size=1263x700 at 0x7F0303CD6AD0>}
    ...
for x in dbqrels:
    # x is a dict, e.g.
    # {"query-id": "question id", "corpus-id": "image id",}
    ...
for x in dbqueries:
    # x is a dict, e.g.
    # {"query-id": "question id", "query": "question", "answer": "answer to the question"}
    ...
## If it is an OCR dataset
for x in dbcorpus:
    # x is a dict, e.g.
    # {"corpus-id": "6519.png", "text": "string to describe a photo"}
    ...
for x in dbqrels:
    # x is a dict, e.g.
    # {"query-id": "question id", "corpus-id": "image id",}
    ...
for x in dbqueries:
    # x is a dict, e.g.
    # {"query-id": "question id", "query": "question", "answer": "answer to the question"}
    ...
"""
"""
cd /group-volume/Behaviour-Analysis/users/hsiang.chen/Robust/scripts/ && conda activate /group-volume/Human-Action-Analysis/users/hsiang.chen/envs/autodir/ && python dataset.py
"""

from datasets import load_dataset

save_root = r"/home/work/shared-fi-datasets-01/users/hsiang.chen/Project/Robust/Dataset"
# # Train datasets:
# ## 122k in-domain dataset (arxiv, plotqa, ...)
# load_dataset("openbmb/VisRAG-Ret-Train-In-domain-data", cache_dir=save_root)["train"]
# ## 239k synthetic dataset
# load_dataset("openbmb/VisRAG-Ret-Train-Synthetic-data", cache_dir=save_root)["train"]

# # Test datasets: (each test dataset has 3 parts: corpus, qrels, queries) (image version and OCR version)
# # Image version
# ## Clean PlotQA
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## Clean SlideVQA
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## Clean InfoVQA
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## Clean ArxivQA
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## Clean ChartQA
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## Clean MP-DocVQA
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
# ## PlotQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-PlotQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## SlideVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-SlideVQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## InfoVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-InfoVQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ArxivQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## ChartQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ChartQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-MP-DocVQA-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]

## rvl cdip (3k), clean, fixed
load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "corpus", cache_dir=save_root)["train"]
load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "qrels", cache_dir=save_root)["train"]
load_dataset("rweics5cs7/exo7-realworld-db-combined-fixed", "queries", cache_dir=save_root)["train"]
# ## rvl cdip (3k), clean
# load_dataset("rweics5cs7/exo7-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined", "queries", cache_dir=save_root)["train"]
# ## rvl cdip (REALWORLD) (3k), real-world degraded
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg", "queries", cache_dir=save_root)["train"]
# ## rvl cdip (REALWORLD) (3k), real-world degraded (fixed version - 1k query)
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo7-realworld-db-combined-deg-fixed", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (REALWORLD) (741), real-world degraded
# load_dataset("rweics5cs7/exo9-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (REALWORLD) (3000), real-world degraded
# load_dataset("rweics5cs7/exo10-realworld-db-combined", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined", "queries", cache_dir=save_root)["train"]

# # OCR version (PPOCR-v5)
# ## Clean PlotQA
# load_dataset("rweics5cs7/exo3-original-PlotQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## Clean SlideVQA
# load_dataset("rweics5cs7/exo3-original-SlideVQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## Clean InfoVQA
# load_dataset("rweics5cs7/exo3-original-InfoVQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## Clean ArxivQA
# load_dataset("rweics5cs7/exo3-original-ArxivQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## Clean ChartQA
# load_dataset("rweics5cs7/exo3-original-ChartQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## Clean MP-DocVQA
# load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
# ## PlotQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## SlideVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## InfoVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## ChartQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]

# ## rvl cdip (3k), clean
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
# ## rvl cdip (REALWORLD) (3k), real-world degraded
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (REALWORLD) (741), real-world degraded
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (REALWORLD) (3000), real-world degraded
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text", "queries", cache_dir=save_root)["train"]

# # OCR version (PPOCR-v3)
# ## Clean PlotQA
# load_dataset("rweics5cs7/exo3-original-PlotQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## Clean SlideVQA
# load_dataset("rweics5cs7/exo3-original-SlideVQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## Clean InfoVQA
# load_dataset("rweics5cs7/exo3-original-InfoVQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## Clean ArxivQA
# load_dataset("rweics5cs7/exo3-original-ArxivQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## Clean ChartQA
# load_dataset("rweics5cs7/exo3-original-ChartQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## Clean MP-DocVQA
# load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]
# ## PlotQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-PlotQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-PlotQA", "queries", cache_dir=save_root)["train"]
# ## SlideVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-SlideVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-SlideVQA", "queries", cache_dir=save_root)["train"]
# ## InfoVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-InfoVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-InfoVQA", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ArxivQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ArxivQA", "queries", cache_dir=save_root)["train"]
# ## ChartQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-ChartQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-ChartQA", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (degraded (synthetic)), shares "qrels" and "queries" with the clean set
# load_dataset("rweics5cs7/exo3-original-MP-DocVQA-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "qrels", cache_dir=save_root)["train"]
# load_dataset("openbmb/VisRAG-Ret-Test-MP-DocVQA", "queries", cache_dir=save_root)["train"]

# ## rvl cdip (3k), clean
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
# ## rvl cdip (REALWORLD) (3k), real-world degraded
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo8-realworld-db-combined-text-deg-v3", "queries", cache_dir=save_root)["train"]
# ## MP-DocVQA (REALWORLD) (741), real-world degraded
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo9-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]
# ## ArxivQA (REALWORLD) (3000), real-world degraded
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "corpus", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "qrels", cache_dir=save_root)["train"]
# load_dataset("rweics5cs7/exo10-realworld-db-combined-text-v3", "queries", cache_dir=save_root)["train"]