1f committed on
Commit 9cdf7a2 · verified · 1 Parent(s): 4b70ea9

Add files using upload-large-folder tool

Files changed (20)
  1. r1-a/response_generation/minicpm/MiniCPM-o/docs/wechat.md +6 -0
  2. r1-a/response_generation/minicpm/MiniCPM-o/docs/xinference_infer.md +67 -0
  3. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README.md +543 -0
  4. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README_zh.md +537 -0
  5. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/.env +28 -0
  6. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements.txt +30 -0
  7. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements/docs.txt +11 -0
  8. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/run.py +424 -0
  9. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/scripts/run_inference.sh +41 -0
  10. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/setup.py +122 -0
  11. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/__init__.py +16 -0
  12. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/__init__.py +5 -0
  13. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/base.py +289 -0
  14. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/gpt.py +267 -0
  15. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/config.py +20 -0
  16. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/dataset/__init__.py +237 -0
  17. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference.py +188 -0
  18. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_mt.py +182 -0
  19. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_video.py +183 -0
  20. r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/tools.py +468 -0
r1-a/response_generation/minicpm/MiniCPM-o/docs/wechat.md ADDED
@@ -0,0 +1,6 @@
<div align="center">
<img src="../assets/wechat-QR.jpeg" width="60%"/>

<p> 扫码加入「MiniCPM-o 交流群」 </p>
<p> Scan the QR code to join the "MiniCPM-o Discussion Group" </p>
</div>
r1-a/response_generation/minicpm/MiniCPM-o/docs/xinference_infer.md ADDED
@@ -0,0 +1,67 @@
## Xinference Infer
Xinference is a unified inference platform that provides a single interface to different inference engines. It supports LLMs, text generation, image generation, and more, while being not much heavier than Swift.


### Xinference installation
Xinference can be installed with the following command:
```shell
pip install "xinference[all]"
```
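The `all` extra pulls in every supported backend. Since the walkthrough below only uses the Transformers engine, a lighter install may suffice (assuming the `transformers` extra that Xinference publishes):
```shell
pip install "xinference[transformers]"
```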

### Quick start
When running inference with Xinference for the first time, the model is downloaded during launch.
1. Start Xinference in the terminal:
```shell
xinference
```
2. Open the web UI.
3. Search for "MiniCPM-Llama3-V-2_5" in the search box.

![alt text](../assets/xinferenc_demo_image/xinference_search_box.png)

4. Find and click the MiniCPM-Llama3-V-2_5 button.
5. Use the following configuration and launch the model:
```plaintext
Model engine : Transformers
model format : pytorch
Model size : 8
quantization : none
N-GPU : auto
Replica : 1
```
6. The first time you click the launch button, Xinference downloads the model from Hugging Face. Once it is ready, click the web UI button.

![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)

7. Upload an image and chat with MiniCPM-Llama3-V-2_5.

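Once the model is running, you can also query it programmatically. Xinference exposes an OpenAI-compatible API; the sketch below assumes the default local endpoint `http://localhost:9997/v1` and that the launched model's UID is `MiniCPM-Llama3-V-2_5`:
```python
# Sketch: chat with the launched model through Xinference's OpenAI-compatible API.
# Assumptions: default endpoint http://localhost:9997/v1 and model UID "MiniCPM-Llama3-V-2_5".
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

# Encode a local image as a data URL so it can be sent inline.
with open("example.jpeg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="MiniCPM-Llama3-V-2_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```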
### Local MiniCPM-Llama3-V-2_5 Launch
If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can run Xinference inference with the following steps:
1. Start Xinference:
```shell
xinference
```
2. Open the web UI.
3. Register a new model. The settings highlighted in red are fixed and cannot be changed; the others are customizable. Complete the process by clicking the 'Register Model' button.

![alt text](../assets/xinferenc_demo_image/xinference_register_model1.png)
![alt text](../assets/xinferenc_demo_image/xinference_register_model2.png)

4. After completing the registration, go to 'Custom Models' and locate the model you just registered.
5. Use the following configuration and launch the model:
```plaintext
Model engine : Transformers
model format : pytorch
Model size : 8
quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button, Xinference loads the model; once it is ready, click the chat button.
![alt text](../assets/xinferenc_demo_image/xinference_webui_button.png)
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.

### FAQ
1. Why can't the web UI open in the sixth step?

Your firewall or macOS security settings may be blocking the page from opening.
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README.md ADDED
@@ -0,0 +1,543 @@
# Evaluation

## MiniCPM-o 2.6

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
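Note that these pinned wheels target Python 3.10 and CUDA 11.8 (the `cp310` and `cu118` tags in the filenames); on other Python or CUDA versions, substitute matching wheels.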
<br />

Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` is the name of the model, and `DATALIST` names the datasets used for inference:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST
```
<br />

The five available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```python
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating multiple datasets at a time, separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
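With the variable set, pass it quoted as the second argument, e.g. `./scripts/run_inference.sh MiniCPM-o-2_6 "$DATALIST"`.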
<br />

When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file.
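For example, a minimal `.env` (the full template ships with vlmevalkit and appears later in this commit; the values here are placeholders):
```bash
OPENAI_API_KEY=your-key-here
OPENAI_API_BASE=https://api.openai.com/v1
```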
In order to reproduce the results on the OpenCompass benchmarks together with ChartQA and MME, as displayed in the table on the homepage (columns between OCRBench and HallusionBench), run the script with the following settings:
```bash
# Please note that we use different prompts for the perception and reasoning sets of MME. While evaluating on the reasoning subset, CoT is required, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
```
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from ckpt
--ckpt
# the way the model processes input data; "interleave" means interleaved image-text form, while "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

While evaluating on different tasks, the parameters need to be set as follows.
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />
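Putting the pieces together, a filled-in invocation for the DocVQA validation split might look like the sketch below (the entry-point name `eval.py` is a placeholder; use whatever `shell/run_inference.sh` actually calls):
```bash
# Sketch only: flags mirror the parameter list above; eval.py is a hypothetical entry point.
python eval.py \
    --model_name minicpmo26 \
    --model_path openbmb/MiniCPM-o-2_6 \
    --generate_method interleave \
    --batchsize 1 \
    --eval_docVQA \
    --docVQA_image_dir ./downloads/DocVQA/spdocvqa_images \
    --docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json \
    --answer_path ./answers
```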

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path for the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

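For reference, the transformation essentially reshapes the raw outputs into the submission format the RRC portal expects, a JSON list of question-id/answer pairs. A minimal sketch (the raw-output field names are assumptions; `run_transform.sh` is the authoritative implementation):
```python
# Sketch: convert raw DocVQATest outputs to the RRC submission format.
# Assumption: each raw record carries a question id and the model's answer
# under keys like "question_id" / "answer"; adjust to the real output schema.
import json

with open("raw_outputs.json") as f:
    raw = json.load(f)

submission = [{"questionId": int(r["question_id"]), "answer": r["answer"]} for r in raw]

with open("submission.json", "w") as f:
    json.dump(submission, f)
```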
<br />

## MiniCPM-V 2.6

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` names the datasets used for inference, and `MODE` is the evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

The four available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```python
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. In order to reproduce the results in the table displayed on the homepage (columns between MME and HallusionBench), run the script with the following settings:
```bash
# without CoT
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
# with CoT
# While running the CoT version of MME, you need to modify the use_cot function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
```
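The manual edit mentioned above could look like the following sketch (the real signature in `vlmeval/vlm/minicpm_v.py` may differ; the dataset set mirrors the CoT run above):
```python
# Sketch of the edited use_cot in vlmeval/vlm/minicpm_v.py (signature assumed).
def use_cot(self, dataset):
    # Datasets evaluated with chain-of-thought prompting; 'MME' added for the CoT run.
    return dataset in {'MMMU_DEV_VAL', 'MMVet', 'MMStar', 'HallusionBench', 'OCRBench', 'MME'}
```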
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from ckpt
--ckpt
# the way the model processes input data; "interleave" means interleaved image-text form, while "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

While evaluating on different tasks, the parameters need to be set as follows.
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path for the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>

<br />

## MiniCPM-Llama3-V-2_5

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install -r requirements.txt
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` names the datasets used for inference, and `MODE` is the evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

The three available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```python
ungrouped = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating a single dataset, pass the dataset name directly without quotation marks; when evaluating multiple datasets, separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. In order to reproduce the results in the table displayed on the homepage (columns between MME and RealWorldQA), run the script with the following settings:
```bash
# run on all 7 datasets
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all

# The following are instructions for running on a single dataset
# MME
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
# MMBench_TEST_EN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
# MMBench_TEST_CN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
# MMMU_DEV_VAL
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
# MathVista_MINI
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
# LLaVABench
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
# RealWorldQA
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
```
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move the spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from ckpt
--ckpt
# the way the model processes input data; "interleave" means interleaved image-text form, while "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

While evaluating on different tasks, the parameters need to be set as follows.
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path for the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README_zh.md ADDED
@@ -0,0 +1,537 @@
# Evaluation

## MiniCPM-o 2.6

### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
rm *.whl
```
<br />

Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` is the name of the model, and `DATALIST` names the target datasets.
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST
```
<br />

There are five choices for `MODELNAME`, listed in `vlmeval/config.py`:
```python
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating multiple datasets at a time, separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />

When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file in advance.
To reproduce the results on the OpenCompass datasets plus ChartQA and MME shown in the table on the homepage (columns between OCRBench and HallusionBench), run with the following settings:
```bash
# Please note that we use different prompts for the perception and reasoning sets of MME. When evaluating the reasoning subset, CoT is required, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
```
<br />

### vqadataset
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Then, download the datasets from the following addresses and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering from https://rrc.cvc.uab.es/?ch=17&com=downloads
# Place the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip in the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

The parameters that can be passed are listed in `eval_utils/getargs.py`; the meanings of the main ones are as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
```bash
# paths to all images and questions for TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# paths to all images and questions for DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# paths to all images and questions for DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a certain task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (load the model from the specified path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# the way the model processes input data; interleave means interleaved image-text form, old means non-interleaved
--generate_method
# batch size for inference; setting it to 1 is recommended
--batchsize

# path to save the outputs
--answer_path
```
<br />

The parameters for evaluating the three tasks are set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` for format conversion. `input_file_path` is the path to the original output json, and `output_file_path` is a custom path for the converted json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

<br />

## MiniCPM-V 2.6

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
rm *.whl
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` names the target datasets, and `MODE` is the evaluation mode.
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

There are four choices for `MODELNAME`, listed in `vlmeval/config.py`:
```python
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`.
To reproduce the results in the table on the homepage (columns between MME and HallusionBench), run with the following settings:
```bash
# without CoT
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
# with CoT: when running the CoT version of MME, you need to modify the 'use_cot' function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
```
<br />

### vqadataset
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Then, download the datasets from the following addresses and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering from https://rrc.cvc.uab.es/?ch=17&com=downloads
# Place the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip in the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

The parameters that can be passed are listed in `eval_utils/getargs.py`; the meanings of the main ones are as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
```bash
# paths to all images and questions for TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# paths to all images and questions for DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# paths to all images and questions for DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a certain task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (load the model from the specified path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# the way the model processes input data; interleave means interleaved image-text form, old means non-interleaved
--generate_method
# batch size for inference; setting it to 1 is recommended
--batchsize

# path to save the outputs
--answer_path
```
<br />

The parameters for evaluating the three tasks are set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` for format conversion. `input_file_path` is the path to the original output json, and `output_file_path` is a custom path for the converted json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>

<br />

## MiniCPM-Llama3-V-2_5

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install the necessary dependencies:
```bash
cd vlmevalkit
pip install -r requirements.txt
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` is the name of the model, `DATALIST` names the target datasets, and `MODE` is the evaluation mode.
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

There are three choices for `MODELNAME`, listed in `vlmeval/config.py`:
```python
ungrouped = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating a single dataset, pass the dataset name directly without quotation marks; when evaluating multiple datasets, separate the dataset names with spaces and wrap the whole list in quotation marks:
```bash
DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`.
To reproduce the results in the table on the homepage (columns between MME and RealWorldQA), run with the following settings:
```bash
# run all 7 datasets at once
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all

# The following are commands for running a single dataset
# MME
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
# MMBench_TEST_EN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
# MMBench_TEST_CN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
# MMMU_DEV_VAL
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
# MathVista_MINI
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
# LLaVABench
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
# RealWorldQA
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
```
<br />

### vqadataset
First, enter the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Then, download the datasets from the following addresses and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering from https://rrc.cvc.uab.es/?ch=17&com=downloads
# Place the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip in the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

After preparing the datasets, modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

The parameters that can be passed are listed in `eval_utils/getargs.py`; the meanings of the main ones are as follows.
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
```bash
# paths to all images and questions for TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# paths to all images and questions for DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# paths to all images and questions for DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a certain task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (load the model from the specified path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# the way the model processes input data; interleave means interleaved image-text form, old means non-interleaved
--generate_method
# batch size for inference; setting it to 1 is recommended
--batchsize

# path to save the outputs
--answer_path
```
<br />

The parameters for evaluating the three tasks are set as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` for format conversion. `input_file_path` is the path to the original output json, and `output_file_path` is a custom path for the converted json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/.env ADDED
@@ -0,0 +1,28 @@
# .env file; place it under $VLMEvalKit
# API keys of proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=
# You can set a proxy for evaluation; API calls made during the evaluation stage will go through it
EVAL_PROXY=
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements.txt ADDED
@@ -0,0 +1,30 @@
decord; platform_machine != 'arm64'
eva-decord; platform_machine == 'arm64'
gradio
huggingface_hub
imageio
matplotlib
numpy
omegaconf
openai
opencv-python>=4.4.0.46
openpyxl
pandas
pillow
portalocker
protobuf
python-dotenv
requests
rich
sentencepiece
setuptools
sty
tabulate
tiktoken
timeout-decorator
torch
tqdm
transformers
typing_extensions
validators
xlsxwriter
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements/docs.txt ADDED
@@ -0,0 +1,11 @@
docutils==0.18.1
modelindex
myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
tabulate
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/run.py ADDED
@@ -0,0 +1,424 @@
import torch
import torch.distributed as dist

from vlmeval.config import supported_VLM
from vlmeval.dataset.video_dataset_config import supported_video_datasets
from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
from vlmeval.inference_mt import infer_data_job_mt
from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer


def build_model_from_config(cfg, model_name):
    import vlmeval.api
    import vlmeval.vlm
    config = cp.deepcopy(cfg[model_name])
    if config == {}:
        return supported_VLM[model_name]()
    assert 'class' in config
    cls_name = config.pop('class')
    if hasattr(vlmeval.api, cls_name):
        return getattr(vlmeval.api, cls_name)(**config)
    elif hasattr(vlmeval.vlm, cls_name):
        return getattr(vlmeval.vlm, cls_name)(**config)
    else:
        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.api` or `vlmeval.vlm`')


def build_dataset_from_config(cfg, dataset_name):
    import vlmeval.dataset
    import inspect
    config = cp.deepcopy(cfg[dataset_name])
    if config == {}:
        return supported_video_datasets[dataset_name]()
    assert 'class' in config
    cls_name = config.pop('class')
    if hasattr(vlmeval.dataset, cls_name):
        cls = getattr(vlmeval.dataset, cls_name)
        sig = inspect.signature(cls.__init__)
        valid_params = {k: v for k, v in config.items() if k in sig.parameters}
        if valid_params.get('fps', 0) > 0 and valid_params.get('nframe', 0) > 0:
            raise ValueError('fps and nframe should not be set at the same time')
        if valid_params.get('fps', 0) <= 0 and valid_params.get('nframe', 0) <= 0:
            raise ValueError('fps and nframe should be set at least one valid value')
        return cls(**valid_params)
    else:
        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.dataset`')


def parse_args():
    help_msg = """\
You can launch the evaluation by setting either --data and --model or --config.

--data and --model:
    Each Arg should be a list of strings, specifying the names of datasets and models.
    To find all supported model names, please refer to the `vlmeval/config.py` or check the output of the command \
`vlmutil mlist all` in the terminal (you should first have vlmeval installed).
    To find all supported dataset names, please refer to the `vlmeval/dataset/__init__.py` file. The python script \
to print all supported dataset names is as follows:
    ```python
    from vlmeval.dataset import SUPPORTED_DATASETS
    print(SUPPORTED_DATASETS)
    ```
    or you can check the output of the command `vlmutil dlist all` in the terminal.
    To find all supported video dataset default settings, please refer to the \
`vlmeval/dataset/video_dataset_config.py` file.

--config:
    Launch the evaluation by specifying the path to the config json file. Sample Json Content:
    ```json
    {
        "model": {
            "GPT4o_20240806_T00_HIGH": {
                "class": "GPT4V",
                "model": "gpt-4o-2024-08-06",
                "temperature": 0,
                "img_detail": "high"
            },
            "GPT4o_20240806_T10_Low": {
                "class": "GPT4V",
                "model": "gpt-4o-2024-08-06",
                "temperature": 1.0,
                "img_detail": "low"
            },
            "GPT4o_20241120": {}
        },
        "data": {
            "MME-RealWorld-Lite": {
                "class": "MMERealWorld",
                "dataset": "MME-RealWorld-Lite"
            },
            "MMBench_DEV_EN_V11": {
                "class": "ImageMCQDataset",
                "dataset": "MMBench_DEV_EN_V11"
            },
            "MMBench_Video_8frame_nopack": {},
            "Video-MME_16frame_subs": {
                "class": "VideoMME",
                "dataset": "Video-MME",
                "nframe": 16,
                "use_subtitle": true
            }
        }
    }
    ```
    Currently, only `model` and `data` are supported fields. The content of each field is a dictionary.
    For `model`, the key is the name of the model, and the value is a dictionary containing the following keys:
    - `class`: The class name of the model, which should be a class in `vlmeval.vlm` or `vlmeval.api`.
    - Other keys are specific to the model, please refer to the corresponding class.
    - Tip: The defined model in the `supported_VLM` of `vlmeval/config.py` can be used as a shortcut.
    For `data`, the key is the name of the dataset (should be the same as the `dataset` field in most cases, \
except for video datasets), and the value is a dictionary containing the following keys:
    - `class`: The class name of the dataset, which should be a class in `vlmeval.dataset`.
    - `dataset`: The name of the dataset, which should be a string that is accepted by the `dataset` argument of the \
corresponding class.
    - Other keys are specific to the dataset, please refer to the corresponding class.
    - Tip: The defined dataset in the `supported_video_datasets` of `vlmeval/dataset/video_dataset_config.py` \
can be used as a shortcut.

    The keys in the `model` and `data` fields will be used for naming the prediction files and evaluation results.
    When launching with `--config`, args for API VLMs, such as `--retry`, `--verbose`, will be ignored.
"""
    parser = argparse.ArgumentParser(description=help_msg, formatter_class=argparse.RawTextHelpFormatter)
    # Essential Args, Setting the Names of Datasets and Models
    parser.add_argument('--data', type=str, nargs='+', help='Names of Datasets')
    parser.add_argument('--model', type=str, nargs='+', help='Names of Models')
    parser.add_argument('--config', type=str, help='Path to the Config Json File')
    # Work Dir
    parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
    # Infer + Eval or Infer Only
    parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
    # API Kwargs, Apply to API VLMs and Judge API LLMs
    parser.add_argument('--api_nproc', type=int, default=4, help='Parallel API calling')
    parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
    # Explicitly Set the Judge Model
    parser.add_argument('--judge', type=str, default=None)
    # Logging Utils
    parser.add_argument('--verbose', action='store_true')
    # Configuration for Resume
    # Ignore: will not rerun failed VLM inference
    parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
    # Reuse: will reuse the existing prediction files
    parser.add_argument('--reuse', action='store_true')

    args = parser.parse_args()
    return args
+
149
+
150
+ def main():
151
+ logger = get_logger('RUN')
152
+ rank, world_size = get_rank_and_world_size()
153
+ args = parse_args()
154
+ use_config, cfg = False, None
155
+ if args.config is not None:
156
+ assert args.data is None and args.model is None, '--data and --model should not be set when using --config'
157
+ use_config, cfg = True, load(args.config)
158
+ args.model = list(cfg['model'].keys())
159
+ args.data = list(cfg['data'].keys())
160
+ else:
161
+ assert len(args.data), '--data should be a list of dataset names'
162
+
163
+ if rank == 0:
164
+ if not args.reuse:
165
+ logger.warning('--reuse is not set, will not reuse previous prediction or temporary files')
166
+ else:
167
+ logger.warning('--reuse is set, will reuse the latest prediction & temporary pickle files')
168
+
169
+ if 'MMEVAL_ROOT' in os.environ:
170
+ args.work_dir = os.environ['MMEVAL_ROOT']
171
+
172
+ if not use_config:
173
+ for k, v in supported_VLM.items():
174
+ if hasattr(v, 'keywords') and 'retry' in v.keywords and args.retry is not None:
175
+ v.keywords['retry'] = args.retry
176
+ supported_VLM[k] = v
177
+ if hasattr(v, 'keywords') and 'verbose' in v.keywords and args.verbose is not None:
178
+ v.keywords['verbose'] = args.verbose
179
+ supported_VLM[k] = v
180
+
181
+ if world_size > 1:
182
+ local_rank = os.environ.get('LOCAL_RANK', 0)
183
+ torch.cuda.set_device(int(local_rank))
184
+ dist.init_process_group(
185
+ backend='nccl',
186
+ timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
187
+ )
188
+
189
+ for _, model_name in enumerate(args.model):
190
+ model = None
191
+ date, commit_id = timestr('day'), githash(digits=8)
192
+ eval_id = f"T{date}_G{commit_id}"
193
+
194
+ pred_root = osp.join(args.work_dir, model_name, eval_id)
195
+ pred_root_meta = osp.join(args.work_dir, model_name)
196
+ os.makedirs(pred_root_meta, exist_ok=True)
197
+
198
+ prev_pred_roots = ls(osp.join(args.work_dir, model_name), mode='dir')
199
+ if len(prev_pred_roots) and args.reuse:
200
+ prev_pred_roots.sort()
201
+
202
+ if not osp.exists(pred_root):
203
+ os.makedirs(pred_root, exist_ok=True)
204
+
205
+ if use_config:
206
+ model = build_model_from_config(cfg['model'], model_name)
207
+
208
+ for _, dataset_name in enumerate(args.data):
209
+ try:
210
+ result_file_base = f'{model_name}_{dataset_name}.xlsx'
211
+
212
+ if use_config:
213
+ if world_size > 1:
214
+ if rank == 0:
215
+ dataset = build_dataset_from_config(cfg['data'], dataset_name)
216
+ dist.barrier()
217
+ dataset = build_dataset_from_config(cfg['data'], dataset_name)
218
+ if dataset is None:
219
+ logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
220
+ continue
221
+ else:
222
+ dataset_kwargs = {}
223
+ if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
224
+ dataset_kwargs['model'] = model_name
225
+
226
+ # If distributed, first build the dataset on the main process so any preparation work is done once
227
+ if world_size > 1:
228
+ if rank == 0:
229
+ dataset = build_dataset(dataset_name, **dataset_kwargs)
230
+ dist.barrier()
231
+
232
+ dataset = build_dataset(dataset_name, **dataset_kwargs)
233
+ if dataset is None:
234
+ logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
235
+ continue
236
+
237
+ # Handling Multi-Turn Dataset
238
+ if dataset.TYPE == 'MT':
239
+ result_file_base = result_file_base.replace('.xlsx', '.tsv')
240
+
241
+ result_file = osp.join(pred_root, result_file_base)
242
+
243
+ # Reuse the previous prediction file if exists
244
+ if rank == 0 and len(prev_pred_roots):
245
+ prev_result_file = None
246
+ prev_pkl_file_list = []
247
+ for root in prev_pred_roots[::-1]:
248
+ if osp.exists(osp.join(root, result_file_base)):
249
+ prev_result_file = osp.join(root, result_file_base)
250
+ break
251
+ elif commit_id in root and len(ls(root)) and root != pred_root:
252
+ temp_files = ls(root, match=[dataset_name, '.pkl'])
253
+ if len(temp_files):
254
+ prev_pkl_file_list.extend(temp_files)
255
+ break
256
+ if not args.reuse:
257
+ prev_result_file = None
258
+ prev_pkl_file_list = []
259
+ if prev_result_file is not None:
260
+ logger.warning(
261
+ f'--reuse is set, will reuse the prediction file {prev_result_file}.')
262
+ if prev_result_file != result_file:
263
+ shutil.copy(prev_result_file, result_file)
264
+ elif len(prev_pkl_file_list):
265
+ for fname in prev_pkl_file_list:
266
+ target_path = osp.join(pred_root, osp.basename(fname))
267
+ if not osp.exists(target_path):
268
+ shutil.copy(fname, target_path)
269
+ logger.info(f'--reuse is set, will reuse the prediction pickle file {fname}.')
270
+ else:
271
+ logger.warning(f'File already exists: {target_path}')
272
+
273
+ if world_size > 1:
274
+ dist.barrier()
275
+
276
+ if model is None:
277
+ model = model_name # which is only a name
278
+
279
+ # Perform the Inference
280
+ if dataset.MODALITY == 'VIDEO':
281
+ model = infer_data_job_video(
282
+ model,
283
+ work_dir=pred_root,
284
+ model_name=model_name,
285
+ dataset=dataset,
286
+ result_file_name=result_file_base,
287
+ verbose=args.verbose,
288
+ api_nproc=args.api_nproc)
289
+ elif dataset.TYPE == 'MT':
290
+ model = infer_data_job_mt(
291
+ model,
292
+ work_dir=pred_root,
293
+ model_name=model_name,
294
+ dataset=dataset,
295
+ verbose=args.verbose,
296
+ api_nproc=args.api_nproc,
297
+ ignore_failed=args.ignore)
298
+ else:
299
+ model = infer_data_job(
300
+ model,
301
+ work_dir=pred_root,
302
+ model_name=model_name,
303
+ dataset=dataset,
304
+ verbose=args.verbose,
305
+ api_nproc=args.api_nproc,
306
+ ignore_failed=args.ignore)
307
+
308
+ # Set the judge kwargs first before evaluation or dumping
309
+
310
+ judge_kwargs = {
311
+ 'nproc': args.api_nproc,
312
+ 'verbose': args.verbose,
313
+ 'retry': args.retry if args.retry is not None else 3
314
+ }
315
+
318
+ if args.judge is not None:
319
+ judge_kwargs['model'] = args.judge
320
+ else:
321
+ if dataset.TYPE in ['MCQ', 'Y/N']:
322
+ judge_kwargs['model'] = 'chatgpt-0125'
323
+ elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
324
+ judge_kwargs['model'] = 'gpt-4-turbo'
325
+ elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'DynaMath', 'VL-RewardBench', 'WeMath', 'LogicVista'], dataset_name): # noqa: E501
326
+ judge_kwargs['model'] = 'gpt-4o-mini'
327
+ elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench', 'WildVision'], dataset_name): # noqa: E501
328
+ judge_kwargs['model'] = 'gpt-4o'
329
+
330
+ if rank == 0:
331
+ logger.info(judge_kwargs)
332
+
333
+ if world_size > 1:
334
+ dist.barrier()
335
+
336
+ # Only Rank 0 handles the evaluation part
337
+ if rank == 0:
338
+ # Prepare Submission Files for MMMU_TEST AND MMT-Bench_ALL
339
+ if dataset_name in ['MMMU_TEST']:
340
+ result_json = MMMU_result_transfer(result_file)
341
+ logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
342
+ f'json file saved in {result_json}')
343
+ continue
344
+ elif 'MMT-Bench_ALL' in dataset_name:
345
+ submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
346
+ logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
347
+ f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
348
+ f'submission file saved in {submission_file}')
349
+ continue
350
+
351
+ # Skip the evaluation part if only infer
352
+ if args.mode == 'infer':
353
+ continue
354
+
355
+ # Skip the evaluation part if the dataset evaluation is not supported or annotations are missing
356
+ if 'MLLMGuard_DS' in dataset_name:
357
+ logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')
358
+ continue
359
+ elif 'AesBench_TEST' == dataset_name:
360
+ logger.info(f'The results are saved in {result_file}. '
361
+ f'Please send it to the AesBench Team via huangyipo@hotmail.com.')
362
+ continue
363
+ elif dataset_name in ['DocVQA_TEST', 'InfoVQA_TEST', 'Q-Bench1_TEST', 'A-Bench_TEST']:
364
+ logger.info(f'{dataset_name} is a test split without ground-truth. '
365
+ 'Thus only the inference part is supported for those datasets. ')
366
+ continue
367
+ elif dataset_name in [
368
+ 'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
369
+ 'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
370
+ ] and not MMBenchOfficialServer(dataset_name):
371
+ logger.error(
372
+ f'Can not evaluate {dataset_name} on non-official servers, will skip the evaluation.')
373
+ continue
374
+
375
+ # Setup the proxy for the evaluation
376
+ eval_proxy = os.environ.get('EVAL_PROXY', None)
377
+ old_proxy = os.environ.get('HTTP_PROXY', '')
378
+ if eval_proxy is not None:
379
+ proxy_set(eval_proxy)
380
+
381
+ # Perform the Evaluation
382
+ eval_results = dataset.evaluate(result_file, **judge_kwargs)
383
+ # Display Evaluation Results in Terminal
384
+ if eval_results is not None:
385
+ assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
386
+ logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
387
+ logger.info('Evaluation Results:')
388
+ if isinstance(eval_results, dict):
389
+ logger.info('\n' + json.dumps(eval_results, indent=4))
390
+ elif isinstance(eval_results, pd.DataFrame):
391
+ if len(eval_results) < len(eval_results.columns):
392
+ eval_results = eval_results.T
393
+ logger.info('\n' + tabulate(eval_results))
394
+
395
+ # Restore the proxy
396
+ if eval_proxy is not None:
397
+ proxy_set(old_proxy)
398
+
399
+ # Create the symbolic links for the prediction files
400
+ files = os.listdir(pred_root)
401
+ files = [x for x in files if (f'{model_name}_{dataset_name}' in x or "status.json" in x)]
402
+ for f in files:
403
+ cwd = os.getcwd()
404
+ file_addr = osp.join(cwd, pred_root, f)
405
+ link_addr = osp.join(cwd, pred_root_meta, f)
406
+ if osp.exists(link_addr) or osp.islink(link_addr):
407
+ os.remove(link_addr)
408
+ os.symlink(file_addr, link_addr)
409
+
410
+ except Exception as e:
411
+ logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
412
+ 'skipping this combination.')
413
+ continue
414
+
415
+ if world_size > 1:
416
+ dist.barrier()
417
+
418
+ if world_size > 1:
419
+ dist.destroy_process_group()
420
+
421
+
422
+ if __name__ == '__main__':
423
+ load_env()
424
+ main()
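+
+ # Example invocations (illustrative sketch; the model and dataset names are
+ # placeholders that must exist in `supported_VLM` / the dataset registry):
+ #   python run.py --data MMMU_DEV_VAL --model MiniCPM-o-2_6 --verbose
+ #   torchrun --nproc_per_node=8 run.py --config config.json --mode infer --reuse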
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/scripts/run_inference.sh ADDED
@@ -0,0 +1,41 @@
1
+ export PATH=/usr/local/cuda/bin:$PATH
2
+
3
+ export HF_ENDPOINT=https://hf-mirror.com
4
+ export OMP_NUM_THREADS=1
5
+ export timestamp=`date +"%Y%m%d%H%M%S"`
6
+ export OLD_VERSION='False'
7
+ # SELF_DIR: the directory containing this script
+ SELF_DIR=$(cd "$(dirname "$0")" && pwd)
+ export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
8
+ export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
9
+
10
+ # gpu consumed
11
+ # fp16 17-18G
12
+ # int4 7-8G
13
+
14
+ # model to be used
15
+ # Example: MODELNAME=MiniCPM-o-2_6
16
+ MODELNAME=$1
17
+ # datasets to be tested
18
+ # Example: DATALIST=MMMU_DEV_VAL
19
+ DATALIST=$2
20
+
21
+ # run on multi gpus with torchrun command
22
+ # remember to run twice, the first run may fail
23
+ for DATASET in $DATALIST; do
24
+ echo "Starting inference with model $MODELNAME on dataset $DATASET"
25
+ torchrun --master_port 29500 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
26
+ torchrun --master_port 29501 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
27
+
28
+ # for benchmarks which require gpt for scoring, you need to specify OPENAI_API_BASE and OPENAI_API_KEY in .env file
29
+ if [[ "$DATASET" == *"MMBench_TEST"*]]; then
30
+ echo "Skipping evaluation for dataset $DATASET"
31
+ else
32
+ echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
33
+ python run.py --data $DATASET --model $MODELNAME --api_nproc 16 --verbose
34
+ fi
35
+ done
36
+
37
+ # run on single gpu with python command
38
+ # python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
39
+ # python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
40
+ # echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
41
+ # python run.py --data $DATASET --model $MODELNAME --api_nproc 16 --verbose
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/setup.py ADDED
@@ -0,0 +1,122 @@
1
+ import re
2
+ import sys
3
+ from os.path import exists
4
+ from setuptools import find_packages, setup
5
+
6
+
7
+ def parse_requirements(fname='requirements.txt', with_version=True):
8
+ """Parse the package dependencies listed in a requirements file but strips
9
+ specific versioning information.
10
+
11
+ Args:
12
+ fname (str): path to requirements file
13
+ with_version (bool, default=True): if True include version specs
14
+
15
+ Returns:
16
+ List[str]: list of requirements items
17
+
18
+ CommandLine:
19
+ python -c "import setup; print(setup.parse_requirements())"
20
+ """
21
+
22
+ require_fpath = fname
23
+
24
+ def parse_line(line):
25
+ """Parse information from a line in a requirements text file."""
26
+ if line.startswith('-r '):
27
+ # Allow specifying requirements in other files
28
+ target = line.split(' ')[1]
29
+ for info in parse_require_file(target):
30
+ yield info
31
+ else:
32
+ info = {'line': line}
33
+ if line.startswith('-e '):
34
+ info['package'] = line.split('#egg=')[1]
35
+ elif '@git+' in line:
36
+ info['package'] = line
37
+ else:
38
+ # Remove versioning from the package
39
+ pat = '(' + '|'.join(['>=', '==', '>']) + ')'
40
+ parts = re.split(pat, line, maxsplit=1)
41
+ parts = [p.strip() for p in parts]
42
+
43
+ info['package'] = parts[0]
44
+ if len(parts) > 1:
45
+ op, rest = parts[1:]
46
+ if ';' in rest:
47
+ # Handle platform specific dependencies
48
+ # http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
49
+ version, platform_deps = map(str.strip,
50
+ rest.split(';'))
51
+ info['platform_deps'] = platform_deps
52
+ else:
53
+ version = rest # NOQA
54
+ info['version'] = (op, version)
55
+ yield info
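+ # For illustration (hypothetical requirement lines):
+ #   'torch>=1.13' yields {'line': 'torch>=1.13', 'package': 'torch', 'version': ('>=', '1.13')}
+ #   'pkg @git+https://example.com/pkg.git' is kept verbatim as the package spec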
56
+
57
+ def parse_require_file(fpath):
58
+ with open(fpath, 'r') as f:
59
+ for line in f.readlines():
60
+ line = line.strip()
61
+ if line and not line.startswith('#'):
62
+ for info in parse_line(line):
63
+ yield info
64
+
65
+ def gen_packages_items():
66
+ if exists(require_fpath):
67
+ for info in parse_require_file(require_fpath):
68
+ parts = [info['package']]
69
+ if with_version and 'version' in info:
70
+ parts.extend(info['version'])
71
+ if not sys.version.startswith('3.4'):
72
+ # apparently package_deps are broken in 3.4
73
+ platform_deps = info.get('platform_deps')
74
+ if platform_deps is not None:
75
+ parts.append(';' + platform_deps)
76
+ item = ''.join(parts)
77
+ yield item
78
+
79
+ packages = list(gen_packages_items())
80
+ return packages
81
+
82
+
83
+ with open('README.md') as f:
84
+ readme = f.read()
85
+
86
+
87
+ def do_setup():
88
+ setup(
89
+ name='vlmeval',
90
+ version='0.1.0',
91
+ description='OpenCompass VLM Evaluation Kit',
92
+ author='Haodong Duan',
93
+ author_email='dhd.efz@gmail.com',
94
+ maintainer='Haodong Duan',
95
+ maintainer_email='dhd.efz@gmail.com',
96
+ long_description=readme,
97
+ long_description_content_type='text/markdown',
98
+ cmdclass={},
99
+ install_requires=parse_requirements('requirements.txt'),
100
+ setup_requires=[],
101
+ python_requires='>=3.7.0',
102
+ packages=find_packages(exclude=[
103
+ 'test*',
104
+ 'paper_test*',
105
+ ]),
106
+ keywords=['AI', 'NLP', 'in-context learning'],
107
+ entry_points={
108
+ 'console_scripts': ['vlmutil = vlmeval:cli']
109
+ },
110
+ classifiers=[
111
+ 'Programming Language :: Python :: 3.7',
112
+ 'Programming Language :: Python :: 3.8',
113
+ 'Programming Language :: Python :: 3.9',
114
+ 'Programming Language :: Python :: 3.10',
115
+ 'Intended Audience :: Developers',
116
+ 'Intended Audience :: Education',
117
+ 'Intended Audience :: Science/Research',
118
+ ])
119
+
120
+
121
+ if __name__ == '__main__':
122
+ do_setup()
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/__init__.py ADDED
@@ -0,0 +1,16 @@
1
+ try:
2
+ import torch
3
+ except ImportError:
4
+ pass
5
+
6
+ from .smp import *
7
+ from .api import *
8
+ from .dataset import *
9
+ from .utils import *
10
+ from .vlm import *
11
+ from .config import *
12
+ from .tools import cli
13
+
14
+ load_env()
15
+
16
+ __version__ = '0.2rc1'
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ from .gpt import OpenAIWrapper, GPT4V
2
+
3
+ __all__ = [
4
+ 'OpenAIWrapper', 'GPT4V',
5
+ ]
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/base.py ADDED
@@ -0,0 +1,289 @@
1
+ import time
2
+ import random as rd
3
+ from abc import abstractmethod
4
+ import os.path as osp
5
+ import copy as cp
6
+ from ..smp import get_logger, parse_file, concat_images_vlmeval, LMUDataRoot, md5, decode_base64_to_image_file
7
+
8
+
9
+ class BaseAPI:
10
+
11
+ allowed_types = ['text', 'image']
12
+ INTERLEAVE = True
13
+ INSTALL_REQ = False
14
+
15
+ def __init__(self,
16
+ retry=10,
17
+ wait=3,
18
+ system_prompt=None,
19
+ verbose=True,
20
+ fail_msg='Failed to obtain answer via API.',
21
+ **kwargs):
22
+ """Base Class for all APIs.
23
+
24
+ Args:
25
+ retry (int, optional): The retry times for `generate_inner`. Defaults to 10.
26
+ wait (int, optional): The wait time after each failed retry of `generate_inner`. Defaults to 3.
27
+ system_prompt (str, optional): Defaults to None.
28
+ verbose (bool, optional): Defaults to True.
29
+ fail_msg (str, optional): The message to return when failed to obtain answer.
30
+ Defaults to 'Failed to obtain answer via API.'.
31
+ **kwargs: Other kwargs for `generate_inner`.
32
+ """
33
+
34
+ self.wait = wait
35
+ self.retry = retry
36
+ self.system_prompt = system_prompt
37
+ self.verbose = verbose
38
+ self.fail_msg = fail_msg
39
+ self.logger = get_logger('ChatAPI')
40
+
41
+ if len(kwargs):
42
+ self.logger.info(f'BaseAPI received the following kwargs: {kwargs}')
43
+ self.logger.info('Will try to use them as kwargs for `generate`. ')
44
+ self.default_kwargs = kwargs
45
+
46
+ @abstractmethod
47
+ def generate_inner(self, inputs, **kwargs):
48
+ """The inner function to generate the answer.
49
+
50
+ Returns:
51
+ tuple(int, str, str): ret_code, response, log
52
+ """
53
+ self.logger.warning('For APIBase, generate_inner is an abstract method. ')
54
+ raise NotImplementedError('generate_inner is not defined')
55
+ ret_code, answer, log = None, None, None
56
+ # a ret_code of 0 means the call succeeded
57
+ return ret_code, answer, log
58
+
59
+ def working(self):
60
+ """If the API model is working, return True, else return False.
61
+
62
+ Returns:
63
+ bool: If the API model is working, return True, else return False.
64
+ """
65
+ self.old_timeout = None
66
+ if hasattr(self, 'timeout'):
67
+ self.old_timeout = self.timeout
68
+ self.timeout = 120
69
+
70
+ retry = 5
71
+ while retry > 0:
72
+ ret = self.generate('hello')
73
+ if ret is not None and ret != '' and self.fail_msg not in ret:
74
+ if self.old_timeout is not None:
75
+ self.timeout = self.old_timeout
76
+ return True
77
+ retry -= 1
78
+
79
+ if self.old_timeout is not None:
80
+ self.timeout = self.old_timeout
81
+ return False
82
+
83
+ def check_content(self, msgs):
84
+ """Check the content type of the input. Four types are allowed: str, dict, liststr, listdict.
85
+
86
+ Args:
87
+ msgs: Raw input messages.
88
+
89
+ Returns:
90
+ str: The message type.
91
+ """
92
+ if isinstance(msgs, str):
93
+ return 'str'
94
+ if isinstance(msgs, dict):
95
+ return 'dict'
96
+ if isinstance(msgs, list):
97
+ types = [self.check_content(m) for m in msgs]
98
+ if all(t == 'str' for t in types):
99
+ return 'liststr'
100
+ if all(t == 'dict' for t in types):
101
+ return 'listdict'
102
+ return 'unknown'
103
+
104
+ def preproc_content(self, inputs):
105
+ """Convert the raw input messages to a list of dicts.
106
+
107
+ Args:
108
+ inputs: raw input messages.
109
+
110
+ Returns:
111
+ list(dict): The preprocessed input messages. Will return None if failed to preprocess the input.
112
+ """
113
+ if self.check_content(inputs) == 'str':
114
+ return [dict(type='text', value=inputs)]
115
+ elif self.check_content(inputs) == 'dict':
116
+ assert 'type' in inputs and 'value' in inputs
117
+ return [inputs]
118
+ elif self.check_content(inputs) == 'liststr':
119
+ res = []
120
+ for s in inputs:
121
+ mime, pth = parse_file(s)
122
+ if mime is None or mime == 'unknown':
123
+ res.append(dict(type='text', value=s))
124
+ else:
125
+ res.append(dict(type=mime.split('/')[0], value=pth))
126
+ return res
127
+ elif self.check_content(inputs) == 'listdict':
128
+ for item in inputs:
129
+ assert 'type' in item and 'value' in item
130
+ mime, s = parse_file(item['value'])
131
+ if mime is None:
132
+ assert item['type'] == 'text', item['value']
133
+ else:
134
+ assert mime.split('/')[0] == item['type']
135
+ item['value'] = s
136
+ return inputs
137
+ else:
138
+ return None
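+ # For illustration (hypothetical values): preproc_content('hi') gives
+ # [{'type': 'text', 'value': 'hi'}], and preproc_content(['cat.jpg', 'Describe it'])
+ # gives [{'type': 'image', 'value': 'cat.jpg'}, {'type': 'text', 'value': 'Describe it'}],
+ # provided parse_file recognizes 'cat.jpg' as an image file.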
139
+
140
+ # May exceed the context window size, so retry with progressively fewer turns.
141
+ def chat_inner(self, inputs, **kwargs):
142
+ _ = kwargs.pop('dataset', None)
143
+ while len(inputs):
144
+ try:
145
+ return self.generate_inner(inputs, **kwargs)
146
+ except Exception as e:
147
+ if self.verbose:
148
+ self.logger.info(f'{type(e)}: {e}')
149
+ inputs = inputs[1:]
150
+ while len(inputs) and inputs[0]['role'] != 'user':
151
+ inputs = inputs[1:]
152
+ continue
153
+ return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None
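+ # In other words: when a full conversation overflows the context window, the oldest
+ # turns are dropped (down to the next user turn) and the call is retried, so long
+ # histories degrade gracefully instead of failing outright.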
154
+
155
+ def chat(self, messages, **kwargs1):
156
+ """The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
157
+ assert hasattr(self, 'chat_inner'), 'The API model should have the `chat_inner` method. '
158
+ for msg in messages:
159
+ assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
160
+ assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
161
+ msg['content'] = self.preproc_content(msg['content'])
162
+ # merge kwargs
163
+ kwargs = cp.deepcopy(self.default_kwargs)
164
+ kwargs.update(kwargs1)
165
+
166
+ answer = None
167
+ # a very small random delay [0s - 0.5s]
168
+ T = rd.random() * 0.5
169
+ time.sleep(T)
170
+
171
+ assert messages[-1]['role'] == 'user'
172
+
173
+ for i in range(self.retry):
174
+ try:
175
+ ret_code, answer, log = self.chat_inner(messages, **kwargs)
176
+ if ret_code == 0 and self.fail_msg not in answer and answer != '':
177
+ if self.verbose:
178
+ print(answer)
179
+ return answer
180
+ elif self.verbose:
181
+ if not isinstance(log, str):
182
+ try:
183
+ log = log.text
184
+ except Exception as e:
185
+ self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
186
+ self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
187
+ except Exception as err:
188
+ if self.verbose:
189
+ self.logger.error(f'An error occurred during try {i}: ')
190
+ self.logger.error(f'{type(err)}: {err}')
191
+ # delay before each retry
192
+ T = rd.random() * self.wait * 2
193
+ time.sleep(T)
194
+
195
+ return self.fail_msg if answer in ['', None] else answer
196
+
197
+ def preprocess_message_with_role(self, message):
198
+ system_prompt = ''
199
+ new_message = []
200
+
201
+ for data in message:
202
+ assert isinstance(data, dict)
203
+ role = data.pop('role', 'user')
204
+ if role == 'system':
205
+ system_prompt += data['value'] + '\n'
206
+ else:
207
+ new_message.append(data)
208
+
209
+ if system_prompt != '':
210
+ if self.system_prompt is None:
211
+ self.system_prompt = system_prompt
212
+ else:
213
+ self.system_prompt += '\n' + system_prompt
214
+ return new_message
215
+
216
+ def generate(self, message, **kwargs1):
217
+ """The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.
218
+
219
+ Args:
220
+ message: raw input messages.
221
+
222
+ Returns:
223
+ str: The generated answer, or the fail message if no answer could be obtained.
224
+ """
225
+ if self.check_content(message) == 'listdict':
226
+ message = self.preprocess_message_with_role(message)
227
+
228
+ assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
229
+ message = self.preproc_content(message)
230
+ assert message is not None and self.check_content(message) == 'listdict'
231
+ for item in message:
232
+ assert item['type'] in self.allowed_types, f'Invalid input type: {item["type"]}'
233
+
234
+ # merge kwargs
235
+ kwargs = cp.deepcopy(self.default_kwargs)
236
+ kwargs.update(kwargs1)
237
+
238
+ answer = None
239
+ # a very small random delay [0s - 0.5s]
240
+ T = rd.random() * 0.5
241
+ time.sleep(T)
242
+
243
+ for i in range(self.retry):
244
+ try:
245
+ ret_code, answer, log = self.generate_inner(message, **kwargs)
246
+ if ret_code == 0 and self.fail_msg not in answer and answer != '':
247
+ if self.verbose:
248
+ print(answer)
249
+ return answer
250
+ elif self.verbose:
251
+ if not isinstance(log, str):
252
+ try:
253
+ log = log.text
254
+ except Exception as e:
255
+ self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
256
+ self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
257
+ except Exception as err:
258
+ if self.verbose:
259
+ self.logger.error(f'An error occurred during try {i}: ')
260
+ self.logger.error(f'{type(err)}: {err}')
261
+ # delay before each retry
262
+ T = rd.random() * self.wait * 2
263
+ time.sleep(T)
264
+
265
+ return self.fail_msg if answer in ['', None] else answer
266
+
267
+ def message_to_promptimg(self, message, dataset=None):
268
+ assert not self.INTERLEAVE
269
+ model_name = self.__class__.__name__
270
+ import warnings
271
+ warnings.warn(
272
+ f'Model {model_name} does not support interleaved input. '
273
+ 'Will use the first image and aggregated texts as prompt. ')
274
+ num_images = len([x for x in message if x['type'] == 'image'])
275
+ if num_images == 0:
276
+ prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
277
+ image = None
278
+ elif num_images == 1:
279
+ prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
280
+ image = [x['value'] for x in message if x['type'] == 'image'][0]
281
+ else:
282
+ prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
283
+ if dataset == 'BLINK':
284
+ image = concat_images_vlmeval(
285
+ [x['value'] for x in message if x['type'] == 'image'],
286
+ target_size=512)
287
+ else:
288
+ image = [x['value'] for x in message if x['type'] == 'image'][0]
289
+ return prompt, image
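+
+ # A minimal concrete API only needs `generate_inner` (sketch; `EchoAPI` is hypothetical):
+ # class EchoAPI(BaseAPI):
+ #     def generate_inner(self, inputs, **kwargs):
+ #         text = '\n'.join(x['value'] for x in inputs if x['type'] == 'text')
+ #         return 0, text, 'ok'  # a ret_code of 0 tells `generate` the call succeeded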
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/gpt.py ADDED
@@ -0,0 +1,267 @@
1
+ from ..smp import *
2
+ import os
3
+ import sys
4
+ from .base import BaseAPI
5
+
6
+ APIBASES = {
7
+ 'OFFICIAL': 'https://api.openai.com/v1/chat/completions',
8
+ }
9
+
10
+
11
+ def GPT_context_window(model):
12
+ length_map = {
13
+ 'gpt-4': 8192,
14
+ 'gpt-4-0613': 8192,
15
+ 'gpt-4-turbo-preview': 128000,
16
+ 'gpt-4-1106-preview': 128000,
17
+ 'gpt-4-0125-preview': 128000,
18
+ 'gpt-4-vision-preview': 128000,
19
+ 'gpt-4-turbo': 128000,
20
+ 'gpt-4-turbo-2024-04-09': 128000,
21
+ 'gpt-3.5-turbo': 16385,
22
+ 'gpt-3.5-turbo-0125': 16385,
23
+ 'gpt-3.5-turbo-1106': 16385,
24
+ 'gpt-3.5-turbo-instruct': 4096,
25
+ }
26
+ if model in length_map:
27
+ return length_map[model]
28
+ else:
29
+ return 128000
30
+
31
+
32
+ class OpenAIWrapper(BaseAPI):
33
+
34
+ is_api: bool = True
35
+
36
+ def __init__(self,
37
+ model: str = 'gpt-3.5-turbo-0613',
38
+ retry: int = 5,
39
+ wait: int = 5,
40
+ key: str = None,
41
+ verbose: bool = False,
42
+ system_prompt: str = None,
43
+ temperature: float = 0,
44
+ timeout: int = 60,
45
+ api_base: str = None,
46
+ max_tokens: int = 1024,
47
+ img_size: int = 512,
48
+ img_detail: str = 'low',
49
+ use_azure: bool = False,
50
+ **kwargs):
51
+
52
+ self.model = model
53
+ self.cur_idx = 0
54
+ self.fail_msg = 'Failed to obtain answer via API. '
55
+ self.max_tokens = max_tokens
56
+ self.temperature = temperature
57
+ self.use_azure = use_azure
58
+
59
+ if 'step' in model:
60
+ env_key = os.environ.get('STEPAI_API_KEY', '')
61
+ if key is None:
62
+ key = env_key
63
+ elif 'yi-vision' in model:
64
+ env_key = os.environ.get('YI_API_KEY', '')
65
+ if key is None:
66
+ key = env_key
67
+ elif 'internvl2-pro' in model:
68
+ env_key = os.environ.get('InternVL2_PRO_KEY', '')
69
+ if key is None:
70
+ key = env_key
71
+ elif 'abab' in model:
72
+ env_key = os.environ.get('MiniMax_API_KEY', '')
73
+ if key is None:
74
+ key = env_key
75
+ else:
76
+ if use_azure:
77
+ env_key = os.environ.get('AZURE_OPENAI_API_KEY', None)
78
+ assert env_key is not None, 'Please set the environment variable AZURE_OPENAI_API_KEY. '
79
+
80
+ if key is None:
81
+ key = env_key
82
+ assert isinstance(key, str), (
83
+ 'Please set the environment variable AZURE_OPENAI_API_KEY to your openai key. '
84
+ )
85
+ else:
86
+ env_key = os.environ.get('OPENAI_API_KEY', '')
87
+ if key is None:
88
+ key = env_key
89
+ assert isinstance(key, str) and key.startswith('sk-'), (
90
+ f'Illegal openai_key {key}. '
91
+ 'Please set the environment variable OPENAI_API_KEY to your openai key. '
92
+ )
93
+
94
+ self.key = key
95
+ assert img_size > 0 or img_size == -1
96
+ self.img_size = img_size
97
+ assert img_detail in ['high', 'low']
98
+ self.img_detail = img_detail
99
+ self.timeout = timeout
100
+
101
+ super().__init__(wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)
102
+
103
+ if use_azure:
104
+ api_base_template = (
105
+ '{endpoint}openai/deployments/{deployment_name}/chat/completions?api-version={api_version}'
106
+ )
107
+ endpoint = os.getenv('AZURE_OPENAI_ENDPOINT', None)
108
+ assert endpoint is not None, 'Please set the environment variable AZURE_OPENAI_ENDPOINT. '
109
+ deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME', None)
110
+ assert deployment_name is not None, 'Please set the environment variable AZURE_OPENAI_DEPLOYMENT_NAME. '
111
+ api_version = os.getenv('OPENAI_API_VERSION', None)
112
+ assert api_version is not None, 'Please set the environment variable OPENAI_API_VERSION. '
113
+
114
+ self.api_base = api_base_template.format(
115
+ endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
116
+ deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
117
+ api_version=os.getenv('OPENAI_API_VERSION')
118
+ )
119
+ else:
120
+ if api_base is None:
121
+ if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
122
+ self.logger.info('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
123
+ api_base = os.environ['OPENAI_API_BASE']
124
+ else:
125
+ api_base = 'OFFICIAL'
126
+
127
+ assert api_base is not None
128
+
129
+ if api_base in APIBASES:
130
+ self.api_base = APIBASES[api_base]
131
+ elif api_base.startswith('http'):
132
+ self.api_base = api_base
133
+ else:
134
+ self.logger.error('Unknown API Base. ')
135
+ raise NotImplementedError
136
+
137
+ self.logger.info(f'Using API Base: {self.api_base}; API Key: {str(self.key)[:4]}****')
138
+
139
+ # inputs can be a lvl-2 nested list: [content1, content2, content3, ...]
140
+ # content can be a string or a list of image & text
141
+ def prepare_itlist(self, inputs):
142
+ assert np.all([isinstance(x, dict) for x in inputs])
143
+ has_images = np.sum([x['type'] == 'image' for x in inputs])
144
+ if has_images:
145
+ content_list = []
146
+ for msg in inputs:
147
+ if msg['type'] == 'text':
148
+ content_list.append(dict(type='text', text=msg['value']))
149
+ elif msg['type'] == 'image':
150
+ from PIL import Image
151
+ img = Image.open(msg['value'])
152
+ b64 = encode_image_to_base64(img, target_size=self.img_size)
153
+ img_struct = dict(url=f'data:image/jpeg;base64,{b64}', detail=self.img_detail)
154
+ content_list.append(dict(type='image_url', image_url=img_struct))
155
+ else:
156
+ assert all([x['type'] == 'text' for x in inputs])
157
+ text = '\n'.join([x['value'] for x in inputs])
158
+ content_list = [dict(type='text', text=text)]
159
+ return content_list
160
+
161
+ def prepare_inputs(self, inputs):
162
+ input_msgs = []
163
+ if self.system_prompt is not None:
164
+ input_msgs.append(dict(role='system', content=self.system_prompt))
165
+ assert isinstance(inputs, list) and isinstance(inputs[0], dict)
166
+ assert np.all(['type' in x for x in inputs]) or np.all(['role' in x for x in inputs]), inputs
167
+ if 'role' in inputs[0]:
168
+ assert inputs[-1]['role'] == 'user', inputs[-1]
169
+ for item in inputs:
170
+ input_msgs.append(dict(role=item['role'], content=self.prepare_itlist(item['content'])))
171
+ else:
172
+ input_msgs.append(dict(role='user', content=self.prepare_itlist(inputs)))
173
+ return input_msgs
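+ # For illustration (hypothetical input): [dict(type='image', value='cat.jpg'),
+ # dict(type='text', value='What is this?')] becomes one user message whose content
+ # holds a base64 `image_url` entry followed by a `text` entry.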
174
+
175
+ def generate_inner(self, inputs, **kwargs) -> str:
176
+ input_msgs = self.prepare_inputs(inputs)
177
+ temperature = kwargs.pop('temperature', self.temperature)
178
+ max_tokens = kwargs.pop('max_tokens', self.max_tokens)
179
+
180
+ # context_window = GPT_context_window(self.model)
181
+ # new_max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
182
+ # if 0 < new_max_tokens <= 100 and new_max_tokens < max_tokens:
183
+ # self.logger.warning(
184
+ # 'Less than 100 tokens left, '
185
+ # 'may exceed the context window with some additional meta symbols. '
186
+ # )
187
+ # if new_max_tokens <= 0:
188
+ # return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
189
+ # max_tokens = new_max_tokens
190
+
191
+ # When using Azure, send the HTTP request manually; the OpenAI client is not used for this path
192
+ if self.use_azure:
193
+ headers = {'Content-Type': 'application/json', 'api-key': self.key}
194
+ elif 'internvl2-pro' in self.model:
195
+ headers = {'Content-Type': 'application/json', 'Authorization': self.key}
196
+ else:
197
+ headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
198
+ payload = dict(
199
+ model=self.model,
200
+ messages=input_msgs,
201
+ max_tokens=max_tokens,
202
+ n=1,
203
+ temperature=temperature,
204
+ **kwargs)
205
+ response = requests.post(
206
+ self.api_base,
207
+ headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
208
+ ret_code = response.status_code
209
+ ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
210
+ answer = self.fail_msg
211
+ try:
212
+ resp_struct = json.loads(response.text)
213
+ answer = resp_struct['choices'][0]['message']['content'].strip()
214
+ except Exception as err:
215
+ if self.verbose:
216
+ self.logger.error(f'{type(err)}: {err}')
217
+ self.logger.error(response.text if hasattr(response, 'text') else response)
218
+
219
+ return ret_code, answer, response
220
+
221
+ def get_image_token_len(self, img_path, detail='low'):
222
+ import math
223
+ if detail == 'low':
224
+ return 85
225
+
226
+ im = Image.open(img_path)
227
+ width, height = im.size  # PIL's Image.size is (width, height)
228
+ if width > 1024 or height > 1024:
229
+ if width > height:
230
+ height = int(height * 1024 / width)
231
+ width = 1024
232
+ else:
233
+ width = int(width * 1024 / height)
234
+ height = 1024
235
+
236
+ h = math.ceil(height / 512)
237
+ w = math.ceil(width / 512)
238
+ total = 85 + 170 * h * w
239
+ return total
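+ # Worked example: a 2048x1536 image in 'high' detail is first scaled to 1024x768,
+ # so w = ceil(1024 / 512) = 2 and h = ceil(768 / 512) = 2,
+ # giving total = 85 + 170 * 2 * 2 = 765 tokens.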
240
+
241
+ def get_token_len(self, inputs) -> int:
242
+ import tiktoken
243
+ try:
244
+ enc = tiktoken.encoding_for_model(self.model)
245
+ except Exception as err:
246
+ if 'gpt' in self.model.lower():
247
+ if self.verbose:
248
+ self.logger.warning(f'{type(err)}: {err}')
249
+ enc = tiktoken.encoding_for_model('gpt-4')
250
+ else:
251
+ return 0
252
+ assert isinstance(inputs, list)
253
+ tot = 0
254
+ for item in inputs:
255
+ if 'role' in item:
256
+ tot += self.get_token_len(item['content'])
257
+ elif item['type'] == 'text':
258
+ tot += len(enc.encode(item['value']))
259
+ elif item['type'] == 'image':
260
+ tot += self.get_image_token_len(item['value'], detail=self.img_detail)
261
+ return tot
262
+
263
+
264
+ class GPT4V(OpenAIWrapper):
265
+
266
+ def generate(self, message, dataset=None):
267
+ return super(GPT4V, self).generate(message)
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/config.py ADDED
@@ -0,0 +1,20 @@
1
+ from vlmeval.vlm import *
2
+ from vlmeval.api import *
3
+ from functools import partial
4
+
5
+ minicpm_series = {
6
+ 'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
7
+ 'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
8
+ 'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
9
+ 'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
10
+ 'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
11
+ }
12
+
13
+ supported_VLM = {}
14
+
15
+ model_groups = [
16
+ minicpm_series
17
+ ]
18
+
19
+ for grp in model_groups:
20
+ supported_VLM.update(grp)
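+
+ # The registry maps names to constructors, so model loading is deferred until first
+ # use, e.g. supported_VLM['MiniCPM-o-2_6']() instantiates MiniCPM_o_2_6 on demand.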
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/dataset/__init__.py ADDED
@@ -0,0 +1,237 @@
1
+ import warnings
2
+
3
+ from .image_base import img_root_map, ImageBaseDataset
4
+ from .image_caption import ImageCaptionDataset
5
+ from .image_yorn import ImageYORNDataset
6
+ from .image_mcq import (
7
+ ImageMCQDataset, MMMUDataset, CustomMCQDataset, MUIRDataset, GMAIMMBenchDataset, MMERealWorld, HRBenchDataset,
8
+ NaturalBenchDataset
9
+ )
10
+ from .image_mt import MMDUDataset
11
+ from .image_vqa import (
12
+ ImageVQADataset, MathVision, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
13
+ CustomVQADataset, CRPE, MathVerse, OlympiadBench, QSpatial, VizWiz, MMNIAH, WeMath, LogicVista
14
+ )
15
+
16
+ from .image_ccocr import CCOCRDataset
17
+ from .text_mcq import CustomTextMCQDataset, TextMCQDataset
18
+
19
+ from .vcr import VCRDataset
20
+ from .mmlongbench import MMLongBench
21
+ from .dude import DUDE
22
+ from .slidevqa import SlideVQA
23
+ from .vl_rewardbench import VLRewardBench
24
+
25
+ from .mmbench_video import MMBenchVideo
26
+ from .videomme import VideoMME
27
+ from .mvbench import MVBench, MVBench_MP4
28
+ from .mlvu import MLVU, MLVU_MCQ, MLVU_OpenEnded
29
+ from .tempcompass import TempCompass, TempCompass_Captioning, TempCompass_MCQ, TempCompass_YorN
30
+ from .longvideobench import LongVideoBench
31
+ from .video_concat_dataset import ConcatVideoDataset
32
+ from .mmgenbench import MMGenBench
33
+ from .cgbench import CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
34
+
35
+ from .miabench import MIABench
36
+ from .cmmmu import CMMMU
37
+ from .wildvision import WildVision
38
+ from .mmmath import MMMath
39
+ from .dynamath import Dynamath
40
+ from .utils import *
41
+ from .video_dataset_config import *
42
+ from ..smp import *
43
+
44
+
45
+ class ConcatDataset(ImageBaseDataset):
46
+ # This dataset takes multiple dataset names as input and aggregates them into a single dataset.
47
+ # Each single dataset should not have a field named `SUB_DATASET`
48
+
49
+ DATASET_SETS = {
50
+ 'MMMB': ['MMMB_ar', 'MMMB_cn', 'MMMB_en', 'MMMB_pt', 'MMMB_ru', 'MMMB_tr'],
51
+ 'MTL_MMBench_DEV': [
52
+ 'MMBench_dev_ar', 'MMBench_dev_cn', 'MMBench_dev_en',
53
+ 'MMBench_dev_pt', 'MMBench_dev_ru', 'MMBench_dev_tr'
54
+ ]
55
+ }
56
+
57
+ def __init__(self, dataset):
58
+ datasets = self.DATASET_SETS[dataset]
59
+ self.dataset_map = {}
60
+ # The name of the compilation
61
+ self.dataset_name = dataset
62
+ self.datasets = datasets
63
+ for dname in datasets:
64
+ dataset = build_dataset(dname)
65
+ assert dataset is not None, dataset
66
+ self.dataset_map[dname] = dataset
67
+ TYPES = [x.TYPE for x in self.dataset_map.values()]
68
+ MODALITIES = [x.MODALITY for x in self.dataset_map.values()]
69
+ assert np.all([x == TYPES[0] for x in TYPES]), (datasets, TYPES)
70
+ assert np.all([x == MODALITIES[0] for x in MODALITIES]), (datasets, MODALITIES)
71
+ self.TYPE = TYPES[0]
72
+ self.MODALITY = MODALITIES[0]
73
+ data_all = []
74
+ for dname in datasets:
75
+ data = self.dataset_map[dname].data
76
+ data['SUB_DATASET'] = [dname] * len(data)
77
+ data_new = localize_df(data, dname, nproc=16)
78
+ data_all.append(data_new)
79
+
80
+ data = pd.concat(data_all)
81
+ data['original_index'] = data.pop('index')
82
+ data['index'] = np.arange(len(data))
83
+ self.data = data
84
+
85
+ def build_prompt(self, line):
86
+ if isinstance(line, int):
87
+ line = self.data.iloc[line]
88
+ idx = line['original_index']
89
+ dname = line['SUB_DATASET']
90
+ org_data = self.dataset_map[dname].data
91
+ org_line = cp.deepcopy(org_data[org_data['index'] == idx]).iloc[0]
92
+ return self.dataset_map[dname].build_prompt(org_line)
93
+
94
+ def dump_image(self, line):
95
+ # Assert all images are pre-dumped
96
+ assert 'image' not in line
97
+ assert 'image_path' in line
98
+ tgt_path = toliststr(line['image_path'])
99
+ return tgt_path
100
+
101
+ @classmethod
102
+ def supported_datasets(cls):
103
+ return list(cls.DATASET_SETS)
104
+
105
+ def evaluate(self, eval_file, **judge_kwargs):
106
+ suffix = eval_file.split('.')[-1]
107
+ # First, split the eval_file by dataset
108
+ data_all = load(eval_file)
109
+ for dname in self.datasets:
110
+ tgt = eval_file.replace(self.dataset_name, dname)
111
+ data_sub = data_all[data_all['SUB_DATASET'] == dname]
112
+ data_sub.pop('index')
113
+ data_sub['index'] = data_sub.pop('original_index')
114
+ data_sub.pop('SUB_DATASET')
115
+ dump(data_sub, tgt)
116
+ # Then, evaluate each dataset separately
117
+ results_all = []
118
+ for dname in self.datasets:
119
+ tgt = eval_file.replace(self.dataset_name, dname)
120
+ res = self.dataset_map[dname].evaluate(tgt, **judge_kwargs)
121
+ assert isinstance(res, pd.DataFrame)
122
+ res['DATASET'] = [dname] * len(res)
123
+ results_all.append(res)
124
+ result = pd.concat(results_all)
125
+ score_file = eval_file.replace(f'.{suffix}', '_acc.csv')
126
+ dump(result, score_file)
127
+ return result
128
+
129
+
130
+ # Add new supported dataset class here
131
+ IMAGE_DATASET = [
132
+ ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset, MathVision,
133
+ MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
134
+ MMLongBench, VCRDataset, MMDUDataset, DUDE, SlideVQA, MUIRDataset, CCOCRDataset,
135
+ GMAIMMBenchDataset, MMERealWorld, HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset,
136
+ MIABench, OlympiadBench, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz, MMNIAH,
137
+ CMMMU, VLRewardBench, WeMath, LogicVista
138
+ ]
139
+
140
+ VIDEO_DATASET = [
141
+ MMBenchVideo, VideoMME, MVBench, MVBench_MP4, LongVideoBench,
142
+ MLVU, MLVU_MCQ, MLVU_OpenEnded,
143
+ TempCompass, TempCompass_MCQ, TempCompass_Captioning, TempCompass_YorN,
144
+ CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
145
+ ]
146
+
147
+ TEXT_DATASET = [
148
+ TextMCQDataset
149
+ ]
150
+
151
+ CUSTOM_DATASET = [
152
+ CustomMCQDataset, CustomVQADataset, CustomTextMCQDataset
153
+ ]
154
+
155
+ DATASET_COLLECTION = [ConcatDataset, ConcatVideoDataset]
156
+
157
+ DATASET_CLASSES = IMAGE_DATASET + VIDEO_DATASET + TEXT_DATASET + CUSTOM_DATASET + DATASET_COLLECTION
158
+ SUPPORTED_DATASETS = []
159
+ for DATASET_CLS in DATASET_CLASSES:
160
+ SUPPORTED_DATASETS.extend(DATASET_CLS.supported_datasets())
161
+
162
+
163
+ def DATASET_TYPE(dataset, *, default: str = 'MCQ') -> str:
164
+ for cls in DATASET_CLASSES:
165
+ if dataset in cls.supported_datasets():
166
+ if hasattr(cls, 'TYPE'):
167
+ return cls.TYPE
168
+ # Have to add specific routine to handle ConcatDataset
169
+ if dataset in ConcatDataset.DATASET_SETS:
170
+ dataset_list = ConcatDataset.DATASET_SETS[dataset]
171
+ TYPES = [DATASET_TYPE(dname) for dname in dataset_list]
172
+ assert np.all([x == TYPES[0] for x in TYPES]), (dataset_list, TYPES)
173
+ return TYPES[0]
174
+
175
+ if 'openended' in dataset.lower():
176
+ return 'VQA'
177
+ warnings.warn(f'Dataset {dataset} is a custom one and not annotated as `openended`, will treat as {default}. ')
178
+ return default
179
+
180
+
181
+ def DATASET_MODALITY(dataset, *, default: str = 'IMAGE') -> str:
182
+ if dataset is None:
183
+ warnings.warn(f'Dataset is not specified, will treat modality as {default}. ')
184
+ return default
185
+ for cls in DATASET_CLASSES:
186
+ if dataset in cls.supported_datasets():
187
+ if hasattr(cls, 'MODALITY'):
188
+ return cls.MODALITY
189
+ # Have to add specific routine to handle ConcatDataset
190
+ if dataset in ConcatDataset.DATASET_SETS:
191
+ dataset_list = ConcatDataset.DATASET_SETS[dataset]
192
+ MODALITIES = [DATASET_MODALITY(dname) for dname in dataset_list]
193
+ assert np.all([x == MODALITIES[0] for x in MODALITIES]), (dataset_list, MODALITIES)
194
+ return MODALITIES[0]
195
+
196
+ if 'video' in dataset.lower():
197
+ return 'VIDEO'
198
+ elif 'image' in dataset.lower():
199
+ return 'IMAGE'
200
+ warnings.warn(f'Dataset {dataset} is a custom one, will treat modality as {default}. ')
201
+ return default
202
+
203
+
204
+ def build_dataset(dataset_name, **kwargs):
205
+ # Shortcut video datasets (see video_dataset_config) take precedence
+ if dataset_name in supported_video_datasets:
+ return supported_video_datasets[dataset_name](**kwargs)
+ for cls in DATASET_CLASSES:
+ if dataset_name in cls.supported_datasets():
+ return cls(dataset=dataset_name, **kwargs)
210
+
211
+ warnings.warn(f'Dataset {dataset_name} is not officially supported. ')
212
+
213
+ data_file = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
214
+ if not osp.exists(data_file):
215
+ warnings.warn(f'Data file {data_file} does not exist. Dataset building failed. ')
216
+ return None
217
+
218
+ data = load(data_file)
219
+ if 'question' not in [x.lower() for x in data.columns]:
220
+ warnings.warn(f'Data file {data_file} does not have a `question` column. Dataset building failed. ')
221
+ return None
222
+
223
+ if 'A' in data and 'B' in data:
224
+ if 'image' in data or 'image_path' in data:
225
+ warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom MCQ dataset. ')
226
+ return CustomMCQDataset(dataset=dataset_name, **kwargs)
227
+ else:
228
+ warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom Text MCQ dataset. ')
229
+ return CustomTextMCQDataset(dataset=dataset_name, **kwargs)
230
+ else:
231
+ warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom VQA dataset. ')
232
+ return CustomVQADataset(dataset=dataset_name, **kwargs)
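+ # For illustration: a hypothetical MyBench.tsv under LMUDataRoot() with a `question`
+ # column and no `A`/`B` choice columns is picked up by this fallback as a Custom VQA dataset.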
233
+
234
+
235
+ __all__ = [
236
+ 'build_dataset', 'img_root_map', 'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
237
+ ] + [cls.__name__ for cls in DATASET_CLASSES]
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference.py ADDED
@@ -0,0 +1,188 @@
1
+ import torch
2
+ import torch.distributed as dist
3
+ from vlmeval.config import supported_VLM
4
+ from vlmeval.utils import track_progress_rich
5
+ from vlmeval.smp import *
6
+
7
+ FAIL_MSG = 'Failed to obtain answer via API.'
8
+
9
+
10
+ def parse_args():
11
+ parser = argparse.ArgumentParser()
12
+ parser.add_argument('--data', type=str, nargs='+', required=True)
13
+ parser.add_argument('--model', type=str, nargs='+', required=True)
14
+ parser.add_argument('--nproc', type=int, default=4, required=True)
15
+ parser.add_argument('--verbose', action='store_true')
16
+ args = parser.parse_args()
17
+ return args
18
+
19
+
20
+ # Only API model is accepted
21
+ def infer_data_api(model, work_dir, model_name, dataset, index_set=None, api_nproc=4, ignore_failed=False):
22
+ rank, world_size = get_rank_and_world_size()
23
+ assert rank == 0 and world_size == 1
24
+ dataset_name = dataset.dataset_name
25
+ data = dataset.data
26
+ if index_set is not None:
27
+ data = data[data['index'].isin(index_set)]
28
+
29
+ model = supported_VLM[model_name]() if isinstance(model, str) else model
30
+ assert getattr(model, 'is_api', False)
31
+ if hasattr(model, 'set_dump_image'):
32
+ model.set_dump_image(dataset.dump_image)
33
+
34
+ lt, indices = len(data), list(data['index'])
35
+
36
+ structs = []
37
+ for i in range(lt):
38
+ item = data.iloc[i]
39
+ if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
40
+ assert hasattr(model, 'build_prompt')
41
+ struct = model.build_prompt(item, dataset=dataset_name)
42
+ else:
43
+ struct = dataset.build_prompt(item)
44
+ structs.append(struct)
45
+
46
+ # structs = [dataset.build_prompt(data.iloc[i]) for i in range(lt)]
47
+
48
+ out_file = f'{work_dir}/{model_name}_{dataset_name}_supp.pkl'
49
+ res = {}
50
+ if osp.exists(out_file):
51
+ res = load(out_file)
52
+ if ignore_failed:
53
+ res = {k: v for k, v in res.items() if FAIL_MSG not in v}
54
+
55
+ structs = [s for i, s in zip(indices, structs) if i not in res]
56
+ indices = [i for i in indices if i not in res]
57
+
58
+ gen_func = model.generate
59
+ structs = [dict(message=struct, dataset=dataset_name) for struct in structs]
60
+
61
+ if len(structs):
62
+ track_progress_rich(gen_func, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)
63
+
64
+ res = load(out_file)
65
+ if index_set is not None:
66
+ res = {k: v for k, v in res.items() if k in index_set}
67
+ os.remove(out_file)
68
+ return res
69
+
70
+
71
+ def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
72
+ dataset_name = dataset.dataset_name
73
+ prev_file = f'{work_dir}/{model_name}_{dataset_name}_PREV.pkl'
74
+ res = load(prev_file) if osp.exists(prev_file) else {}
75
+ if osp.exists(out_file):
76
+ res.update(load(out_file))
77
+
78
+ rank, world_size = get_rank_and_world_size()
79
+ sheet_indices = list(range(rank, len(dataset), world_size))
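+ # e.g. with world_size=2, rank 0 handles dataset rows 0, 2, 4, ... and rank 1 handles
+ # rows 1, 3, 5, ..., so each process infers a disjoint stripe of the data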
80
+ lt = len(sheet_indices)
81
+ data = dataset.data.iloc[sheet_indices]
82
+ data_indices = [i for i in data['index']]
83
+
84
+ # If finished, will exit without building the model
85
+ all_finished = True
86
+ for i in range(lt):
87
+ idx = data.iloc[i]['index']
88
+ if idx not in res:
89
+ all_finished = False
90
+ if all_finished:
91
+ res = {k: res[k] for k in data_indices}
92
+ dump(res, out_file)
93
+ return
94
+
95
+ # Data need to be inferred
96
+ data = data[~data['index'].isin(res)]
97
+ lt = len(data)
98
+
99
+ model = supported_VLM[model_name]() if isinstance(model, str) else model
100
+
101
+ is_api = getattr(model, 'is_api', False)
102
+ if is_api:
103
+ lt, indices = len(data), list(data['index'])
104
+ supp = infer_data_api(
105
+ model=model,
106
+ work_dir=work_dir,
107
+ model_name=model_name,
108
+ dataset=dataset,
109
+ index_set=set(indices),
110
+ api_nproc=api_nproc)
111
+ for idx in indices:
112
+ assert idx in supp
113
+ res.update(supp)
114
+ res = {k: res[k] for k in data_indices}
115
+ dump(res, out_file)
116
+ return model
117
+ else:
118
+ model.set_dump_image(dataset.dump_image)
119
+
120
+ for i in tqdm(range(lt)):
121
+ idx = data.iloc[i]['index']
122
+ if idx in res:
123
+ continue
124
+
125
+ if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
126
+ struct = model.build_prompt(data.iloc[i], dataset=dataset_name)
127
+ else:
128
+ struct = dataset.build_prompt(data.iloc[i])
129
+
130
+ response = model.generate(message=struct, dataset=dataset_name)
131
+ torch.cuda.empty_cache()
132
+
133
+ if verbose:
134
+ print(response, flush=True)
135
+
136
+ res[idx] = response
137
+ if (i + 1) % 10 == 0:
138
+ dump(res, out_file)
139
+
140
+ res = {k: res[k] for k in data_indices}
141
+ dump(res, out_file)
142
+ return model
143
+
144
+
145
+ # A wrapper for infer_data, do the pre & post processing
146
+ def infer_data_job(model, work_dir, model_name, dataset, verbose=False, api_nproc=4, ignore_failed=False):
147
+ rank, world_size = get_rank_and_world_size()
148
+ dataset_name = dataset.dataset_name
149
+ result_file = osp.join(work_dir, f'{model_name}_{dataset_name}.xlsx')
150
+
151
+ prev_file = f'{work_dir}/{model_name}_{dataset_name}_PREV.pkl'
152
+ if osp.exists(result_file):
153
+ if rank == 0:
154
+ data = load(result_file)
155
+ results = {k: v for k, v in zip(data['index'], data['prediction'])}
156
+ if not ignore_failed:
157
+ results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
158
+ dump(results, prev_file)
159
+ if world_size > 1:
160
+ dist.barrier()
161
+
162
+ tmpl = osp.join(work_dir, '{}' + f'{world_size}_{dataset_name}.pkl')
163
+ out_file = tmpl.format(rank)
164
+
165
+ model = infer_data(
166
+ model=model, work_dir=work_dir, model_name=model_name, dataset=dataset,
167
+ out_file=out_file, verbose=verbose, api_nproc=api_nproc)
168
+ if world_size > 1:
169
+ dist.barrier()
170
+
171
+ if rank == 0:
172
+ data_all = {}
173
+ for i in range(world_size):
174
+ data_all.update(load(tmpl.format(i)))
175
+
176
+ data = dataset.data
177
+ for x in data['index']:
178
+ assert x in data_all
179
+ data['prediction'] = [str(data_all[x]) for x in data['index']]
180
+ if 'image' in data:
181
+ data.pop('image')
182
+
183
+ dump(data, result_file)
184
+ for i in range(world_size):
185
+ os.remove(tmpl.format(i))
186
+ if world_size > 1:
187
+ dist.barrier()
188
+ return model
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_mt.py ADDED
@@ -0,0 +1,182 @@
1
+ import torch
2
+ import torch.distributed as dist
3
+ from vlmeval.config import supported_VLM
4
+ from vlmeval.utils import track_progress_rich
5
+ from vlmeval.smp import *
6
+
7
+ FAIL_MSG = 'Failed to obtain answer via API.'
8
+
9
+
10
+ def parse_args():
11
+ parser = argparse.ArgumentParser()
12
+ parser.add_argument('--data', type=str, nargs='+', required=True)
13
+ parser.add_argument('--model', type=str, nargs='+', required=True)
14
+ parser.add_argument('--nproc', type=int, default=4, required=True)
15
+ parser.add_argument('--verbose', action='store_true')
16
+ args = parser.parse_args()
17
+ return args
18
+
19
+
20
+ def chat_mt(model, messages, dataset_name):
21
+ assert len(messages) % 2 == 0
22
+ nturn = len(messages) // 2
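+ # messages holds 2 * nturn alternating user/assistant entries; each user turn is
+ # answered in order, and the model's own replies (not the reference ones) are kept
+ # on utter_stack as context for the following turns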
23
+ utter_stack = []
24
+ predictions = []
25
+
26
+ for i in range(nturn):
27
+ utter = messages[2 * i]
28
+ utter_stack.append(utter)
29
+ try:
30
+ resp = model.chat(utter_stack, dataset=dataset_name)
31
+ utter_stack.append(dict(role='assistant', content=resp))
32
+ except Exception as e:
33
+ resp = FAIL_MSG + str(e)
34
+ utter_stack.append(dict(role='assistant', content=resp))
35
+ predictions.append(resp)
36
+ return predictions
37
+
38
+
39
+ # Only API model is accepted
40
+ def infer_data_api(model, work_dir, model_name, dataset, index_set=None, api_nproc=4, ignore_failed=False):
41
+ rank, world_size = get_rank_and_world_size()
42
+ assert rank == 0 and world_size == 1
43
+ dataset_name = dataset.dataset_name
44
+ data = dataset.data
45
+ if index_set is not None:
46
+ data = data[data['index'].isin(index_set)]
47
+
48
+ model = supported_VLM[model_name]() if isinstance(model, str) else model
49
+ assert getattr(model, 'is_api', False)
50
+ assert hasattr(model, 'chat_inner')
51
+
52
+ lt, indices = len(data), list(data['index'])
53
+ structs = [dataset.build_prompt(data.iloc[i]) for i in range(lt)]
54
+
55
+ out_file = f'{work_dir}/{model_name}_{dataset_name}_supp.pkl'
56
+ res = {}
57
+ if osp.exists(out_file):
58
+ res = load(out_file)
59
+ if ignore_failed:
60
+ res = {k: v for k, v in res.items() if FAIL_MSG not in v}
61
+
62
+ structs = [s for i, s in zip(indices, structs) if i not in res]
63
+ indices = [i for i in indices if i not in res]
64
+
65
+ structs = [dict(model=model, messages=struct, dataset_name=dataset_name) for struct in structs]
66
+
67
+ if len(structs):
68
+ track_progress_rich(chat_mt, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)
69
+
70
+ res = load(out_file)
71
+ if index_set is not None:
72
+ res = {k: v for k, v in res.items() if k in index_set}
73
+ os.remove(out_file)
74
+ return res
75
+
76
+
77
+ def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
78
+ dataset_name = dataset.dataset_name
79
+ res = {}
80
+ if osp.exists(out_file):
81
+ res.update(load(out_file))
82
+
83
+ rank, world_size = get_rank_and_world_size()
84
+ sheet_indices = list(range(rank, len(dataset), world_size))
85
+ lt = len(sheet_indices)
86
+ data = dataset.data.iloc[sheet_indices]
87
+ data_indices = [i for i in data['index']]
88
+
89
+ # If finished, will exit without building the model
90
+ all_finished = True
91
+ for i in range(lt):
92
+ idx = data.iloc[i]['index']
93
+ if idx not in res:
94
+ all_finished = False
95
+ if all_finished:
96
+ res = {k: res[k] for k in data_indices}
97
+ dump(res, out_file)
98
+ return
99
+
100
+ # Data need to be inferred
101
+ data = data[~data['index'].isin(res)]
102
+ lt = len(data)
103
+
104
+ model = supported_VLM[model_name]() if isinstance(model, str) else model
105
+ assert hasattr(model, 'chat_inner')
106
+
107
+ is_api = getattr(model, 'is_api', False)
108
+ if is_api:
109
+ lt, indices = len(data), list(data['index'])
110
+ supp = infer_data_api(
111
+ model=model,
112
+ work_dir=work_dir,
113
+ model_name=model_name,
114
+ dataset=dataset,
115
+ index_set=set(indices),
116
+ api_nproc=api_nproc)
117
+ for idx in indices:
118
+ assert idx in supp
119
+ res.update(supp)
120
+ res = {k: res[k] for k in data_indices}
121
+ dump(res, out_file)
122
+ return model
123
+ else:
124
+ model.set_dump_image(dataset.dump_image)
125
+
126
+ for i in tqdm(range(lt)):
127
+ idx = data.iloc[i]['index']
128
+ if idx in res:
129
+ continue
130
+
131
+ if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
132
+ struct = model.build_prompt(data.iloc[i], dataset=dataset_name)
133
+ else:
134
+ struct = dataset.build_prompt(data.iloc[i])
135
+
136
+ response = chat_mt(model, struct, dataset_name)
137
+ torch.cuda.empty_cache()
138
+
139
+ if verbose:
140
+ print(response, flush=True)
141
+
142
+ res[idx] = response
143
+ if (i + 1) % 20 == 0:
144
+ dump(res, out_file)
145
+
146
+ res = {k: res[k] for k in data_indices}
147
+ dump(res, out_file)
148
+ return model
149
+
150
+
151
+ # A wrapper for infer_data, do the pre & post processing
152
+ def infer_data_job_mt(model, work_dir, model_name, dataset, verbose=False, api_nproc=4, ignore_failed=False):
153
+ rank, world_size = get_rank_and_world_size()
154
+ dataset_name = dataset.dataset_name
155
+ result_file = osp.join(work_dir, f'{model_name}_{dataset_name}.tsv')
156
+
157
+ tmpl = osp.join(work_dir, '{}' + f'{world_size}_{dataset_name}.pkl')
158
+ out_file = tmpl.format(rank)
159
+
160
+ model = infer_data(
161
+ model=model, model_name=model_name,work_dir=work_dir, dataset=dataset,
162
+ out_file=out_file, verbose=verbose, api_nproc=api_nproc)
163
+ if world_size > 1:
164
+ dist.barrier()
165
+
166
+ if rank == 0:
167
+ data_all = {}
168
+ for i in range(world_size):
169
+ data_all.update(load(tmpl.format(i)))
170
+
171
+ data = dataset.data
172
+ for x in data['index']:
173
+ assert x in data_all
174
+
175
+ data['prediction'] = [data_all[x] for x in data['index']]
176
+ if 'image' in data:
177
+ data.pop('image')
178
+
179
+ dump(data, result_file)
180
+ for i in range(world_size):
181
+ os.remove(tmpl.format(i))
182
+ return model
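A note on the multi-turn format consumed by `chat_mt` above: it asserts an even-length message list in which positions 0, 2, 4, ... are user turns, and it builds the assistant side itself by appending each `model.chat(...)` reply to the running `utter_stack`, so every turn is conditioned on the full history. A minimal sketch of such an input (all content strings are invented for illustration):

```python
# Hedged sketch of chat_mt's expected input: messages[2*i] is the i-th user
# turn; the odd slots only pad the list to an even length, since chat_mt
# replaces the assistant side with the model's own replies as it goes.
messages = [
    dict(role='user', content=[dict(type='image', value='demo.jpg'),
                               dict(type='text', value='Describe this image.')]),
    dict(role='assistant', content=''),
    dict(role='user', content=[dict(type='text', value='Now count the objects in it.')]),
    dict(role='assistant', content=''),
]
# predictions = chat_mt(model, messages, 'MMDU')  # one prediction per user turn
```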
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_video.py ADDED
@@ -0,0 +1,183 @@
+ import torch
+ import torch.distributed as dist
+ from vlmeval.config import supported_VLM
+ from vlmeval.utils import track_progress_rich
+ from vlmeval.smp import *
+
+ FAIL_MSG = 'Failed to obtain answer via API.'
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--data', type=str, nargs='+', required=True)
+     parser.add_argument('--model', type=str, nargs='+', required=True)
+     parser.add_argument('--nproc', type=int, default=4, required=True)
+     parser.add_argument('--verbose', action='store_true')
+     args = parser.parse_args()
+     return args
+
+
+ # Only API models are accepted
+ def infer_data_api(model, work_dir, model_name, dataset, samples_dict={}, api_nproc=4):
+     rank, world_size = get_rank_and_world_size()
+     assert rank == 0 and world_size == 1
+     dataset_name = dataset.dataset_name
+     model = supported_VLM[model_name]() if isinstance(model, str) else model
+     assert getattr(model, 'is_api', False)
+
+     indices = list(samples_dict.keys())
+     structs = [dataset.build_prompt(samples_dict[idx], video_llm=getattr(model, 'VIDEO_LLM', False)) for idx in indices]
+
+     packstr = 'pack' if getattr(dataset, 'pack', False) else 'nopack'
+     if dataset.nframe > 0:
+         out_file = f'{work_dir}/{model_name}_{dataset_name}_{dataset.nframe}frame_{packstr}_supp.pkl'
+     else:
+         out_file = f'{work_dir}/{model_name}_{dataset_name}_{dataset.fps}fps_{packstr}_supp.pkl'
+     res = load(out_file) if osp.exists(out_file) else {}
+
+     structs = [s for i, s in zip(indices, structs) if i not in res or res[i] == FAIL_MSG]
+     indices = [i for i in indices if i not in res or res[i] == FAIL_MSG]
+
+     gen_func = model.generate
+     structs = [dict(message=struct, dataset=dataset_name) for struct in structs]
+
+     if len(structs):
+         track_progress_rich(gen_func, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)
+
+     res = load(out_file)
+     return res
+
+
+ def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
+     res = load(out_file) if osp.exists(out_file) else {}
+     rank, world_size = get_rank_and_world_size()
+     dataset_name = dataset.dataset_name
+
+     sample_indices = list(dataset.videos) if getattr(dataset, 'pack', False) else list(dataset.data['index'])
+     samples = list(dataset.videos) if getattr(dataset, 'pack', False) else list(range(len(dataset.data)))
+     sample_map = {i: s for i, s in zip(sample_indices, samples)}
+
+     sample_indices_sub = sample_indices[rank::world_size]
+     if np.all([idx in res for idx in sample_indices_sub]):
+         return model
+     sample_indices_subrem = [x for x in sample_indices_sub if x not in res]
+
+     model = supported_VLM[model_name]() if isinstance(model, str) else model
+
+     is_api = getattr(model, 'is_api', False)
+     if is_api:
+         assert world_size == 1
+         supp = infer_data_api(
+             model=model,
+             work_dir=work_dir,
+             model_name=model_name,
+             dataset=dataset,
+             samples_dict={k: sample_map[k] for k in sample_indices_subrem},
+             api_nproc=api_nproc)
+         for k in sample_indices_subrem:
+             assert k in supp
+         res.update(supp)
+         dump(res, out_file)
+         return model
+
+     assert not getattr(dataset, 'pack', False), 'The current model does not support pack mode!'
+     for i, idx in tqdm(enumerate(sample_indices_subrem)):
+         if idx in res:
+             continue
+         if getattr(model, 'nframe', None) is not None and getattr(model, 'nframe', 0) > 0:
+             if dataset.nframe > 0:
+                 if getattr(model, 'nframe', 0) != dataset.nframe:
+                     print(f'{model_name} is a video-llm model, nframe is set to {dataset.nframe}, not using default')
+                     setattr(model, 'nframe', dataset.nframe)
+             elif getattr(model, 'fps', 0) == 0:
+                 raise ValueError(f'fps is not suitable for {model_name}')
+             else:
+                 setattr(model, 'nframe', None)
+         if getattr(model, 'fps', None) is not None and getattr(model, 'fps', 0) > 0:
+             if dataset.fps > 0:
+                 if getattr(model, 'fps', 0) != dataset.fps:
+                     print(f'{model_name} is a video-llm model, fps is set to {dataset.fps}, not using default')
+                     setattr(model, 'fps', dataset.fps)
+             elif getattr(model, 'nframe', 0) == 0:
+                 raise ValueError(f'nframe is not suitable for {model_name}')
+             else:
+                 setattr(model, 'fps', None)
+         if 'SUB_DATASET' in dataset.data.iloc[sample_map[idx]]:
+             dataset_name = dataset.data.iloc[sample_map[idx]]['SUB_DATASET']
+         if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
+             if dataset.nframe == 0:
+                 raise ValueError(f'nframe must be set for custom prompt, fps is not suitable for {model_name}')
+             struct = model.build_prompt(
+                 dataset.data.iloc[sample_map[idx]], dataset=dataset, video_llm=getattr(model, 'VIDEO_LLM', False)
+             )
+         else:
+             struct = dataset.build_prompt(
+                 sample_map[idx], video_llm=getattr(model, 'VIDEO_LLM', False)
+             )
+         response = model.generate(message=struct, dataset=dataset_name)
+         torch.cuda.empty_cache()
+
+         if verbose:
+             print(response, flush=True)
+
+         res[idx] = response
+         if (i + 1) % 20 == 0:
+             dump(res, out_file)
+
+     res = {k: res[k] for k in sample_indices_sub}
+     dump(res, out_file)
+     return model
+
+
+ # A wrapper for infer_data that handles the pre- and post-processing
+ def infer_data_job_video(
+         model,
+         work_dir,
+         model_name,
+         dataset,
+         result_file_name,
+         verbose=False,
+         api_nproc=4):
+
+     dataset_name = dataset.dataset_name
+     rank, world_size = get_rank_and_world_size()
+     result_file = osp.join(work_dir, result_file_name)
+     # Skip inference entirely if the result file already exists
+     if osp.exists(result_file):
+         return model
+
+     tmpl = osp.join(work_dir, '{}' + f'{world_size}_{osp.splitext(result_file_name)[0]}.pkl')
+     out_file = tmpl.format(rank)
+
+     model = infer_data(
+         model=model,
+         model_name=model_name,
+         work_dir=work_dir,
+         dataset=dataset,
+         out_file=out_file,
+         verbose=verbose,
+         api_nproc=api_nproc)
+
+     if world_size > 1:
+         dist.barrier()
+
+     if rank == 0:
+         data_all = {}
+         for i in range(world_size):
+             data_all.update(load(tmpl.format(i)))
+
+         meta = dataset.data
+         if dataset_name == 'MMBench-Video' and getattr(dataset, 'pack', False):
+             meta, vstats = dataset.load_pack_answers(data_all)
+             print(f'Statistics of Pack Video Inference: {vstats}')
+         else:
+             for x in meta['index']:
+                 assert x in data_all
+             meta['prediction'] = [str(data_all[x]) for x in meta['index']]
+             if 'image' in meta:
+                 meta.pop('image')
+
+         dump(meta, result_file)
+         for i in range(world_size):
+             os.remove(tmpl.format(i))
+     return model
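The `nframe`/`fps` reconciliation inside the sampling loop above is easy to misread, so here is the same rule restated as a standalone sketch (this helper is not part of the file): a video dataset samples either a fixed number of frames (`nframe > 0`) or at a fixed rate (`fps > 0`), the dataset's setting always wins, and a `ValueError` is raised when the model cannot operate in the dataset's mode.

```python
# Hedged restatement of the frame-sampling sync performed in infer_data above.
def sync_frame_settings(model, dataset, model_name='model'):
    if getattr(model, 'nframe', None):            # model currently samples N frames
        if dataset.nframe > 0:
            model.nframe = dataset.nframe         # dataset's frame count wins
        elif getattr(model, 'fps', 0) == 0:
            raise ValueError(f'fps is not suitable for {model_name}')
        else:
            model.nframe = None                   # switch to fps-based sampling
    if getattr(model, 'fps', None):               # model currently samples by rate
        if dataset.fps > 0:
            model.fps = dataset.fps               # dataset's rate wins
        elif getattr(model, 'nframe', 0) == 0:
            raise ValueError(f'nframe is not suitable for {model_name}')
        else:
            model.fps = None                      # switch to nframe-based sampling
```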
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/tools.py ADDED
@@ -0,0 +1,468 @@
+ import sys
+ from vlmeval.dataset import SUPPORTED_DATASETS
+ from vlmeval.config import *
+ from vlmeval.smp import *
+
+ # Define valid modes
+ MODES = ('dlist', 'mlist', 'missing', 'circular', 'localize', 'check', 'run', 'eval', 'merge_pkl')
+
+ CLI_HELP_MSG = \
+     f"""
+     Arguments received: {str(['vlmutil'] + sys.argv[1:])}. vlmutil commands use the following syntax:
+
+         vlmutil MODE MODE_ARGS
+
+         Where MODE (required) is one of {MODES}
+               MODE_ARGS (optional) are the arguments for the specific mode
+
+     Some usages for vlmutil commands (see more by using -h for a specific command!):
+
+         1. List all the datasets by level: l1, l2, l3, etc.:
+             vlmutil dlist [l1/l2/l3/...]
+         2. List all the models by category: 4.33.0, 4.37.0, api, etc.:
+             vlmutil mlist 4.33.0 [all/small/large]
+         3. Report missing results:
+             vlmutil missing [l1/l2/l3/...]
+         4. Create circular questions (only for multiple-choice questions with no more than 4 choices):
+             vlmutil circular input.tsv
+         5. Create a localized version of the dataset (for very large tsv files):
+             vlmutil localize input.tsv
+         6. Check the validity of a model:
+             vlmutil check [model_name/model_series]
+         7. Run evaluation for missing results:
+             vlmutil run l2 hf
+         8. Evaluate a prediction file:
+             vlmutil eval [dataset_name] [prediction_file]
+         9. Merge pkl files:
+             vlmutil merge_pkl [pkl_dir] [world_size]
+
+     GitHub: https://github.com/open-compass/VLMEvalKit
+     """  # noqa: E501
+
+
+ dataset_levels = {
+     'l1': [
+         ('MMVet', 'gpt-4-turbo_score.csv'), ('MMMU_DEV_VAL', 'acc.csv'),
+         ('MathVista_MINI', 'gpt-4-turbo_score.csv'), ('HallusionBench', 'score.csv'),
+         ('OCRBench', 'score.json'), ('AI2D_TEST', 'acc.csv'), ('MMStar', 'acc.csv'),
+         ('MMBench_V11', 'acc.csv'), ('MMBench_CN_V11', 'acc.csv')
+     ],
+     'l2': [
+         ('MME', 'score.csv'), ('LLaVABench', 'score.csv'), ('RealWorldQA', 'acc.csv'),
+         ('MMBench', 'acc.csv'), ('MMBench_CN', 'acc.csv'), ('CCBench', 'acc.csv'),
+         ('SEEDBench_IMG', 'acc.csv'), ('COCO_VAL', 'score.json'), ('POPE', 'score.csv'),
+         ('ScienceQA_VAL', 'acc.csv'), ('ScienceQA_TEST', 'acc.csv'), ('MMT-Bench_VAL', 'acc.csv'),
+         ('SEEDBench2_Plus', 'acc.csv'), ('BLINK', 'acc.csv'), ('MTVQA_TEST', 'acc.json'),
+         ('Q-Bench1_VAL', 'acc.csv'), ('A-Bench_VAL', 'acc.csv'), ('R-Bench-Dis', 'acc.csv'),
+         ('MathVision', 'score.csv'), ('MathVerse_MINI_Vision_Only', 'score.csv'), ('DynaMath', 'score.csv'),
+     ],
+     'l3': [
+         ('OCRVQA_TESTCORE', 'acc.csv'), ('TextVQA_VAL', 'acc.csv'),
+         ('ChartQA_TEST', 'acc.csv'), ('DocVQA_VAL', 'acc.csv'), ('InfoVQA_VAL', 'acc.csv'),
+         ('SEEDBench2', 'acc.csv')
+     ]
+ }
+
+ dataset_levels['l12'] = dataset_levels['l1'] + dataset_levels['l2']
+ dataset_levels['l23'] = dataset_levels['l2'] + dataset_levels['l3']
+ dataset_levels['l123'] = dataset_levels['l12'] + dataset_levels['l3']
+
+ models = {
+     '4.37.0': ['MiniCPM-V', 'MiniCPM-V-2'],
+     '4.40.0': ['MiniCPM-Llama3-V-2_5'],
+     'latest': ['MiniCPM-V-2_6']
+ }
+
+ # SKIP_MODELS will be skipped in report_missing and run APIs
+ SKIP_MODELS = ['MiniCPM-V']
+
+ def completed(m, d, suf):
+     score_file = f'outputs/{m}/{m}_{d}_{suf}'
+     if osp.exists(score_file):
+         return True
+     if d == 'MMBench':
+         s1, s2 = f'outputs/{m}/{m}_MMBench_DEV_EN_{suf}', f'outputs/{m}/{m}_MMBench_TEST_EN_{suf}'
+         return osp.exists(s1) and osp.exists(s2)
+     elif d == 'MMBench_CN':
+         s1, s2 = f'outputs/{m}/{m}_MMBench_DEV_CN_{suf}', f'outputs/{m}/{m}_MMBench_TEST_CN_{suf}'
+         return osp.exists(s1) and osp.exists(s2)
+     return False
+
+
+ def DLIST(lvl):
+     if lvl in dataset_levels.keys():
+         return [x[0] for x in dataset_levels[lvl]]
+     else:
+         from vlmeval.dataset import SUPPORTED_DATASETS
+         return SUPPORTED_DATASETS
+
+
+ def MLIST(lvl, size='all'):
+     if lvl == 'all':
+         from vlmeval.config import supported_VLM
+         return [x for x in supported_VLM]
+
+     model_list = models[lvl]
+     if size == 'small':
+         model_list = [m for m in model_list if m not in LARGE_MODELS]
+     elif size == 'large':
+         model_list = [m for m in model_list if m in LARGE_MODELS]
+     return model_list
+
+
+ def MISSING(lvl):
+     from vlmeval.config import supported_VLM
+     models = list(supported_VLM)  # local list of model names; shadows the global dict on purpose
+     models = [m for m in models if m not in SKIP_MODELS and osp.exists(osp.join('outputs', m))]
+     if lvl in dataset_levels.keys():
+         data_list = dataset_levels[lvl]
+     else:
+         data_list = [(D, suff) for (D, suff) in dataset_levels['l123'] if D == lvl]
+     missing_list = []
+     for f in models:
+         for D, suff in data_list:
+             if not completed(f, D, suff):
+                 missing_list.append((f, D))
+     return missing_list
+
+
+ def CIRCULAR(inp):
+     assert inp.endswith('.tsv')
+     data = load(inp)
+     OFFSET = 1e6
+     while max(data['index']) >= OFFSET:
+         OFFSET *= 10
+
+     assert 'E' not in data, 'Currently CIRCULAR only works for up to 4-choice questions'
+     data_2c = data[pd.isna(data['C'])]
+     data_3c = data[~pd.isna(data['C']) & pd.isna(data['D'])]
+     data_4c = data[~pd.isna(data['D'])]
+     map_2c = [('AB', 'BA')]
+     map_3c = [('ABC', 'BCA'), ('ABC', 'CAB')]
+     map_4c = [('ABCD', 'BCDA'), ('ABCD', 'CDAB'), ('ABCD', 'DABC')]
+
+     def okn(o, n=4):  # reject options that merely reference the other choices, or contain 'all' / 'none'
+         ostr = o.replace(',', ' ')
+         osplits = ostr.split()
+         if sum([c in osplits for c in string.ascii_uppercase[:n - 1]]) == n - 1:
+             return False
+         olower = o.lower()
+         olower = olower.replace(',', ' ')
+         olower_splits = olower.split()
+         if 'all' in olower_splits or 'none' in olower_splits:
+             return False
+         return True
+
+     yay4, nay4 = [], []
+     lt4 = len(data_4c)
+     for i in range(lt4):
+         if okn(data_4c.iloc[i]['D'], 4):
+             yay4.append(i)
+         else:
+             nay4.append(i)
+     data_4c_y = data_4c.iloc[yay4]
+     data_4c_n = data_4c.iloc[nay4]
+     data_3c = pd.concat([data_4c_n, data_3c])
+
+     yay3, nay3 = [], []
+     lt3 = len(data_3c)
+     for i in range(lt3):
+         if okn(data_3c.iloc[i]['C'], 3):
+             yay3.append(i)
+         else:
+             nay3.append(i)
+     data_3c_y = data_3c.iloc[yay3]
+     data_3c_n = data_3c.iloc[nay3]
+     data_2c = pd.concat([data_3c_n, data_2c])
+
+     def remap(data_in, tup, off):
+         off = int(off)
+         data = data_in.copy()
+         char_map = {k: v for k, v in zip(*tup)}
+         idx = data.pop('index')
+         answer = data.pop('answer')
+         answer_new = [char_map[x] if x in char_map else x for x in answer]
+         data['answer'] = answer_new
+         options = {}
+         for c in char_map:
+             options[char_map[c]] = data.pop(c)
+         for c in options:
+             data[c] = options[c]
+         data.pop('image')
+         data['image'] = idx
+         idx = [x + off for x in idx]
+         data['index'] = idx
+         return data
+
+     data_all = pd.concat([
+         data_2c,
+         data_3c_y,
+         data_4c_y,
+         remap(data_2c, map_2c[0], OFFSET),
+         remap(data_3c_y, map_3c[0], OFFSET),
+         remap(data_4c_y, map_4c[0], OFFSET),
+         remap(data_3c_y, map_3c[1], OFFSET * 2),
+         remap(data_4c_y, map_4c[1], OFFSET * 2),
+         remap(data_4c_y, map_4c[2], OFFSET * 3),
+     ])
+
+     tgt_file = inp.replace('.tsv', '_CIRC.tsv')
+     dump(data_all, tgt_file)
+     print(f'The circularized data is saved to {tgt_file}')
+     assert osp.exists(tgt_file)
+     print(f'The MD5 for the circularized data is {md5(tgt_file)}')
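To make the `remap` above concrete, here is a hedged illustration of a single rotation pass (simplified: the `image` bookkeeping column is omitted). Under the `('ABCD', 'BCDA')` mapping every option letter is renamed to its successor, the answer letter moves with its option text, and the copy's index is offset by `OFFSET`, so the rotated variants of one question can later be grouped back together.

```python
# Hypothetical 4-choice row, assuming all indices are below OFFSET = 10**6.
row = dict(index=42, answer='B',
           A='cat', B='dog', C='fox', D='owl')

# After remap(row, ('ABCD', 'BCDA'), 10**6): option texts shift A->B, B->C,
# C->D, D->A, the answer letter follows its text, and the index is offset.
rotated = dict(index=1000042, answer='C',
               B='cat', C='dog', D='fox', A='owl')
```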
+
+
+ PTH = osp.realpath(__file__)
+ IMAGE_PTH = osp.join(osp.dirname(PTH), '../assets/apple.jpg')
+
+ msg1 = [
+     IMAGE_PTH,
+     'What is in this image?'
+ ]
+ msg2 = [
+     dict(type='image', value=IMAGE_PTH),
+     dict(type='text', value='What is in this image?')
+ ]
+ msg3 = [
+     IMAGE_PTH,
+     IMAGE_PTH,
+     'How many apples are there in these images?'
+ ]
+ msg4 = [
+     dict(type='image', value=IMAGE_PTH),
+     dict(type='image', value=IMAGE_PTH),
+     dict(type='text', value='How many apples are there in these images?')
+ ]
+
+
+ def CHECK(val):
+     if val in supported_VLM:
+         model = supported_VLM[val]()
+         print(f'Model: {val}')
+         for i, msg in enumerate([msg1, msg2, msg3, msg4]):
+             if i > 1 and not model.INTERLEAVE:
+                 continue
+             res = model.generate(msg)
+             print(f'Test {i + 1}: {res}')
+     elif val in models:
+         model_list = models[val]
+         for m in model_list:
+             CHECK(m)
+
+
+ def LOCALIZE(fname, new_fname=None):
+     if new_fname is None:
+         new_fname = fname.replace('.tsv', '_local.tsv')
+
+     base_name = osp.basename(fname)
+     dname = osp.splitext(base_name)[0]
+
+     data = load(fname)
+     data_new = localize_df(data, dname)
+     dump(data_new, new_fname)
+     print(f'The localized version of the data file is {new_fname}')
+     return new_fname
+
+
+ def RUN(lvl, model):
+     import torch
+     NGPU = torch.cuda.device_count()
+     SCRIPT = osp.join(osp.dirname(__file__), '../run.py')
+     logger = get_logger('Run Missing')
+
+     def get_env(name):
+         assert name in ['433', '437', '440', 'latest']
+         load_env()
+         env_key = f'ENV_{name}'
+         return os.environ.get(env_key, None)
+
+     missing = MISSING(lvl)
+     if model == 'all':
+         pass
+     elif model == 'api':
+         missing = [x for x in missing if x[0] in models.get('api', [])]
+     elif model == 'hf':
+         missing = [x for x in missing if x[0] not in models.get('api', [])]
+     elif model in models:
+         missing = [x for x in missing if x[0] in models[model]]
+     elif model in supported_VLM:
+         missing = [x for x in missing if x[0] == model]
+     else:
+         warnings.warn(f'Invalid model {model}.')
+
+     missing.sort(key=lambda x: x[0])
+     groups = defaultdict(list)
+     for m, D in missing:
+         groups[m].append(D)
+     for m in groups:
+         if m in SKIP_MODELS:
+             continue
+         for dataset in groups[m]:
+             logger.info(f'Running {m} on {dataset}')
+             exe = 'python' if m in LARGE_MODELS or m in models.get('api', []) else 'torchrun'
+             if m not in models.get('api', []):
+                 env = None
+                 env = 'latest' if m in models.get('latest', []) else env
+                 env = '433' if m in models.get('4.33.0', []) else env
+                 env = '437' if m in models.get('4.37.0', []) else env
+                 env = '440' if m in models.get('4.40.0', []) else env
+                 if env is None:
+                     # Not found, default to latest
+                     env = 'latest'
+                     logger.warning(
+                         f"Model {m} does not have a specific environment configuration. Defaulting to 'latest'.")
+                 pth = get_env(env)
+                 if pth is not None:
+                     exe = osp.join(pth, 'bin', exe)
+                 else:
+                     logger.warning(f'Cannot find the env path {env} for model {m}')
+             if exe.endswith('torchrun'):
+                 cmd = f'{exe} --nproc-per-node={NGPU} {SCRIPT} --model {m} --data {dataset}'
+             elif exe.endswith('python'):
+                 cmd = f'{exe} {SCRIPT} --model {m} --data {dataset}'
+             os.system(cmd)
+
+
+ def EVAL(dataset_name, data_file, **kwargs):
+     from vlmeval.dataset import build_dataset
+     logger = get_logger('VLMEvalKit Tool-Eval')
+     dataset = build_dataset(dataset_name)
+     # Set the judge kwargs first before evaluation or dumping
+     judge_kwargs = {'nproc': 4, 'verbose': True}
+     if 'model' not in kwargs:
+         if dataset.TYPE in ['MCQ', 'Y/N']:
+             judge_kwargs['model'] = 'chatgpt-0125'
+         elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
+             judge_kwargs['model'] = 'gpt-4-turbo'
+         elif listinstr(['MMLongBench', 'MMDU'], dataset_name):
+             judge_kwargs['model'] = 'gpt-4o'
+         elif listinstr(['DynaMath', 'MathVerse', 'MathVista', 'MathVision'], dataset_name):
+             judge_kwargs['model'] = 'gpt-4o-mini'
+     else:
+         judge_kwargs['model'] = kwargs['model']
+     judge_kwargs['nproc'] = kwargs.get('nproc', 4)
+     eval_results = dataset.evaluate(data_file, **judge_kwargs)
+     if eval_results is not None:
+         assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
+         logger.info('Evaluation Results:')
+         if isinstance(eval_results, dict):
+             logger.info('\n' + json.dumps(eval_results, indent=4))
+         elif isinstance(eval_results, pd.DataFrame):
+             logger.info('\n')
+             logger.info(tabulate(eval_results.T) if len(eval_results) < len(eval_results.columns) else eval_results)
+     return eval_results
+
+
+ def parse_args_eval():
+     parser = argparse.ArgumentParser()
+     # Essential Args, Setting the Names of Datasets and Models
+     parser.add_argument('cmd', type=str)
+     parser.add_argument('data_file', type=str)
+     parser.add_argument('--judge', type=str, default=None)
+     parser.add_argument('--nproc', type=int, default=4)
+     parser.add_argument('--retry', type=int, default=None)
+     args = parser.parse_args()
+     return args
+
+
+ def MERGE_PKL(pkl_dir, world_size=1):
+     prefs = []
+     for ws in list(range(1, 9)):
+         prefs.extend([f'{i}{ws}_' for i in range(ws)])
+     prefs = set(prefs)
+     files = os.listdir(pkl_dir)
+     files = [x for x in files if x[:3] in prefs]
+     # Merge the files
+     res_all = defaultdict(dict)
+     for f in files:
+         full_path = osp.join(pkl_dir, f)
+         key = f[3:]
+         res_all[key].update(load(full_path))
+         os.remove(full_path)
+
+     dump_prefs = [f'{i}{world_size}_' for i in range(world_size)]
+     for k in res_all:
+         for pf in dump_prefs:
+             dump(res_all[k], f'{pkl_dir}/{pf}{k}')
+         print(f'Merged {len(res_all[k])} records into {pkl_dir}/{dump_prefs[0]}{k}')
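The `{rank}{world_size}_` prefix convention that `MERGE_PKL` relies on is terse, so here is a hedged usage example; the directory and dataset names are placeholders, not part of this commit.

```python
from vlmeval.tools import MERGE_PKL

# Suppose a 2-GPU run left two shards under outputs/demo:
#   02_MMBench_DEV_EN.pkl   (rank 0 of world size 2)
#   12_MMBench_DEV_EN.pkl   (rank 1 of world size 2)
# This pools both shards and rewrites their union as 01_MMBench_DEV_EN.pkl,
# so a later single-process run resumes with every finished record.
MERGE_PKL('outputs/demo', world_size=1)
```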
+
+
+ def cli():
+     logger = get_logger('VLMEvalKit Tools')
+     args = sys.argv[1:]
+     if not args:  # no arguments passed
+         logger.info(CLI_HELP_MSG)
+         return
+
+     if args[0].lower() == 'dlist':
+         assert len(args) >= 2
+         lst = DLIST(args[1])
+         print(' '.join(lst))
+     elif args[0].lower() == 'mlist':
+         assert len(args) >= 2
+         size = 'all'
+         if len(args) > 2:
+             size = args[2].lower()
+         lst = MLIST(args[1], size)
+         print('\n'.join(lst))
+     elif args[0].lower() == 'missing':
+         assert len(args) >= 2
+         missing_list = MISSING(args[1])
+         logger = get_logger('Find Missing')
+         logger.info(colored(f'Level {args[1]} Missing Results: ', 'red'))
+         lines = []
+         for m, D in missing_list:
+             line = f'Model {m}, Dataset {D}'
+             logger.info(colored(line, 'red'))
+             lines.append(line)
+         mwlines(lines, f'{args[1]}_missing.txt')
+     elif args[0].lower() == 'circular':
+         assert len(args) >= 2
+         CIRCULAR(args[1])
+     elif args[0].lower() == 'localize':
+         assert len(args) >= 2
+         LOCALIZE(args[1])
+     elif args[0].lower() == 'check':
+         assert len(args) >= 2
+         model_list = args[1:]
+         for m in model_list:
+             CHECK(m)
+     elif args[0].lower() == 'run':
+         assert len(args) >= 2
+         lvl = args[1]
+         if len(args) == 2:
+             model = 'all'
+             RUN(lvl, model)
+         else:
+             for model in args[2:]:
+                 RUN(lvl, model)
+     elif args[0].lower() == 'eval':
+         args = parse_args_eval()
+         data_file = args.data_file
+
+         def extract_dataset(file_name):
+             fname = osp.splitext(file_name)[0].split('/')[-1]
+             parts = fname.split('_')
+             for i in range(len(parts)):
+                 if '_'.join(parts[i:]) in SUPPORTED_DATASETS:
+                     return '_'.join(parts[i:])
+             return None
+
+         dataset = extract_dataset(data_file)
+         assert dataset is not None, f'Cannot infer dataset name from {data_file}'
+         kwargs = {'nproc': args.nproc}
+         if args.judge is not None:
+             kwargs['model'] = args.judge
+         if args.retry is not None:
+             kwargs['retry'] = args.retry
+         EVAL(dataset_name=dataset, data_file=data_file, **kwargs)
+     elif args[0].lower() == 'merge_pkl':
+         assert len(args) == 3
+         args[2] = int(args[2])
+         assert args[2] in [1, 2, 4, 8]
+         MERGE_PKL(args[1], args[2])
+     else:
+         logger.error('Unsupported command!')
+         logger.info(CLI_HELP_MSG)
+     return
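Finally, a short sketch of how this module's `cli` entry point can be exercised. The `vlmutil` console-script name comes from the help text above; driving the dispatcher through `sys.argv` is just an illustration under that assumption.

```python
# Hedged sketch: invoke the tool dispatcher programmatically instead of via
# the 'vlmutil' console script. Arguments mirror the CLI_HELP_MSG examples.
import sys
from vlmeval.tools import cli

sys.argv = ['vlmutil', 'dlist', 'l1']      # print all level-1 dataset names
cli()

sys.argv = ['vlmutil', 'missing', 'l12']   # report missing l1 + l2 results
cli()
```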