Add files using upload-large-folder tool
- r1-a/response_generation/minicpm/MiniCPM-o/docs/wechat.md +6 -0
- r1-a/response_generation/minicpm/MiniCPM-o/docs/xinference_infer.md +67 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README.md +543 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README_zh.md +537 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/.env +28 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements.txt +30 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements/docs.txt +11 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/run.py +424 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/scripts/run_inference.sh +41 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/setup.py +122 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/__init__.py +16 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/__init__.py +5 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/base.py +289 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/gpt.py +267 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/config.py +20 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/dataset/__init__.py +237 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference.py +188 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_mt.py +182 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_video.py +183 -0
- r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/tools.py +468 -0
r1-a/response_generation/minicpm/MiniCPM-o/docs/wechat.md
ADDED
@@ -0,0 +1,6 @@
<div align="center">
  <img src="../assets/wechat-QR.jpeg" width="60%"/>

  <p> 扫码加入「MiniCPM-o 交流群」 </p>
  <p> Scan the QR code to join the "MiniCPM-o Discussion Group" </p>
</div>
r1-a/response_generation/minicpm/MiniCPM-o/docs/xinference_infer.md
ADDED
@@ -0,0 +1,67 @@
## Xinference Infer

Xinference is a unified inference platform that provides a single interface over different inference engines. It supports LLMs, text generation, image generation, and more, and it is not much heavier than Swift.

### Install Xinference

Xinference can be installed with a single pip command:
```shell
pip install "xinference[all]"
```

### Quick start

When running inference with Xinference for the first time, the model is downloaded during the first launch.
1. Start Xinference in the terminal:
```shell
xinference
```
2. Open the web UI.
3. Search for "MiniCPM-Llama3-V-2_5" in the search box.

![]()

4. Find and click the MiniCPM-Llama3-V-2_5 button.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. The first time you click the launch button, Xinference downloads the model from Hugging Face. Once the model is ready, click the web UI button.

![]()

7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
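If you prefer a scriptable check over the web UI, Xinference also serves an OpenAI-compatible REST API. The following is a minimal sketch, assuming the default endpoint `http://127.0.0.1:9997` and that the model UID equals `MiniCPM-Llama3-V-2_5`; look up the real UID with `xinference list` or in the web UI:

```shell
# Minimal sketch: send a chat request through Xinference's
# OpenAI-compatible endpoint. Host, port, and model UID are assumptions;
# adjust them to match your deployment.
curl http://127.0.0.1:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniCPM-Llama3-V-2_5",
        "messages": [{"role": "user", "content": "Hello, who are you?"}]
      }'
```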
### Local MiniCPM-Llama3-V-2_5 Launch

If you have already downloaded the MiniCPM-Llama3-V-2_5 model locally, you can run inference with Xinference as follows:
1. Start Xinference:
```shell
xinference
```
2. Open the web UI.
3. Register a new model. The settings highlighted in red are fixed and cannot be changed; the others can be customized as needed. Click the 'Register Model' button to finish.

![]()
![]()

4. After completing the model registration, go to 'Custom Models' and locate the model you just registered.
5. Launch the model with the following configuration:
```plaintext
Model engine : Transformers
Model format : pytorch
Model size : 8
Quantization : none
N-GPU : auto
Replica : 1
```
6. After you click the launch button, Xinference loads the registered local model. Once it is ready, click the chat button.
![]()
7. Upload an image and chat with MiniCPM-Llama3-V-2_5.
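If you would rather launch the registered model from the terminal than click through the web UI, recent Xinference releases expose a `launch` subcommand. The flag names below are an assumption about your installed version, so confirm them with `xinference launch --help`:

```shell
# Illustrative sketch mirroring the web UI configuration above;
# flag names may differ across Xinference versions.
xinference launch \
  --model-name "MiniCPM-Llama3-V-2_5" \
  --model-format pytorch \
  --size-in-billions 8
```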
### FAQ

1. Why doesn't the web UI open in step 6?

   Your firewall or macOS security settings may be preventing the page from opening.
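A quick way to tell a blocked browser from a server that never started is to probe the endpoint directly; this check assumes Xinference's default port 9997:

```shell
# If this returns an HTTP status line, the server is up and the problem
# is on the browser/firewall side; otherwise the server did not start.
curl -I http://127.0.0.1:9997
```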
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README.md
ADDED
@@ -0,0 +1,543 @@
# Evaluation

## MiniCPM-o 2.6

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
<br />
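Before launching any evaluation, it can be worth confirming that the pinned CUDA 11.8 wheels imported correctly; this one-liner is a convenience check added here, not part of the original scripts:

```bash
# Expect torch 2.2.0+cu118, torchvision 0.17.0+cu118, and CUDA available.
python -c "import torch, torchvision, flash_attn; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"
```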

Then, run `scripts/run_inference.sh`, which receives two input parameters in sequence: `MODELNAME` and `DATALIST`. `MODELNAME` represents the name of the model, and `DATALIST` represents the datasets used for inference:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST
```
<br />

The five available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating on multiple datasets at a time, separate the names of different datasets with spaces and add quotation marks at both ends:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />

When a benchmark requires a GPT-series model for scoring, specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in the `.env` file.
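For reference, the two entries look like this (placeholder values; the full template is the `vlmevalkit/.env` file added later in this commit):

```bash
# vlmevalkit/.env
OPENAI_API_BASE=<your-api-base-url>
OPENAI_API_KEY=<your-api-key>
```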
In order to reproduce the results on the OpenCompass benchmarks together with ChartQA and MME, as displayed in the table on the homepage (columns between OCRBench and HallusionBench), run the script with the following settings:
```bash
# Please note that we use different prompts for the perception and reasoning sets of MME. Evaluating the reasoning subset requires CoT, so you need to manually modify the judgment condition of the use_cot function in vlmeval/vlm/minicpm_v.py
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
```
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-o-2_6`, set `model_name` to `minicpmo26`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from a checkpoint
--ckpt
# how the model processes input data: "interleave" means interleaved image-text form, "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

When evaluating on the different tasks, set the parameters as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path to the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
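For example, you would point the two variables inside `shell/run_transform.sh` at your own files before running it; the paths below are hypothetical placeholders:

```bash
# Hypothetical paths for illustration only; use your actual output files.
input_file_path=./answers/docVQATest/minicpmo26.json
output_file_path=./answers/docVQATest/minicpmo26_submission.json
```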

<br />

## MiniCPM-V 2.6

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install --upgrade pip
pip install -e .
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents the evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

The four available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. Separate the names of different datasets with spaces and add quotation marks at both ends:
```bash
DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. In order to reproduce the results in the table displayed on the homepage (columns between MME and HallusionBench), run the script with the following settings:
```bash
# without CoT
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
# with CoT
# When running the CoT version of MME, you need to modify the use_cot function in vlmeval/vlm/minicpm_v.py and add MME to the branch that returns True.
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
```
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-V-2_6`, set `model_name` to `minicpmv26`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from a checkpoint
--ckpt
# how the model processes input data: "interleave" means interleaved image-text form, "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

When evaluating on the different tasks, set the parameters as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path to the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>

<br />
## MiniCPM-Llama3-V-2_5

<details>
<summary>Expand</summary>

### opencompass
First, enter the `vlmevalkit` directory and install all dependencies:
```bash
cd vlmevalkit
pip install -r requirements.txt
```
<br />

Then, run `scripts/run_inference.sh`, which receives three input parameters in sequence: `MODELNAME`, `DATALIST`, and `MODE`. `MODELNAME` represents the name of the model, `DATALIST` represents the datasets used for inference, and `MODE` represents the evaluation mode:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
```
<br />

The three available choices for `MODELNAME` are listed in `vlmeval/config.py`:
```bash
ungrouped = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
}
```
<br />

All available choices for `DATALIST` are listed in `vlmeval/utils/dataset_config.py`. When evaluating on a single dataset, pass the dataset name directly without quotation marks; when evaluating on multiple datasets, separate the names of different datasets with spaces and add quotation marks at both ends:
```bash
DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
```
<br />

To score each benchmark directly, set `MODE=all`. If only inference results are required, set `MODE=infer`. In order to reproduce the results in the table displayed on the homepage (columns between MME and RealWorldQA), run the script with the following settings:
```bash
# run on all 7 datasets
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all

# The following are commands for running on a single dataset
# MME
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
# MMBench_TEST_EN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
# MMBench_TEST_CN
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
# MMMU_DEV_VAL
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
# MathVista_MINI
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
# LLaVABench
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
# RealWorldQA
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
```
<br />

### vqadataset
First, enter the `vqaeval` directory and install all dependencies. Then, create a `downloads` subdirectory to store the downloaded datasets for all tasks:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Download the datasets from the following links and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest

```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download Images and Annotations from Task 1 - Single Page Document Visual Question Answering at https://rrc.cvc.uab.es/?ch=17&com=downloads
# Move spdocvqa_images.tar.gz and spdocvqa_qas.zip to the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized according to the following structure:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

Modify the parameters in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

All optional parameters are listed in `eval_utils/getargs.py`. The meanings of the major parameters are as follows.
For `MiniCPM-Llama3-V-2_5`, set `model_name` to `minicpmv`:
```bash
# paths to images and their corresponding questions
# TextVQA
--textVQA_image_dir
--textVQA_ann_path
# DocVQA
--docVQA_image_dir
--docVQA_ann_path
# DocVQATest
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate on a certain task
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path
--model_name
--model_path
# load model from a checkpoint
--ckpt
# how the model processes input data: "interleave" means interleaved image-text form, "old" means non-interleaved
--generate_method

--batchsize

# path to save the outputs
--answer_path
```
<br />

When evaluating on the different tasks, set the parameters as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```

<br />

For the DocVQATest task, in order to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, run `shell/run_transform.sh` for format transformation after inference. `input_file_path` is the path to the original output json, and `output_file_path` is the path to the transformed json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```

</details>
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/README_zh.md
ADDED
|
@@ -0,0 +1,537 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Evaluation
|
| 2 |
+
|
| 3 |
+
## MiniCPM-o 2.6
|
| 4 |
+
|
| 5 |
+
### opencompass
|
| 6 |
+
首先,进入 `vlmevalkit` 目录下,安装必要的依赖:
|
| 7 |
+
```bash
|
| 8 |
+
cd vlmevalkit
|
| 9 |
+
pip install --upgrade pip
|
| 10 |
+
pip install -e .
|
| 11 |
+
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
| 12 |
+
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
| 13 |
+
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
| 14 |
+
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
| 15 |
+
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
| 16 |
+
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
| 17 |
+
rm *.whl
|
| 18 |
+
```
|
| 19 |
+
<br />
|
| 20 |
+
|
| 21 |
+
然后,运行 `scripts/run_inference.sh`,该脚本依次接收两个输入参数:`MODELNAME`, `DATALIST`。其中,`MODELNAME` 为模型名称,`DATALIST` 为目标数据集。
|
| 22 |
+
```bash
|
| 23 |
+
chmod +x ./scripts/run_inference.sh
|
| 24 |
+
./scripts/run_inference.sh $MODELNAME $DATALIST
|
| 25 |
+
```
|
| 26 |
+
<br />
|
| 27 |
+
|
| 28 |
+
`MODELNAME` 有五种选择,位于 `vlmeval/config.py` 中:
|
| 29 |
+
```bash
|
| 30 |
+
minicpm_series = {
|
| 31 |
+
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
| 32 |
+
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
| 33 |
+
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
| 34 |
+
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
| 35 |
+
'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
|
| 36 |
+
}
|
| 37 |
+
```
|
| 38 |
+
<br />
|
| 39 |
+
|
| 40 |
+
可选的所有 `DATALIST` 位于 `vlmeval/utils/dataset_config.py` 中。一次评测多个数据集时,将不同数据集名称以空格隔开,两端加引号:
|
| 41 |
+
```bash
|
| 42 |
+
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST"
|
| 43 |
+
```
|
| 44 |
+
<br />
|
| 45 |
+
|
| 46 |
+
当评测的 benchmark 需要 GPT 系列模型进行评分时,请在 `.env` 文件中预先指定 `OPENAI_API_BASE` 和 `OPENAI_API_KEY`。
|
| 47 |
+
为了复现出首页展示的表格中 OpenCompass 对应的各项数据集以及 ChartQA 和 MME 上的结果(OCRBench 到 HallusionBench 之间的列),需要按照如下设置运行:
|
| 48 |
+
```bash
|
| 49 |
+
# 请注意,对于 MME 的 perception 和 reasoning 集,我们采取了不同的 prompt 方式。评测 reasoning 子集时,需要使用 CoT,因此需要手动到 vlmeval/vlm/minicpm_v.py 中修改 use_cot 函数的判断条件
|
| 50 |
+
./scripts/run_inference.sh MiniCPM-o-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar HallusionBench AI2D_TEST OCRBench ChartQA_TEST MME"
|
| 51 |
+
```
|
| 52 |
+
<br />
|
| 53 |
+
|
| 54 |
+
### vqadataset
|
| 55 |
+
首先,进入 `vqaeval` 目录下,安装必要的依赖,并创建 `downloads` 子目录,用于存储下载的数据集:
|
| 56 |
+
```bash
|
| 57 |
+
cd vqaeval
|
| 58 |
+
pip install -r requirements.txt
|
| 59 |
+
mkdir downloads
|
| 60 |
+
```
|
| 61 |
+
<br />
|
| 62 |
+
|
| 63 |
+
然后,从下列各地址下载数据集并置于指定目录下:
|
| 64 |
+
###### TextVQA
|
| 65 |
+
```bash
|
| 66 |
+
cd downloads
|
| 67 |
+
mkdir TextVQA && cd TextVQA
|
| 68 |
+
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
| 69 |
+
unzip train_val_images.zip && rm train_val_images.zip
|
| 70 |
+
mv train_val_images/train_images . && rm -rf train_val_images
|
| 71 |
+
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
| 72 |
+
cd ../..
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
###### DocVQA / DocVQATest
|
| 76 |
+
```bash
|
| 77 |
+
cd downloads
|
| 78 |
+
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
| 79 |
+
# 在 https://rrc.cvc.uab.es/?ch=17&com=downloads 下载 Task 1 - Single Page Document Visual Question Answering 下的 Images 和 Annotations
|
| 80 |
+
# 将下载得到的 spdocvqa_images.tar.gz 以及 spdocvqa_qas.zip 置于 DocVQA 目录下
|
| 81 |
+
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
| 82 |
+
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
| 83 |
+
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
| 84 |
+
cd ../..
|
| 85 |
+
```
|
| 86 |
+
<br />
|
| 87 |
+
|
| 88 |
+
`downloads` 目录应当按照下列结构组织:
|
| 89 |
+
```bash
|
| 90 |
+
downloads
|
| 91 |
+
├── TextVQA
|
| 92 |
+
│ ├── train_images
|
| 93 |
+
│ │ ├── ...
|
| 94 |
+
│ ├── TextVQA_0.5.1_val.json
|
| 95 |
+
├── DocVQA
|
| 96 |
+
│ ├── spdocvqa_images
|
| 97 |
+
│ │ ├── ...
|
| 98 |
+
│ ├── val_v1.0_withQT.json
|
| 99 |
+
│ ├── test_v1.0.json
|
| 100 |
+
```
|
| 101 |
+
<br />
|
| 102 |
+
|
| 103 |
+
准备好相应的数据集之后,修改 `shell/run_inference.sh` 的参数,运行推理:
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
chmod +x ./shell/run_inference.sh
|
| 107 |
+
./shell/run_inference.sh
|
| 108 |
+
```
|
| 109 |
+
<br />
|
| 110 |
+
|
| 111 |
+
可以传入的参数位于 `eval_utils/getargs.py` 中,各主要参数的含义如下。
|
| 112 |
+
对于 `MiniCPM-o-2_6`,需要将 `model_name`设置为 `minicpmo26`:
|
| 113 |
+
```bash
|
| 114 |
+
# 指定 TextVQA 评测所有图片和问题的路径
|
| 115 |
+
--textVQA_image_dir
|
| 116 |
+
--textVQA_ann_path
|
| 117 |
+
# 指定 DocVQA 评测所有图片和问题的路径
|
| 118 |
+
--docVQA_image_dir
|
| 119 |
+
--docVQA_ann_path
|
| 120 |
+
# 指定 DocVQATest 评测所有图片和问题的路径
|
| 121 |
+
--docVQATest_image_dir
|
| 122 |
+
--docVQATest_ann_path
|
| 123 |
+
|
| 124 |
+
# 决定是否评测某��任务,eval_all 设置为 True 表示所有任务都评测
|
| 125 |
+
--eval_textVQA
|
| 126 |
+
--eval_docVQA
|
| 127 |
+
--eval_docVQATest
|
| 128 |
+
--eval_all
|
| 129 |
+
|
| 130 |
+
# 模型名称、模型路径(从指定路径加载模型)
|
| 131 |
+
--model_name
|
| 132 |
+
--model_path
|
| 133 |
+
# 从 checkpoint 加载模型
|
| 134 |
+
--ckpt
|
| 135 |
+
# 模型处理输入数据的方式,interleave 表示图文交错式,old 表示非交错式
|
| 136 |
+
--generate_method
|
| 137 |
+
# 推理时的批处理规模,建议推理时设置为 1
|
| 138 |
+
--batchsize
|
| 139 |
+
|
| 140 |
+
# 输出内容保存的路径
|
| 141 |
+
--answer_path
|
| 142 |
+
```
|
| 143 |
+
<br />
|
| 144 |
+
|
| 145 |
+
评测三个任务需要设置的参数如下:
|
| 146 |
+
###### TextVQA
|
| 147 |
+
```bash
|
| 148 |
+
--eval_textVQA
|
| 149 |
+
--textVQA_image_dir ./downloads/TextVQA/train_images
|
| 150 |
+
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
###### DocVQA
|
| 154 |
+
```bash
|
| 155 |
+
--eval_docVQA
|
| 156 |
+
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
| 157 |
+
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
###### DocVQATest
|
| 161 |
+
```bash
|
| 162 |
+
--eval_docVQATest
|
| 163 |
+
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
| 164 |
+
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
| 165 |
+
```
|
| 166 |
+
<br />
|
| 167 |
+
|
| 168 |
+
对于 DocVQATest 任务,为了将推理结果上传到[官方网站](https://rrc.cvc.uab.es/?ch=17)进行评测,还需要运行 `shell/run_transform.sh` 进行格式转换。其中,`input_file_path` 对应原始输出的 json 的路径,`output_file_path` 为自定义的转换后的 json 的路径:
|
| 169 |
+
```bash
|
| 170 |
+
chmod +x ./shell/run_transform.sh
|
| 171 |
+
./shell/run_transform.sh
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
<br />
|
| 175 |
+
|
| 176 |
+
## MiniCPM-V 2.6
|
| 177 |
+
|
| 178 |
+
<details>
|
| 179 |
+
<summary>展开</summary>
|
| 180 |
+
|
| 181 |
+
### opencompass
|
| 182 |
+
首先,进入 `vlmevalkit` 目录下,安装必要的依赖:
|
| 183 |
+
```bash
|
| 184 |
+
cd vlmevalkit
|
| 185 |
+
pip install --upgrade pip
|
| 186 |
+
pip install -e .
|
| 187 |
+
wget https://download.pytorch.org/whl/cu118/torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=4377e0a7fe8ff8ffc4f7c9c6130c1dcd3874050ae4fc28b7ff1d35234fbca423
|
| 188 |
+
wget https://download.pytorch.org/whl/cu118/torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=2e63d62e09d9b48b407d3e1b30eb8ae4e3abad6968e8d33093b60d0657542428
|
| 189 |
+
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
| 190 |
+
pip install torch-2.2.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
| 191 |
+
pip install torchvision-0.17.0%2Bcu118-cp310-cp310-linux_x86_64.whl
|
| 192 |
+
pip install flash_attn-2.6.3+cu118torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
|
| 193 |
+
rm *.whl
|
| 194 |
+
```
|
| 195 |
+
<br />
|
| 196 |
+
|
| 197 |
+
然后,运行 `scripts/run_inference.sh`,该脚本依次接收三个输入参数:`MODELNAME`, `DATALIST`, `MODE`。`MODELNAME` 为模型名称,`DATALIST` 为目标数据集,`MODE` 为评测模式。
|
| 198 |
+
```bash
|
| 199 |
+
chmod +x ./scripts/run_inference.sh
|
| 200 |
+
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
| 201 |
+
```
|
| 202 |
+
<br />
|
| 203 |
+
|
| 204 |
+
`MODELNAME` 有四种选择,位于 `vlmeval/config.py` 中:
|
| 205 |
+
```bash
|
| 206 |
+
minicpm_series = {
|
| 207 |
+
'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
| 208 |
+
'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
| 209 |
+
'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
| 210 |
+
'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
|
| 211 |
+
}
|
| 212 |
+
```
|
| 213 |
+
<br />
|
| 214 |
+
|
| 215 |
+
可选的所有 `DATALIST` 位于 `vlmeval/utils/dataset_config.py` 中。将不同数据集名称以空格隔开,两端加引号:
|
| 216 |
+
```bash
|
| 217 |
+
$DATALIST="MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST"
|
| 218 |
+
```
|
| 219 |
+
<br />
|
| 220 |
+
|
| 221 |
+
直接对各 benchmark 进行评分时,设置 `MODE=all`。如果仅需要推理结果,则设置 `MODE=infer`。
|
| 222 |
+
为了复现出首页展示的表格中的各项结果(MME 到 HallusionBench 之间的列),需要按照如下设置运行:
|
| 223 |
+
```bash
|
| 224 |
+
# without CoT
|
| 225 |
+
./scripts/run_inference.sh MiniCPM-V-2_6 "MMMU_DEV_VAL MathVista_MINI MMVet MMBench_DEV_EN_V11 MMBench_DEV_CN_V11 MMStar HallusionBench AI2D_TEST" all
|
| 226 |
+
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
| 227 |
+
# with CoT,运行 CoT 版本的 MME 时,需要改写 vlmeval/vlm/minicpm_v.py 中的 'use_cot' 函数,将 MME 添加到 return True 的分支中
|
| 228 |
+
./scripts/run_inference/sh MiniCPM-V-2_6 "MMMU_DEV_VAL MMVet MMStar HallusionBench OCRBench" all
|
| 229 |
+
./scripts/run_inference.sh MiniCPM-V-2_6 MME all
|
| 230 |
+
```
|
| 231 |
+
<br />
|
| 232 |
+
|
| 233 |
+
### vqadataset
|
| 234 |
+
首先,进入 `vqaeval` 目录下,安装必要的依赖,并创建 `downloads` 子目录,用于存储下载的数据集:
|
| 235 |
+
```bash
|
| 236 |
+
cd vqaeval
|
| 237 |
+
pip install -r requirements.txt
|
| 238 |
+
mkdir downloads
|
| 239 |
+
```
|
| 240 |
+
<br />
|
| 241 |
+
|
| 242 |
+
然后,从下列各地址下载数据集并置于指定目录下:
|
| 243 |
+
###### TextVQA
|
| 244 |
+
```bash
|
| 245 |
+
cd downloads
|
| 246 |
+
mkdir TextVQA && cd TextVQA
|
| 247 |
+
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
|
| 248 |
+
unzip train_val_images.zip && rm train_val_images.zip
|
| 249 |
+
mv train_val_images/train_images . && rm -rf train_val_images
|
| 250 |
+
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
|
| 251 |
+
cd ../..
|
| 252 |
+
```
|
| 253 |
+
|
| 254 |
+
###### DocVQA / DocVQATest
|
| 255 |
+
```bash
|
| 256 |
+
cd downloads
|
| 257 |
+
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
|
| 258 |
+
# 在 https://rrc.cvc.uab.es/?ch=17&com=downloads 下载 Task 1 - Single Page Document Visual Question Answering 下的 Images ��� Annotations
|
| 259 |
+
# 将下载得到的 spdocvqa_images.tar.gz 以及 spdocvqa_qas.zip 置于 DocVQA 目录下
|
| 260 |
+
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
|
| 261 |
+
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
|
| 262 |
+
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
|
| 263 |
+
cd ../..
|
| 264 |
+
```
|
| 265 |
+
<br />
|
| 266 |
+
|
| 267 |
+
`downloads` 目录应当按照下列结构组织:
|
| 268 |
+
```bash
|
| 269 |
+
downloads
|
| 270 |
+
├── TextVQA
|
| 271 |
+
│ ├── train_images
|
| 272 |
+
│ │ ├── ...
|
| 273 |
+
│ ├── TextVQA_0.5.1_val.json
|
| 274 |
+
├── DocVQA
|
| 275 |
+
│ ├── spdocvqa_images
|
| 276 |
+
│ │ ├── ...
|
| 277 |
+
│ ├── val_v1.0_withQT.json
|
| 278 |
+
│ ├── test_v1.0.json
|
| 279 |
+
```
|
| 280 |
+
<br />
|
| 281 |
+
|
| 282 |
+
准备好相应的数据集之后,修改 `shell/run_inference.sh` 的参数,运行推理:
|
| 283 |
+
|
| 284 |
+
```bash
|
| 285 |
+
chmod +x ./shell/run_inference.sh
|
| 286 |
+
./shell/run_inference.sh
|
| 287 |
+
```
|
| 288 |
+
<br />
|
| 289 |
+
|
| 290 |
+
可以传入的参数位于 `eval_utils/getargs.py` 中,各主要参数的含义如下。
|
| 291 |
+
对于 `MiniCPM-V-2_6`,需要将 `model_name`设置为 `minicpmv26`:
|
| 292 |
+
```bash
|
| 293 |
+
# 指定 TextVQA 评测所有图片和问题的路径
|
| 294 |
+
--textVQA_image_dir
|
| 295 |
+
--textVQA_ann_path
|
| 296 |
+
# 指定 DocVQA 评测所有图片和问题的路径
|
| 297 |
+
--docVQA_image_dir
|
| 298 |
+
--docVQA_ann_path
|
| 299 |
+
# 指定 DocVQATest 评测所有图片和问题的路径
|
| 300 |
+
--docVQATest_image_dir
|
| 301 |
+
--docVQATest_ann_path
|
| 302 |
+
|
| 303 |
+
# 决定是否评测某个任务,eval_all 设置为 True 表示所有任务都评测
|
| 304 |
+
--eval_textVQA
|
| 305 |
+
--eval_docVQA
|
| 306 |
+
--eval_docVQATest
|
| 307 |
+
--eval_all
|
| 308 |
+
|
| 309 |
+
# 模型名称、模型路径(从指定路径加载模型)
|
| 310 |
+
--model_name
|
| 311 |
+
--model_path
|
| 312 |
+
# 从 checkpoint 加载模型
|
| 313 |
+
--ckpt
|
| 314 |
+
# 模型处理输入数据的方式,interleave 表示图文交错式,old 表示非交错式
|
| 315 |
+
--generate_method
|
| 316 |
+
# 推理时的批处理规模,建议推理时设置为 1
|
| 317 |
+
--batchsize
|
| 318 |
+
|
| 319 |
+
# 输出内容保存的路径
|
| 320 |
+
--answer_path
|
| 321 |
+
```
|
| 322 |
+
<br />
|
| 323 |
+
|
| 324 |
+
评测三个任务需要设置的参数如下:
|
| 325 |
+
###### TextVQA
|
| 326 |
+
```bash
|
| 327 |
+
--eval_textVQA
|
| 328 |
+
--textVQA_image_dir ./downloads/TextVQA/train_images
|
| 329 |
+
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
|
| 330 |
+
```
|
| 331 |
+
|
| 332 |
+
###### DocVQA
|
| 333 |
+
```bash
|
| 334 |
+
--eval_docVQA
|
| 335 |
+
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
|
| 336 |
+
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
|
| 337 |
+
```
|
| 338 |
+
|
| 339 |
+
###### DocVQATest
|
| 340 |
+
```bash
|
| 341 |
+
--eval_docVQATest
|
| 342 |
+
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
|
| 343 |
+
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
|
| 344 |
+
```
|
| 345 |
+
<br />
|
| 346 |
+
|
| 347 |
+
对于 DocVQATest 任务,为了将推理结果上传到[官方网站](https://rrc.cvc.uab.es/?ch=17)进行评测,还需要运行 `shell/run_transform.sh` 进行格式转换。其中,`input_file_path` 对应原始输出的 json 的路径,`output_file_path` 为自定义的转换后的 json 的路径:
|
| 348 |
+
```bash
|
| 349 |
+
chmod +x ./shell/run_transform.sh
|
| 350 |
+
./shell/run_transform.sh
|
| 351 |
+
```
|
| 352 |
+
|
| 353 |
+
</details>
|
| 354 |
+
|
| 355 |
+
<br />
|
| 356 |
+
|
| 357 |
+
## MiniCPM-Llama3-V-2_5
|
| 358 |
+
|
| 359 |
+
<details>
|
| 360 |
+
<summary>展开</summary>
|
| 361 |
+
|
| 362 |
+
### opencompass
|
| 363 |
+
首先,进入 `vlmevalkit` 目录下,安装必要的依赖:
|
| 364 |
+
```bash
|
| 365 |
+
cd vlmevalkit
|
| 366 |
+
pip install -r requirements.txt
|
| 367 |
+
```
|
| 368 |
+
<br />
|
| 369 |
+
|
| 370 |
+
然后,运行 `scripts/run_inference.sh`,该脚本依次接收三个输入参数:`MODELNAME`, `DATALIST`, `MODE`。`MODELNAME` 为模型名称,`DATALIST` 为目标数据集,`MODE` 为评测模式。
|
| 371 |
+
```bash
|
| 372 |
+
chmod +x ./scripts/run_inference.sh
|
| 373 |
+
./scripts/run_inference.sh $MODELNAME $DATALIST $MODE
|
| 374 |
+
```
|
| 375 |
+
<br />
|
| 376 |
+
|
| 377 |
+
`MODELNAME` 有三种选择,位于 `vlmeval/config.py` 中:
|
| 378 |
+
```bash
|
| 379 |
+
ungrouped = {
|
| 380 |
+
'MiniCPM-V':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
|
| 381 |
+
'MiniCPM-V-2':partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
|
| 382 |
+
'MiniCPM-Llama3-V-2_5':partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
|
| 383 |
+
}
|
| 384 |
+
```
|
| 385 |
+
<br />
|
| 386 |
+
|
| 387 |
+
可选的所有 `DATALIST` 位于 `vlmeval/utils/dataset_config.py` 中,评测单个数据集时,直接调用数据集名称,不加引号;评测多个数据集时,将不同数据集名称以空格隔开,两端加引号:
|
| 388 |
+
```bash
|
| 389 |
+
$DATALIST="POPE ScienceQA_TEST ChartQA_TEST"
|
| 390 |
+
```
|
| 391 |
+
<br />
|
| 392 |
+
|
| 393 |
+
直接对各 benchmark 进行评分时,设置 `MODE=all`。如果仅需要推理结果,则设置 `MODE=infer`
|
| 394 |
+
为了复现出首页展示的表格中的各项结果(MME 到 RealWorldQA 之间的列),需要按照如下设置运行:
|
| 395 |
+
```bash
|
| 396 |
+
# 一次性运行 7 个数据集
|
| 397 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MME MMBench_TEST_EN MMBench_TEST_CN MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA" all
|
| 398 |
+
|
| 399 |
+
# 以下是单独运行 1 个数据集的指令
|
| 400 |
+
# MME
|
| 401 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MME all
|
| 402 |
+
# MMBench_TEST_EN
|
| 403 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_EN all
|
| 404 |
+
# MMBench_TEST_CN
|
| 405 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMBench_TEST_CN all
|
| 406 |
+
# MMMU_DEV_VAL
|
| 407 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MMMU_DEV_VAL all
|
| 408 |
+
# MathVista_MINI
|
| 409 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 MathVista_MINI all
|
| 410 |
+
# LLaVABench
|
| 411 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 LLaVABench all
|
| 412 |
+
# RealWorldQA
|
| 413 |
+
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 RealWorldQA all
|
| 414 |
+
```
|
| 415 |
+
<br />
|
| 416 |
+
|
| 417 |
+
### vqadataset

First, switch to the `vqaeval` directory, install the necessary dependencies, and create a `downloads` subdirectory to store the downloaded datasets:
```bash
cd vqaeval
pip install -r requirements.txt
mkdir downloads
```
<br />

Then download the datasets from the links below and place them in the specified directories:
###### TextVQA
```bash
cd downloads
mkdir TextVQA && cd TextVQA
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip train_val_images.zip && rm train_val_images.zip
mv train_val_images/train_images . && rm -rf train_val_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
cd ../..
```

###### DocVQA / DocVQATest
```bash
cd downloads
mkdir DocVQA && cd DocVQA && mkdir spdocvqa_images
# Download the Images and Annotations under Task 1 - Single Page Document Visual Question Answering from https://rrc.cvc.uab.es/?ch=17&com=downloads
# Place the downloaded spdocvqa_images.tar.gz and spdocvqa_qas.zip in the DocVQA directory
tar -zxvf spdocvqa_images.tar.gz -C spdocvqa_images && rm spdocvqa_images.tar.gz
unzip spdocvqa_qas.zip && rm spdocvqa_qas.zip
cp spdocvqa_qas/val_v1.0_withQT.json . && cp spdocvqa_qas/test_v1.0.json . && rm -rf spdocvqa_qas
cd ../..
```
<br />

The `downloads` directory should be organized as follows:
```bash
downloads
├── TextVQA
│   ├── train_images
│   │   ├── ...
│   ├── TextVQA_0.5.1_val.json
├── DocVQA
│   ├── spdocvqa_images
│   │   ├── ...
│   ├── val_v1.0_withQT.json
│   ├── test_v1.0.json
```
<br />

After preparing the datasets, edit the arguments in `shell/run_inference.sh` and run inference:

```bash
chmod +x ./shell/run_inference.sh
./shell/run_inference.sh
```
<br />

The arguments that can be passed are defined in `eval_utils/getargs.py`; the main ones are explained below.
For `MiniCPM-Llama3-V-2_5`, `model_name` should be set to `minicpmv`:
```bash
# paths to all the images and questions for TextVQA evaluation
--textVQA_image_dir
--textVQA_ann_path
# paths to all the images and questions for DocVQA evaluation
--docVQA_image_dir
--docVQA_ann_path
# paths to all the images and questions for DocVQATest evaluation
--docVQATest_image_dir
--docVQATest_ann_path

# whether to evaluate a given task; setting eval_all to True evaluates all tasks
--eval_textVQA
--eval_docVQA
--eval_docVQATest
--eval_all

# model name and model path (load the model from the given path)
--model_name
--model_path
# load the model from a checkpoint
--ckpt
# how the model processes the input; interleave means interleaved image-text input, old means non-interleaved
--generate_method
# batch size for inference; 1 is recommended
--batchsize

# path where the outputs are saved
--answer_path
```
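
Putting the pieces together, the argument block for a TextVQA-only run might look like the following sketch. The `python eval.py` entry point and the `--model_path` / `--answer_path` values are assumptions for illustration, not taken from the repository; the flags themselves are the ones documented above:
```bash
# hypothetical entry point and paths; flags as documented in eval_utils/getargs.py
python eval.py \
    --model_name minicpmv \
    --model_path ./MiniCPM-Llama3-V-2_5 \
    --eval_textVQA \
    --textVQA_image_dir ./downloads/TextVQA/train_images \
    --textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json \
    --generate_method interleave \
    --batchsize 1 \
    --answer_path ./answers
```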
<br />

The arguments to set for evaluating the three tasks are as follows:
###### TextVQA
```bash
--eval_textVQA
--textVQA_image_dir ./downloads/TextVQA/train_images
--textVQA_ann_path ./downloads/TextVQA/TextVQA_0.5.1_val.json
```

###### DocVQA
```bash
--eval_docVQA
--docVQA_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQA_ann_path ./downloads/DocVQA/val_v1.0_withQT.json
```

###### DocVQATest
```bash
--eval_docVQATest
--docVQATest_image_dir ./downloads/DocVQA/spdocvqa_images
--docVQATest_ann_path ./downloads/DocVQA/test_v1.0.json
```
<br />

For the DocVQATest task, to upload the inference results to the [official website](https://rrc.cvc.uab.es/?ch=17) for evaluation, you also need to run `shell/run_transform.sh` to convert the output format. Here, `input_file_path` is the path of the original output json, and `output_file_path` is a user-defined path for the converted json:
```bash
chmod +x ./shell/run_transform.sh
./shell/run_transform.sh
```
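
The two paths are set as variables inside `shell/run_transform.sh`; a hypothetical assignment (both file names below are placeholders, not the script's defaults) might read:
```bash
input_file_path=./answers/minicpmv/docVQATest.json              # raw inference output (placeholder name)
output_file_path=./answers/minicpmv/docVQATest_submission.json  # converted file for upload (placeholder name)
```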
</details>
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/.env
ADDED
@@ -0,0 +1,28 @@
# .env file, place it under $VLMEvalKit
# API keys for proprietary VLMs
# QwenVL APIs
DASHSCOPE_API_KEY=
# Gemini w. Google Cloud Backends
GOOGLE_API_KEY=
# OpenAI API
OPENAI_API_KEY=
OPENAI_API_BASE=
# StepAI API
STEPAI_API_KEY=
# REKA API
REKA_API_KEY=
# GLMV API
GLMV_API_KEY=
# CongRong API
CW_API_BASE=
CW_API_KEY=
# SenseChat-V API
SENSECHAT_AK=
SENSECHAT_SK=
# Hunyuan-Vision API
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
# LMDeploy API
LMDEPLOY_API_BASE=
# You can set an evaluation-time proxy; API calls made during the evaluation stage will go through it
EVAL_PROXY=
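
For GPT-based judging (see `scripts/run_inference.sh` below), only the OpenAI entries need values. A hypothetical filled-in fragment — the key is a placeholder, and the URL shown is the default `OFFICIAL` endpoint hard-coded in `vlmeval/api/gpt.py`:
```bash
OPENAI_API_KEY=sk-xxxx   # placeholder, not a real key
OPENAI_API_BASE=https://api.openai.com/v1/chat/completions
```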
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements.txt
ADDED
@@ -0,0 +1,30 @@
decord; platform_machine != 'arm64'
eva-decord; platform_machine == 'arm64'
gradio
huggingface_hub
imageio
matplotlib
numpy
omegaconf
openai
opencv-python>=4.4.0.46
openpyxl
pandas
pillow
portalocker
protobuf
python-dotenv
requests
rich
sentencepiece
setuptools
sty
tabulate
tiktoken
timeout-decorator
torch
tqdm
transformers
typing_extensions
validators
xlsxwriter
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/requirements/docs.txt
ADDED
@@ -0,0 +1,11 @@
docutils==0.18.1
modelindex
myst-parser
-e git+https://github.com/open-compass/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
sphinx==6.1.3
sphinx-copybutton
sphinx-design
sphinx-notfound-page
sphinx-tabs
sphinxcontrib-jquery
tabulate
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/run.py
ADDED
@@ -0,0 +1,424 @@
import torch
import torch.distributed as dist

from vlmeval.config import supported_VLM
from vlmeval.dataset.video_dataset_config import supported_video_datasets
from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
from vlmeval.inference_mt import infer_data_job_mt
from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer


def build_model_from_config(cfg, model_name):
    import vlmeval.api
    import vlmeval.vlm
    config = cp.deepcopy(cfg[model_name])
    if config == {}:
        return supported_VLM[model_name]()
    assert 'class' in config
    cls_name = config.pop('class')
    if hasattr(vlmeval.api, cls_name):
        return getattr(vlmeval.api, cls_name)(**config)
    elif hasattr(vlmeval.vlm, cls_name):
        return getattr(vlmeval.vlm, cls_name)(**config)
    else:
        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.api` or `vlmeval.vlm`')


def build_dataset_from_config(cfg, dataset_name):
    import vlmeval.dataset
    import inspect
    config = cp.deepcopy(cfg[dataset_name])
    if config == {}:
        return supported_video_datasets[dataset_name]()
    assert 'class' in config
    cls_name = config.pop('class')
    if hasattr(vlmeval.dataset, cls_name):
        cls = getattr(vlmeval.dataset, cls_name)
        sig = inspect.signature(cls.__init__)
        valid_params = {k: v for k, v in config.items() if k in sig.parameters}
        if valid_params.get('fps', 0) > 0 and valid_params.get('nframe', 0) > 0:
            raise ValueError('fps and nframe should not be set at the same time')
        if valid_params.get('fps', 0) <= 0 and valid_params.get('nframe', 0) <= 0:
            raise ValueError('at least one of fps and nframe should be set to a valid value')
        return cls(**valid_params)
    else:
        raise ValueError(f'Class {cls_name} is not supported in `vlmeval.dataset`')


def parse_args():
    help_msg = """\
You can launch the evaluation by setting either --data and --model or --config.

--data and --model:
    Each Arg should be a list of strings, specifying the names of datasets and models.
    To find all supported model names, please refer to the `vlmeval/config.py` or check the output of the command \
`vlmutil mlist all` in the terminal (you should first have vlmeval installed).
    To find all supported dataset names, please refer to the `vlmeval/dataset/__init__.py` file. The python script \
to print all supported dataset names is as follows:
    ```python
    from vlmeval.dataset import SUPPORTED_DATASETS
    print(SUPPORTED_DATASETS)
    ```
    or you can check the output of the command `vlmutil dlist all` in the terminal.
    To find all supported video dataset default settings, please refer to the \
`vlmeval/dataset/video_dataset_config.py` file.

--config:
    Launch the evaluation by specifying the path to the config json file. Sample Json Content:
    ```json
    {
        "model": {
            "GPT4o_20240806_T00_HIGH": {
                "class": "GPT4V",
                "model": "gpt-4o-2024-08-06",
                "temperature": 0,
                "img_detail": "high"
            },
            "GPT4o_20240806_T10_Low": {
                "class": "GPT4V",
                "model": "gpt-4o-2024-08-06",
                "temperature": 1.0,
                "img_detail": "low"
            },
            "GPT4o_20241120": {}
        },
        "data": {
            "MME-RealWorld-Lite": {
                "class": "MMERealWorld",
                "dataset": "MME-RealWorld-Lite"
            },
            "MMBench_DEV_EN_V11": {
                "class": "ImageMCQDataset",
                "dataset": "MMBench_DEV_EN_V11"
            },
            "MMBench_Video_8frame_nopack": {},
            "Video-MME_16frame_subs": {
                "class": "VideoMME",
                "dataset": "Video-MME",
                "nframe": 16,
                "use_subtitle": true
            }
        }
    }
    ```
    Currently, only `model` and `data` are supported fields. The content of each field is a dictionary.
    For `model`, the key is the name of the model, and the value is a dictionary containing the following keys:
    - `class`: The class name of the model, which should be a class in `vlmeval.vlm` or `vlmeval.api`.
    - Other keys are specific to the model, please refer to the corresponding class.
    - Tip: The defined model in the `supported_VLM` of `vlmeval/config.py` can be used as a shortcut.
    For `data`, the key is the name of the dataset (should be the same as the `dataset` field in most cases, \
except for video datasets), and the value is a dictionary containing the following keys:
    - `class`: The class name of the dataset, which should be a class in `vlmeval.dataset`.
    - `dataset`: The name of the dataset, which should be a string that is accepted by the `dataset` argument of the \
corresponding class.
    - Other keys are specific to the dataset, please refer to the corresponding class.
    - Tip: The defined dataset in the `supported_video_datasets` of `vlmeval/dataset/video_dataset_config.py` \
can be used as a shortcut.

    The keys in the `model` and `data` fields will be used for naming the prediction files and evaluation results.
    When launching with `--config`, args for API VLMs, such as `--retry`, `--verbose`, will be ignored.
"""
    parser = argparse.ArgumentParser(description=help_msg, formatter_class=argparse.RawTextHelpFormatter)
    # Essential Args, Setting the Names of Datasets and Models
    parser.add_argument('--data', type=str, nargs='+', help='Names of Datasets')
    parser.add_argument('--model', type=str, nargs='+', help='Names of Models')
    parser.add_argument('--config', type=str, help='Path to the Config Json File')
    # Work Dir
    parser.add_argument('--work-dir', type=str, default='./outputs', help='select the output directory')
    # Infer + Eval or Infer Only
    parser.add_argument('--mode', type=str, default='all', choices=['all', 'infer'])
    # API Kwargs, Apply to API VLMs and Judge API LLMs
    parser.add_argument('--api_nproc', type=int, default=4, help='Parallel API calling')
    parser.add_argument('--retry', type=int, default=None, help='retry numbers for API VLMs')
    # Explicitly Set the Judge Model
    parser.add_argument('--judge', type=str, default=None)
    # Logging Utils
    parser.add_argument('--verbose', action='store_true')
    # Configuration for Resume
    # Ignore: will not rerun failed VLM inference
    parser.add_argument('--ignore', action='store_true', help='Ignore failed indices. ')
    # Reuse: will reuse the existing prediction files
    parser.add_argument('--reuse', action='store_true')

    args = parser.parse_args()
    return args


def main():
    logger = get_logger('RUN')
    rank, world_size = get_rank_and_world_size()
    args = parse_args()
    use_config, cfg = False, None
    if args.config is not None:
        assert args.data is None and args.model is None, '--data and --model should not be set when using --config'
        use_config, cfg = True, load(args.config)
        args.model = list(cfg['model'].keys())
        args.data = list(cfg['data'].keys())
    else:
        assert len(args.data), '--data should be a list of data files'

    if rank == 0:
        if not args.reuse:
            logger.warning('--reuse is not set, will not reuse previous (before one day) temporary files')
        else:
            logger.warning('--reuse is set, will reuse the latest prediction & temporary pickle files')

    if 'MMEVAL_ROOT' in os.environ:
        args.work_dir = os.environ['MMEVAL_ROOT']

    if not use_config:
        for k, v in supported_VLM.items():
            if hasattr(v, 'keywords') and 'retry' in v.keywords and args.retry is not None:
                v.keywords['retry'] = args.retry
                supported_VLM[k] = v
            if hasattr(v, 'keywords') and 'verbose' in v.keywords and args.verbose is not None:
                v.keywords['verbose'] = args.verbose
                supported_VLM[k] = v

    if world_size > 1:
        local_rank = os.environ.get('LOCAL_RANK', 0)
        torch.cuda.set_device(int(local_rank))
        dist.init_process_group(
            backend='nccl',
            timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
        )

    for _, model_name in enumerate(args.model):
        model = None
        date, commit_id = timestr('day'), githash(digits=8)
        eval_id = f"T{date}_G{commit_id}"

        pred_root = osp.join(args.work_dir, model_name, eval_id)
        pred_root_meta = osp.join(args.work_dir, model_name)
        os.makedirs(pred_root_meta, exist_ok=True)

        prev_pred_roots = ls(osp.join(args.work_dir, model_name), mode='dir')
        if len(prev_pred_roots) and args.reuse:
            prev_pred_roots.sort()

        if not osp.exists(pred_root):
            os.makedirs(pred_root, exist_ok=True)

        if use_config:
            model = build_model_from_config(cfg['model'], model_name)

        for _, dataset_name in enumerate(args.data):
            try:
                result_file_base = f'{model_name}_{dataset_name}.xlsx'

                if use_config:
                    if world_size > 1:
                        if rank == 0:
                            dataset = build_dataset_from_config(cfg['data'], dataset_name)
                        dist.barrier()
                    dataset = build_dataset_from_config(cfg['data'], dataset_name)
                    if dataset is None:
                        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
                        continue
                else:
                    dataset_kwargs = {}
                    if dataset_name in ['MMLongBench_DOC', 'DUDE', 'DUDE_MINI', 'SLIDEVQA', 'SLIDEVQA_MINI']:
                        dataset_kwargs['model'] = model_name

                    # If distributed, first build the dataset on the main process for doing preparation works
                    if world_size > 1:
                        if rank == 0:
                            dataset = build_dataset(dataset_name, **dataset_kwargs)
                        dist.barrier()

                    dataset = build_dataset(dataset_name, **dataset_kwargs)
                    if dataset is None:
                        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
                        continue

                # Handling Multi-Turn Dataset
                if dataset.TYPE == 'MT':
                    result_file_base = result_file_base.replace('.xlsx', '.tsv')

                result_file = osp.join(pred_root, result_file_base)

                # Reuse the previous prediction file if exists
                if rank == 0 and len(prev_pred_roots):
                    prev_result_file = None
                    prev_pkl_file_list = []
                    for root in prev_pred_roots[::-1]:
                        if osp.exists(osp.join(root, result_file_base)):
                            prev_result_file = osp.join(root, result_file_base)
                            break
                        elif commit_id in root and len(ls(root)) and root != pred_root:
                            temp_files = ls(root, match=[dataset_name, '.pkl'])
                            if len(temp_files):
                                prev_pkl_file_list.extend(temp_files)
                                break
                    if not args.reuse:
                        prev_result_file = None
                        prev_pkl_file_list = []
                    if prev_result_file is not None:
                        logger.warning(
                            f'--reuse is set, will reuse the prediction file {prev_result_file}.')
                        if prev_result_file != result_file:
                            shutil.copy(prev_result_file, result_file)
                    elif len(prev_pkl_file_list):
                        for fname in prev_pkl_file_list:
                            target_path = osp.join(pred_root, osp.basename(fname))
                            if not osp.exists(target_path):
                                shutil.copy(fname, target_path)
                                logger.info(f'--reuse is set, will reuse the prediction pickle file {fname}.')
                            else:
                                logger.warning(f'File already exists: {target_path}')

                if world_size > 1:
                    dist.barrier()

                if model is None:
                    model = model_name  # which is only a name

                # Perform the Inference
                if dataset.MODALITY == 'VIDEO':
                    model = infer_data_job_video(
                        model,
                        work_dir=pred_root,
                        model_name=model_name,
                        dataset=dataset,
                        result_file_name=result_file_base,
                        verbose=args.verbose,
                        api_nproc=args.api_nproc)
                elif dataset.TYPE == 'MT':
                    model = infer_data_job_mt(
                        model,
                        work_dir=pred_root,
                        model_name=model_name,
                        dataset=dataset,
                        verbose=args.verbose,
                        api_nproc=args.api_nproc,
                        ignore_failed=args.ignore)
                else:
                    model = infer_data_job(
                        model,
                        work_dir=pred_root,
                        model_name=model_name,
                        dataset=dataset,
                        verbose=args.verbose,
                        api_nproc=args.api_nproc,
                        ignore_failed=args.ignore)

                # Set the judge kwargs first before evaluation or dumping

                judge_kwargs = {
                    'nproc': args.api_nproc,
                    'verbose': args.verbose,
                    'retry': args.retry if args.retry is not None else 3
                }

                if args.retry is not None:
                    judge_kwargs['retry'] = args.retry
                if args.judge is not None:
                    judge_kwargs['model'] = args.judge
                else:
                    if dataset.TYPE in ['MCQ', 'Y/N']:
                        judge_kwargs['model'] = 'chatgpt-0125'
                    elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
                        judge_kwargs['model'] = 'gpt-4-turbo'
                    elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'DynaMath', 'VL-RewardBench', 'WeMath', 'LogicVista'], dataset_name):  # noqa: E501
                        judge_kwargs['model'] = 'gpt-4o-mini'
                    elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench', 'WildVision'], dataset_name):  # noqa: E501
                        judge_kwargs['model'] = 'gpt-4o'

                if rank == 0:
                    logger.info(judge_kwargs)

                if world_size > 1:
                    dist.barrier()

                # Only Rank 0 handles the evaluation part
                if rank == 0:
                    # Prepare Submission Files for MMMU_TEST AND MMT-Bench_ALL
                    if dataset_name in ['MMMU_TEST']:
                        result_json = MMMU_result_transfer(result_file)
                        logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
                                    f'json file saved in {result_json}')
                        continue
                    elif 'MMT-Bench_ALL' in dataset_name:
                        submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
                        logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
                                    f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
                                    f'submission file saved in {submission_file}')
                        continue

                    # Skip the evaluation part if only infer
                    if args.mode == 'infer':
                        continue

                    # Skip the evaluation part if the dataset evaluation is not supported or annotations are missing
                    if 'MLLMGuard_DS' in dataset_name:
                        logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')
                        continue
                    elif 'AesBench_TEST' == dataset_name:
                        logger.info(f'The results are saved in {result_file}. '
                                    f'Please send it to the AesBench Team via huangyipo@hotmail.com.')
                        continue
                    elif dataset_name in ['DocVQA_TEST', 'InfoVQA_TEST', 'Q-Bench1_TEST', 'A-Bench_TEST']:
                        logger.info(f'{dataset_name} is a test split without ground-truth. '
                                    'Thus only the inference part is supported for those datasets. ')
                        continue
                    elif dataset_name in [
                        'MMBench_TEST_CN', 'MMBench_TEST_EN', 'MMBench', 'MMBench_CN',
                        'MMBench_TEST_CN_V11', 'MMBench_TEST_EN_V11', 'MMBench_V11', 'MMBench_CN_V11'
                    ] and not MMBenchOfficialServer(dataset_name):
                        logger.error(
                            f'Cannot evaluate {dataset_name} on non-official servers, will skip the evaluation.')
                        continue

                    # Setup the proxy for the evaluation
                    eval_proxy = os.environ.get('EVAL_PROXY', None)
                    old_proxy = os.environ.get('HTTP_PROXY', '')
                    if eval_proxy is not None:
                        proxy_set(eval_proxy)

                    # Perform the Evaluation
                    eval_results = dataset.evaluate(result_file, **judge_kwargs)
                    # Display Evaluation Results in Terminal
                    if eval_results is not None:
                        assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
                        logger.info(f'The evaluation of model {model_name} x dataset {dataset_name} has finished! ')
                        logger.info('Evaluation Results:')
                        if isinstance(eval_results, dict):
                            logger.info('\n' + json.dumps(eval_results, indent=4))
                        elif isinstance(eval_results, pd.DataFrame):
                            if len(eval_results) < len(eval_results.columns):
                                eval_results = eval_results.T
                            logger.info('\n' + tabulate(eval_results))

                    # Restore the proxy
                    if eval_proxy is not None:
                        proxy_set(old_proxy)

                    # Create the symbolic links for the prediction files
                    files = os.listdir(pred_root)
                    files = [x for x in files if (f'{model_name}_{dataset_name}' in x or "status.json" in x)]
                    for f in files:
                        cwd = os.getcwd()
                        file_addr = osp.join(cwd, pred_root, f)
                        link_addr = osp.join(cwd, pred_root_meta, f)
                        if osp.exists(link_addr) or osp.islink(link_addr):
                            os.remove(link_addr)
                        os.symlink(file_addr, link_addr)

            except Exception as e:
                logger.exception(f'Model {model_name} x Dataset {dataset_name} combination failed: {e}, '
                                 'skipping this combination.')
                continue

        if world_size > 1:
            dist.barrier()

    if world_size > 1:
        dist.destroy_process_group()


if __name__ == '__main__':
    load_env()
    main()
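
Typical launches of this script, based on the arguments defined in `parse_args` above (`my_config.json` is a hypothetical file following the sample schema in the help message):
```bash
# explicit model/data pair, inference + evaluation
python run.py --data MMMU_DEV_VAL --model MiniCPM-Llama3-V-2_5 --verbose --reuse
# config-driven launch; per-API args such as --retry are ignored in this mode
python run.py --config my_config.json --mode all
```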
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/scripts/run_inference.sh
ADDED
@@ -0,0 +1,41 @@
export PATH=/usr/local/cuda/bin:$PATH

export HF_ENDPOINT=https://hf-mirror.com
export OMP_NUM_THREADS=1
export timestamp=`date +"%Y%m%d%H%M%S"`
export OLD_VERSION='False'
SELF_DIR=$(cd "$(dirname "$0")" && pwd)  # directory containing this script
export PYTHONPATH=$(dirname $SELF_DIR):$PYTHONPATH
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

# gpu consumed
# fp16 17-18G
# int4 7-8G

# model to be used
# Example: MODELNAME=MiniCPM-o-2_6
MODELNAME=$1
# datasets to be tested
# Example: DATALIST=MMMU_DEV_VAL
DATALIST=$2

# run on multi gpus with torchrun command
# remember to run twice, the first run may fail
for DATASET in $DATALIST; do
    echo "Starting inference with model $MODELNAME on dataset $DATASET"
    torchrun --master_port 29500 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse
    torchrun --master_port 29501 --nproc_per_node=8 run.py --data $DATASET --model $MODELNAME --mode infer --reuse

    # for benchmarks which require gpt for scoring, you need to specify OPENAI_API_BASE and OPENAI_API_KEY in .env file
    if [[ "$DATASET" == *"MMBench_TEST"* ]]; then
        echo "Skipping evaluation for dataset $DATASET"
    else
        echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
        python run.py --data $DATASET --model $MODELNAME --api_nproc 16 --verbose
    fi
done

# run on single gpu with python command
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
# python run.py --data $DATALIST --model $MODELNAME --verbose --mode infer
# echo "Starting evaluation with model $MODELNAME on datasets $DATASET"
# python run.py --data $DATASET --model $MODELNAME --api_nproc 16 --verbose
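
The script takes the model name as `$1` and a space-separated dataset list as `$2`; for example:
```bash
chmod +x ./scripts/run_inference.sh
./scripts/run_inference.sh MiniCPM-Llama3-V-2_5 "MMMU_DEV_VAL MathVista_MINI"
```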
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/setup.py
ADDED
@@ -0,0 +1,122 @@
import re
import sys
from os.path import exists
from setuptools import find_packages, setup


def parse_requirements(fname='requirements.txt', with_version=True):
    """Parse the package dependencies listed in a requirements file but strips
    specific versioning information.

    Args:
        fname (str): path to requirements file
        with_version (bool, default=True): if True include version specs

    Returns:
        List[str]: list of requirements items

    CommandLine:
        python -c "import setup; print(setup.parse_requirements())"
    """

    require_fpath = fname

    def parse_line(line):
        """Parse information from a line in a requirements text file."""
        if line.startswith('-r '):
            # Allow specifying requirements in other files
            target = line.split(' ')[1]
            for info in parse_require_file(target):
                yield info
        else:
            info = {'line': line}
            if line.startswith('-e '):
                info['package'] = line.split('#egg=')[1]
            elif '@git+' in line:
                info['package'] = line
            else:
                # Remove versioning from the package
                pat = '(' + '|'.join(['>=', '==', '>']) + ')'
                parts = re.split(pat, line, maxsplit=1)
                parts = [p.strip() for p in parts]

                info['package'] = parts[0]
                if len(parts) > 1:
                    op, rest = parts[1:]
                    if ';' in rest:
                        # Handle platform specific dependencies
                        # http://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-platform-specific-dependencies
                        version, platform_deps = map(str.strip,
                                                     rest.split(';'))
                        info['platform_deps'] = platform_deps
                    else:
                        version = rest  # NOQA
                    info['version'] = (op, version)
            yield info

    def parse_require_file(fpath):
        with open(fpath, 'r') as f:
            for line in f.readlines():
                line = line.strip()
                if line and not line.startswith('#'):
                    for info in parse_line(line):
                        yield info

    def gen_packages_items():
        if exists(require_fpath):
            for info in parse_require_file(require_fpath):
                parts = [info['package']]
                if with_version and 'version' in info:
                    parts.extend(info['version'])
                if not sys.version.startswith('3.4'):
                    # apparently package_deps are broken in 3.4
                    platform_deps = info.get('platform_deps')
                    if platform_deps is not None:
                        parts.append(';' + platform_deps)
                item = ''.join(parts)
                yield item

    packages = list(gen_packages_items())
    return packages


with open('README.md') as f:
    readme = f.read()


def do_setup():
    setup(
        name='vlmeval',
        version='0.1.0',
        description='OpenCompass VLM Evaluation Kit',
        author='Haodong Duan',
        author_email='dhd.efz@gmail.com',
        maintainer='Haodong Duan',
        maintainer_email='dhd.efz@gmail.com',
        long_description=readme,
        long_description_content_type='text/markdown',
        cmdclass={},
        install_requires=parse_requirements('requirements.txt'),
        setup_requires=[],
        python_requires='>=3.7.0',
        packages=find_packages(exclude=[
            'test*',
            'paper_test*',
        ]),
        keywords=['AI', 'NLP', 'in-context learning'],
        entry_points={
            'console_scripts': ['vlmutil = vlmeval:cli']
        },
        classifiers=[
            'Programming Language :: Python :: 3.7',
            'Programming Language :: Python :: 3.8',
            'Programming Language :: Python :: 3.9',
            'Programming Language :: Python :: 3.10',
            'Intended Audience :: Developers',
            'Intended Audience :: Education',
            'Intended Audience :: Science/Research',
        ])


if __name__ == '__main__':
    do_setup()
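
An editable install from the `vlmevalkit` directory exposes the `vlmutil` console script declared in `entry_points`; the `mlist`/`dlist` subcommands are the ones referenced in the `run.py` help text above:
```bash
pip install -e .
vlmutil mlist all   # list supported models
vlmutil dlist all   # list supported datasets
```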
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/__init__.py
ADDED
@@ -0,0 +1,16 @@
try:
    import torch
except ImportError:
    pass

from .smp import *
from .api import *
from .dataset import *
from .utils import *
from .vlm import *
from .config import *
from .tools import cli

load_env()

__version__ = '0.2rc1'
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/__init__.py
ADDED
@@ -0,0 +1,5 @@
from .gpt import OpenAIWrapper, GPT4V

__all__ = [
    'OpenAIWrapper', 'GPT4V',
]
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/base.py
ADDED
@@ -0,0 +1,289 @@
import time
import random as rd
from abc import abstractmethod
import os.path as osp
import copy as cp
from ..smp import get_logger, parse_file, concat_images_vlmeval, LMUDataRoot, md5, decode_base64_to_image_file


class BaseAPI:

    allowed_types = ['text', 'image']
    INTERLEAVE = True
    INSTALL_REQ = False

    def __init__(self,
                 retry=10,
                 wait=3,
                 system_prompt=None,
                 verbose=True,
                 fail_msg='Failed to obtain answer via API.',
                 **kwargs):
        """Base Class for all APIs.

        Args:
            retry (int, optional): The retry times for `generate_inner`. Defaults to 10.
            wait (int, optional): The wait time after each failed retry of `generate_inner`. Defaults to 3.
            system_prompt (str, optional): Defaults to None.
            verbose (bool, optional): Defaults to True.
            fail_msg (str, optional): The message to return when failed to obtain answer.
                Defaults to 'Failed to obtain answer via API.'.
            **kwargs: Other kwargs for `generate_inner`.
        """

        self.wait = wait
        self.retry = retry
        self.system_prompt = system_prompt
        self.verbose = verbose
        self.fail_msg = fail_msg
        self.logger = get_logger('ChatAPI')

        if len(kwargs):
            self.logger.info(f'BaseAPI received the following kwargs: {kwargs}')
            self.logger.info('Will try to use them as kwargs for `generate`. ')
        self.default_kwargs = kwargs

    @abstractmethod
    def generate_inner(self, inputs, **kwargs):
        """The inner function to generate the answer.

        Returns:
            tuple(int, str, str): ret_code, response, log
        """
        self.logger.warning('For APIBase, generate_inner is an abstract method. ')
        assert 0, 'generate_inner not defined'
        ret_code, answer, log = None, None, None
        # if ret_code is 0, means succeed
        return ret_code, answer, log

    def working(self):
        """If the API model is working, return True, else return False.

        Returns:
            bool: If the API model is working, return True, else return False.
        """
        self.old_timeout = None
        if hasattr(self, 'timeout'):
            self.old_timeout = self.timeout
            self.timeout = 120

        retry = 5
        while retry > 0:
            ret = self.generate('hello')
            if ret is not None and ret != '' and self.fail_msg not in ret:
                if self.old_timeout is not None:
                    self.timeout = self.old_timeout
                return True
            retry -= 1

        if self.old_timeout is not None:
            self.timeout = self.old_timeout
        return False

    def check_content(self, msgs):
        """Check the content type of the input. Four types are allowed: str, dict, liststr, listdict.

        Args:
            msgs: Raw input messages.

        Returns:
            str: The message type.
        """
        if isinstance(msgs, str):
            return 'str'
        if isinstance(msgs, dict):
            return 'dict'
        if isinstance(msgs, list):
            types = [self.check_content(m) for m in msgs]
            if all(t == 'str' for t in types):
                return 'liststr'
            if all(t == 'dict' for t in types):
                return 'listdict'
        return 'unknown'

    def preproc_content(self, inputs):
        """Convert the raw input messages to a list of dicts.

        Args:
            inputs: raw input messages.

        Returns:
            list(dict): The preprocessed input messages. Will return None if failed to preprocess the input.
        """
        if self.check_content(inputs) == 'str':
            return [dict(type='text', value=inputs)]
        elif self.check_content(inputs) == 'dict':
            assert 'type' in inputs and 'value' in inputs
            return [inputs]
        elif self.check_content(inputs) == 'liststr':
            res = []
            for s in inputs:
                mime, pth = parse_file(s)
                if mime is None or mime == 'unknown':
                    res.append(dict(type='text', value=s))
                else:
                    res.append(dict(type=mime.split('/')[0], value=pth))
            return res
        elif self.check_content(inputs) == 'listdict':
            for item in inputs:
                assert 'type' in item and 'value' in item
                mime, s = parse_file(item['value'])
                if mime is None:
                    assert item['type'] == 'text', item['value']
                else:
                    assert mime.split('/')[0] == item['type']
                    item['value'] = s
            return inputs
        else:
            return None

    # May exceed the context windows size, so try with different turn numbers.
    def chat_inner(self, inputs, **kwargs):
        _ = kwargs.pop('dataset', None)
        while len(inputs):
            try:
                return self.generate_inner(inputs, **kwargs)
            except Exception as e:
                if self.verbose:
                    self.logger.info(f'{type(e)}: {e}')
                inputs = inputs[1:]
                while len(inputs) and inputs[0]['role'] != 'user':
                    inputs = inputs[1:]
                continue
        return -1, self.fail_msg + ': ' + 'Failed with all possible conversation turns.', None

    def chat(self, messages, **kwargs1):
        """The main function for multi-turn chatting. Will call `chat_inner` with the preprocessed input messages."""
        assert hasattr(self, 'chat_inner'), 'The API model should have the `chat_inner` method. '
        for msg in messages:
            assert isinstance(msg, dict) and 'role' in msg and 'content' in msg, msg
            assert self.check_content(msg['content']) in ['str', 'dict', 'liststr', 'listdict'], msg
            msg['content'] = self.preproc_content(msg['content'])
        # merge kwargs
        kwargs = cp.deepcopy(self.default_kwargs)
        kwargs.update(kwargs1)

        answer = None
        # a very small random delay [0s - 0.5s]
        T = rd.random() * 0.5
        time.sleep(T)

        assert messages[-1]['role'] == 'user'

        for i in range(self.retry):
            try:
                ret_code, answer, log = self.chat_inner(messages, **kwargs)
                if ret_code == 0 and self.fail_msg not in answer and answer != '':
                    if self.verbose:
                        print(answer)
                    return answer
                elif self.verbose:
                    if not isinstance(log, str):
                        try:
                            log = log.text
                        except Exception as e:
                            self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
                    self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
            except Exception as err:
                if self.verbose:
                    self.logger.error(f'An error occurred during try {i}: ')
                    self.logger.error(f'{type(err)}: {err}')
            # delay before each retry
            T = rd.random() * self.wait * 2
            time.sleep(T)

        return self.fail_msg if answer in ['', None] else answer

    def preprocess_message_with_role(self, message):
        system_prompt = ''
        new_message = []

        for data in message:
            assert isinstance(data, dict)
            role = data.pop('role', 'user')
            if role == 'system':
                system_prompt += data['value'] + '\n'
            else:
                new_message.append(data)

        if system_prompt != '':
            if self.system_prompt is None:
                self.system_prompt = system_prompt
            else:
                self.system_prompt += '\n' + system_prompt
        return new_message

    def generate(self, message, **kwargs1):
        """The main function to generate the answer. Will call `generate_inner` with the preprocessed input messages.

        Args:
            message: raw input messages.

        Returns:
            str: The generated answer, or the fail message if no answer could be obtained.
        """
        if self.check_content(message) == 'listdict':
            message = self.preprocess_message_with_role(message)

        assert self.check_content(message) in ['str', 'dict', 'liststr', 'listdict'], f'Invalid input type: {message}'
        message = self.preproc_content(message)
        assert message is not None and self.check_content(message) == 'listdict'
        for item in message:
            assert item['type'] in self.allowed_types, f'Invalid input type: {item["type"]}'

        # merge kwargs
        kwargs = cp.deepcopy(self.default_kwargs)
        kwargs.update(kwargs1)

        answer = None
        # a very small random delay [0s - 0.5s]
        T = rd.random() * 0.5
        time.sleep(T)

        for i in range(self.retry):
            try:
                ret_code, answer, log = self.generate_inner(message, **kwargs)
                if ret_code == 0 and self.fail_msg not in answer and answer != '':
                    if self.verbose:
                        print(answer)
                    return answer
                elif self.verbose:
                    if not isinstance(log, str):
                        try:
                            log = log.text
                        except Exception as e:
                            self.logger.warning(f'Failed to parse {log} as an http response: {str(e)}. ')
                    self.logger.info(f'RetCode: {ret_code}\nAnswer: {answer}\nLog: {log}')
            except Exception as err:
                if self.verbose:
                    self.logger.error(f'An error occurred during try {i}: ')
                    self.logger.error(f'{type(err)}: {err}')
            # delay before each retry
            T = rd.random() * self.wait * 2
            time.sleep(T)

        return self.fail_msg if answer in ['', None] else answer

    def message_to_promptimg(self, message, dataset=None):
        assert not self.INTERLEAVE
        model_name = self.__class__.__name__
        import warnings
        warnings.warn(
            f'Model {model_name} does not support interleaved input. '
            'Will use the first image and aggregated texts as prompt. ')
        num_images = len([x for x in message if x['type'] == 'image'])
        if num_images == 0:
            prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
            image = None
        elif num_images == 1:
            prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])
            image = [x['value'] for x in message if x['type'] == 'image'][0]
        else:
            prompt = '\n'.join([x['value'] if x['type'] == 'text' else '<image>' for x in message])
            if dataset == 'BLINK':
                image = concat_images_vlmeval(
                    [x['value'] for x in message if x['type'] == 'image'],
                    target_size=512)
            else:
                image = [x['value'] for x in message if x['type'] == 'image'][0]
        return prompt, image
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/api/gpt.py
ADDED
@@ -0,0 +1,267 @@
from ..smp import *
import os
import sys
from .base import BaseAPI

APIBASES = {
    'OFFICIAL': 'https://api.openai.com/v1/chat/completions',
}


def GPT_context_window(model):
    length_map = {
        'gpt-4': 8192,
        'gpt-4-0613': 8192,
        'gpt-4-turbo-preview': 128000,
        'gpt-4-1106-preview': 128000,
        'gpt-4-0125-preview': 128000,
        'gpt-4-vision-preview': 128000,
        'gpt-4-turbo': 128000,
        'gpt-4-turbo-2024-04-09': 128000,
        'gpt-3.5-turbo': 16385,
        'gpt-3.5-turbo-0125': 16385,
        'gpt-3.5-turbo-1106': 16385,
        'gpt-3.5-turbo-instruct': 4096,
    }
    if model in length_map:
        return length_map[model]
    else:
        return 128000


class OpenAIWrapper(BaseAPI):

    is_api: bool = True

    def __init__(self,
                 model: str = 'gpt-3.5-turbo-0613',
                 retry: int = 5,
                 wait: int = 5,
                 key: str = None,
                 verbose: bool = False,
                 system_prompt: str = None,
                 temperature: float = 0,
                 timeout: int = 60,
                 api_base: str = None,
                 max_tokens: int = 1024,
                 img_size: int = 512,
                 img_detail: str = 'low',
                 use_azure: bool = False,
                 **kwargs):

        self.model = model
        self.cur_idx = 0
        self.fail_msg = 'Failed to obtain answer via API. '
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.use_azure = use_azure

        if 'step' in model:
            env_key = os.environ.get('STEPAI_API_KEY', '')
            if key is None:
                key = env_key
        elif 'yi-vision' in model:
            env_key = os.environ.get('YI_API_KEY', '')
            if key is None:
                key = env_key
        elif 'internvl2-pro' in model:
            env_key = os.environ.get('InternVL2_PRO_KEY', '')
            if key is None:
                key = env_key
        elif 'abab' in model:
            env_key = os.environ.get('MiniMax_API_KEY', '')
            if key is None:
                key = env_key
        else:
            if use_azure:
                env_key = os.environ.get('AZURE_OPENAI_API_KEY', None)
                assert env_key is not None, 'Please set the environment variable AZURE_OPENAI_API_KEY. '

                if key is None:
                    key = env_key
                assert isinstance(key, str), (
                    'Please set the environment variable AZURE_OPENAI_API_KEY to your openai key. '
                )
            else:
                env_key = os.environ.get('OPENAI_API_KEY', '')
                if key is None:
                    key = env_key
                assert isinstance(key, str) and key.startswith('sk-'), (
                    f'Illegal openai_key {key}. '
                    'Please set the environment variable OPENAI_API_KEY to your openai key. '
                )

        self.key = key
        assert img_size > 0 or img_size == -1
        self.img_size = img_size
        assert img_detail in ['high', 'low']
        self.img_detail = img_detail
        self.timeout = timeout

        super().__init__(wait=wait, retry=retry, system_prompt=system_prompt, verbose=verbose, **kwargs)

        if use_azure:
            api_base_template = (
                '{endpoint}openai/deployments/{deployment_name}/chat/completions?api-version={api_version}'
            )
            endpoint = os.getenv('AZURE_OPENAI_ENDPOINT', None)
            assert endpoint is not None, 'Please set the environment variable AZURE_OPENAI_ENDPOINT. '
            deployment_name = os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME', None)
            assert deployment_name is not None, 'Please set the environment variable AZURE_OPENAI_DEPLOYMENT_NAME. '
            api_version = os.getenv('OPENAI_API_VERSION', None)
            assert api_version is not None, 'Please set the environment variable OPENAI_API_VERSION. '

            self.api_base = api_base_template.format(
                endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
                deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
                api_version=os.getenv('OPENAI_API_VERSION')
            )
        else:
            if api_base is None:
                if 'OPENAI_API_BASE' in os.environ and os.environ['OPENAI_API_BASE'] != '':
                    self.logger.info('Environment variable OPENAI_API_BASE is set. Will use it as api_base. ')
                    api_base = os.environ['OPENAI_API_BASE']
                else:
                    api_base = 'OFFICIAL'

            assert api_base is not None

            if api_base in APIBASES:
                self.api_base = APIBASES[api_base]
            elif api_base.startswith('http'):
                self.api_base = api_base
            else:
                self.logger.error('Unknown API Base. ')
                raise NotImplementedError

        self.logger.info(f'Using API Base: {self.api_base}; API Key: {self.key}')

    # inputs can be a lvl-2 nested list: [content1, content2, content3, ...]
    # content can be a string or a list of image & text
    def prepare_itlist(self, inputs):
        assert np.all([isinstance(x, dict) for x in inputs])
        has_images = np.sum([x['type'] == 'image' for x in inputs])
        if has_images:
            content_list = []
            for msg in inputs:
                if msg['type'] == 'text':
                    content_list.append(dict(type='text', text=msg['value']))
                elif msg['type'] == 'image':
                    from PIL import Image
                    img = Image.open(msg['value'])
                    b64 = encode_image_to_base64(img, target_size=self.img_size)
                    img_struct = dict(url=f'data:image/jpeg;base64,{b64}', detail=self.img_detail)
                    content_list.append(dict(type='image_url', image_url=img_struct))
        else:
            assert all([x['type'] == 'text' for x in inputs])
            text = '\n'.join([x['value'] for x in inputs])
            content_list = [dict(type='text', text=text)]
        return content_list

    def prepare_inputs(self, inputs):
        input_msgs = []
        if self.system_prompt is not None:
            input_msgs.append(dict(role='system', content=self.system_prompt))
        assert isinstance(inputs, list) and isinstance(inputs[0], dict)
        assert np.all(['type' in x for x in inputs]) or np.all(['role' in x for x in inputs]), inputs
        if 'role' in inputs[0]:
            assert inputs[-1]['role'] == 'user', inputs[-1]
            for item in inputs:
                input_msgs.append(dict(role=item['role'], content=self.prepare_itlist(item['content'])))
        else:
            input_msgs.append(dict(role='user', content=self.prepare_itlist(inputs)))
        return input_msgs

    def generate_inner(self, inputs, **kwargs) -> str:
        input_msgs = self.prepare_inputs(inputs)
        temperature = kwargs.pop('temperature', self.temperature)
        max_tokens = kwargs.pop('max_tokens', self.max_tokens)

        # context_window = GPT_context_window(self.model)
        # new_max_tokens = min(max_tokens, context_window - self.get_token_len(inputs))
        # if 0 < new_max_tokens <= 100 and new_max_tokens < max_tokens:
        #     self.logger.warning(
        #         'Less than 100 tokens left, '
        #         'may exceed the context window with some additional meta symbols. '
        #     )
        # if new_max_tokens <= 0:
        #     return 0, self.fail_msg + 'Input string longer than context window. ', 'Length Exceeded. '
        # max_tokens = new_max_tokens

        # Will send request if use Azure, dk how to use openai client for it
        if self.use_azure:
            headers = {'Content-Type': 'application/json', 'api-key': self.key}
        elif 'internvl2-pro' in self.model:
            headers = {'Content-Type': 'application/json', 'Authorization': self.key}
        else:
            headers = {'Content-Type': 'application/json', 'Authorization': f'Bearer {self.key}'}
        payload = dict(
            model=self.model,
            messages=input_msgs,
            max_tokens=max_tokens,
            n=1,
            temperature=temperature,
            **kwargs)
        response = requests.post(
            self.api_base,
            headers=headers, data=json.dumps(payload), timeout=self.timeout * 1.1)
        ret_code = response.status_code
        ret_code = 0 if (200 <= int(ret_code) < 300) else ret_code
        answer = self.fail_msg
        try:
            resp_struct = json.loads(response.text)
            answer = resp_struct['choices'][0]['message']['content'].strip()
        except Exception as err:
            if self.verbose:
                self.logger.error(f'{type(err)}: {err}')
                self.logger.error(response.text if hasattr(response, 'text') else response)

        return ret_code, answer, response

    def get_image_token_len(self, img_path, detail='low'):
        import math
        if detail == 'low':
            return 85

        im = Image.open(img_path)
        width, height = im.size  # PIL returns (width, height)
        if width > 1024 or height > 1024:
            if width > height:
                height = int(height * 1024 / width)
                width = 1024
            else:
                width = int(width * 1024 / height)
                height = 1024

        h = math.ceil(height / 512)
        w = math.ceil(width / 512)
        total = 85 + 170 * h * w
        return total

    def get_token_len(self, inputs) -> int:
        import tiktoken
        try:
            enc = tiktoken.encoding_for_model(self.model)
        except Exception as err:
            if 'gpt' in self.model.lower():
                if self.verbose:
                    self.logger.warning(f'{type(err)}: {err}')
                enc = tiktoken.encoding_for_model('gpt-4')
            else:
                return 0
        assert isinstance(inputs, list)
        tot = 0
        for item in inputs:
            if 'role' in item:
|
| 256 |
+
tot += self.get_token_len(item['content'])
|
| 257 |
+
elif item['type'] == 'text':
|
| 258 |
+
tot += len(enc.encode(item['value']))
|
| 259 |
+
elif item['type'] == 'image':
|
| 260 |
+
tot += self.get_image_token_len(item['value'], detail=self.img_detail)
|
| 261 |
+
return tot
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
class GPT4V(OpenAIWrapper):
|
| 265 |
+
|
| 266 |
+
def generate(self, message, dataset=None):
|
| 267 |
+
return super(GPT4V, self).generate(message)
|
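For reference, a minimal usage sketch for the wrapper above, not part of the uploaded file: the constructor keywords (`model`, `temperature`, `max_tokens`) are assumptions inferred from the attributes used in `generate_inner`, `demo.jpg` is a hypothetical local image, and `OPENAI_API_KEY` is expected in the environment.

```python
# Hypothetical usage sketch (assumed constructor kwargs; requires OPENAI_API_KEY).
from vlmeval.api.gpt import GPT4V

judge = GPT4V(model='gpt-4o', temperature=0, max_tokens=512)
message = [
    dict(type='image', value='demo.jpg'),             # hypothetical local image path
    dict(type='text', value='Describe this image.'),
]
# generate() (the inherited retry wrapper) calls generate_inner() and
# returns the answer string once a request succeeds.
print(judge.generate(message))
```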
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/config.py
ADDED
@@ -0,0 +1,20 @@
from vlmeval.vlm import *
from vlmeval.api import *
from functools import partial

minicpm_series = {
    'MiniCPM-V': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V'),
    'MiniCPM-V-2': partial(MiniCPM_V, model_path='openbmb/MiniCPM-V-2'),
    'MiniCPM-Llama3-V-2_5': partial(MiniCPM_Llama3_V, model_path='openbmb/MiniCPM-Llama3-V-2_5'),
    'MiniCPM-V-2_6': partial(MiniCPM_V_2_6, model_path='openbmb/MiniCPM-V-2_6'),
    'MiniCPM-o-2_6': partial(MiniCPM_o_2_6, model_path='openbmb/MiniCPM-o-2_6'),
}

supported_VLM = {}

model_groups = [
    minicpm_series
]

for grp in model_groups:
    supported_VLM.update(grp)
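A short sketch of how this registry is consumed elsewhere in the kit (e.g. by the inference entry points below); instantiating an entry builds the model, fetching the checkpoint from the HuggingFace hub if needed.

```python
# Sketch: each registry value is a partial constructor; calling it builds the model.
from vlmeval.config import supported_VLM

print(sorted(supported_VLM))                # all registered model names
model = supported_VLM['MiniCPM-V-2_6']()    # -> MiniCPM_V_2_6(model_path='openbmb/MiniCPM-V-2_6')
```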
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/dataset/__init__.py
ADDED
@@ -0,0 +1,237 @@
import warnings

from .image_base import img_root_map, ImageBaseDataset
from .image_caption import ImageCaptionDataset
from .image_yorn import ImageYORNDataset
from .image_mcq import (
    ImageMCQDataset, MMMUDataset, CustomMCQDataset, MUIRDataset, GMAIMMBenchDataset, MMERealWorld, HRBenchDataset,
    NaturalBenchDataset
)
from .image_mt import MMDUDataset
from .image_vqa import (
    ImageVQADataset, MathVision, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
    CustomVQADataset, CRPE, MathVerse, OlympiadBench, QSpatial, VizWiz, MMNIAH, WeMath, LogicVista
)

from .image_ccocr import CCOCRDataset
from .text_mcq import CustomTextMCQDataset, TextMCQDataset

from .vcr import VCRDataset
from .mmlongbench import MMLongBench
from .dude import DUDE
from .slidevqa import SlideVQA
from .vl_rewardbench import VLRewardBench

from .mmbench_video import MMBenchVideo
from .videomme import VideoMME
from .mvbench import MVBench, MVBench_MP4
from .mlvu import MLVU, MLVU_MCQ, MLVU_OpenEnded
from .tempcompass import TempCompass, TempCompass_Captioning, TempCompass_MCQ, TempCompass_YorN
from .longvideobench import LongVideoBench
from .video_concat_dataset import ConcatVideoDataset
from .mmgenbench import MMGenBench
from .cgbench import CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded

from .miabench import MIABench
from .cmmmu import CMMMU
from .wildvision import WildVision
from .mmmath import MMMath
from .dynamath import Dynamath
from .utils import *
from .video_dataset_config import *
from ..smp import *


class ConcatDataset(ImageBaseDataset):
    # This dataset takes multiple dataset names as input and aggregates them into a single dataset.
    # Each single dataset should not have a field named `SUB_DATASET`

    DATASET_SETS = {
        'MMMB': ['MMMB_ar', 'MMMB_cn', 'MMMB_en', 'MMMB_pt', 'MMMB_ru', 'MMMB_tr'],
        'MTL_MMBench_DEV': [
            'MMBench_dev_ar', 'MMBench_dev_cn', 'MMBench_dev_en',
            'MMBench_dev_pt', 'MMBench_dev_ru', 'MMBench_dev_tr'
        ]
    }

    def __init__(self, dataset):
        datasets = self.DATASET_SETS[dataset]
        self.dataset_map = {}
        # The name of the compilation
        self.dataset_name = dataset
        self.datasets = datasets
        for dname in datasets:
            dataset = build_dataset(dname)
            assert dataset is not None, dataset
            self.dataset_map[dname] = dataset
        TYPES = [x.TYPE for x in self.dataset_map.values()]
        MODALITIES = [x.MODALITY for x in self.dataset_map.values()]
        assert np.all([x == TYPES[0] for x in TYPES]), (datasets, TYPES)
        assert np.all([x == MODALITIES[0] for x in MODALITIES]), (datasets, MODALITIES)
        self.TYPE = TYPES[0]
        self.MODALITY = MODALITIES[0]
        data_all = []
        for dname in datasets:
            data = self.dataset_map[dname].data
            data['SUB_DATASET'] = [dname] * len(data)
            data_new = localize_df(data, dname, nproc=16)
            data_all.append(data_new)

        data = pd.concat(data_all)
        data['original_index'] = data.pop('index')
        data['index'] = np.arange(len(data))
        self.data = data

    def build_prompt(self, line):
        if isinstance(line, int):
            line = self.data.iloc[line]
        idx = line['original_index']
        dname = line['SUB_DATASET']
        org_data = self.dataset_map[dname].data
        org_line = cp.deepcopy(org_data[org_data['index'] == idx]).iloc[0]
        return self.dataset_map[dname].build_prompt(org_line)

    def dump_image(self, line):
        # Assert all images are pre-dumped
        assert 'image' not in line
        assert 'image_path' in line
        tgt_path = toliststr(line['image_path'])
        return tgt_path

    @classmethod
    def supported_datasets(cls):
        return list(cls.DATASET_SETS)

    def evaluate(self, eval_file, **judge_kwargs):
        suffix = eval_file.split('.')[-1]
        # First, split the eval_file by dataset
        data_all = load(eval_file)
        for dname in self.datasets:
            tgt = eval_file.replace(self.dataset_name, dname)
            data_sub = data_all[data_all['SUB_DATASET'] == dname]
            data_sub.pop('index')
            data_sub['index'] = data_sub.pop('original_index')
            data_sub.pop('SUB_DATASET')
            dump(data_sub, tgt)
        # Then, evaluate each dataset separately
        results_all = []
        for dname in self.datasets:
            tgt = eval_file.replace(self.dataset_name, dname)
            res = self.dataset_map[dname].evaluate(tgt, **judge_kwargs)
            assert isinstance(res, pd.DataFrame)
            res['DATASET'] = [dname] * len(res)
            results_all.append(res)
        result = pd.concat(results_all)
        score_file = eval_file.replace(f'.{suffix}', '_acc.csv')
        dump(result, score_file)
        return result


# Add new supported dataset class here
IMAGE_DATASET = [
    ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset, MathVision,
    MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MTVQADataset, TableVQABench,
    MMLongBench, VCRDataset, MMDUDataset, DUDE, SlideVQA, MUIRDataset, CCOCRDataset,
    GMAIMMBenchDataset, MMERealWorld, HRBenchDataset, CRPE, MathVerse, NaturalBenchDataset,
    MIABench, OlympiadBench, WildVision, MMMath, QSpatial, Dynamath, MMGenBench, VizWiz, MMNIAH,
    CMMMU, VLRewardBench, WeMath, LogicVista
]

VIDEO_DATASET = [
    MMBenchVideo, VideoMME, MVBench, MVBench_MP4, LongVideoBench,
    MLVU, MLVU_MCQ, MLVU_OpenEnded,
    TempCompass, TempCompass_MCQ, TempCompass_Captioning, TempCompass_YorN,
    CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
]

TEXT_DATASET = [
    TextMCQDataset
]

CUSTOM_DATASET = [
    CustomMCQDataset, CustomVQADataset, CustomTextMCQDataset
]

DATASET_COLLECTION = [ConcatDataset, ConcatVideoDataset]

DATASET_CLASSES = IMAGE_DATASET + VIDEO_DATASET + TEXT_DATASET + CUSTOM_DATASET + DATASET_COLLECTION
SUPPORTED_DATASETS = []
for DATASET_CLS in DATASET_CLASSES:
    SUPPORTED_DATASETS.extend(DATASET_CLS.supported_datasets())


def DATASET_TYPE(dataset, *, default: str = 'MCQ') -> str:
    for cls in DATASET_CLASSES:
        if dataset in cls.supported_datasets():
            if hasattr(cls, 'TYPE'):
                return cls.TYPE
    # Have to add a specific routine to handle ConcatDataset
    if dataset in ConcatDataset.DATASET_SETS:
        dataset_list = ConcatDataset.DATASET_SETS[dataset]
        TYPES = [DATASET_TYPE(dname) for dname in dataset_list]
        assert np.all([x == TYPES[0] for x in TYPES]), (dataset_list, TYPES)
        return TYPES[0]

    if 'openended' in dataset.lower():
        return 'VQA'
    warnings.warn(f'Dataset {dataset} is a custom one and not annotated as `openended`, will treat as {default}. ')
    return default


def DATASET_MODALITY(dataset, *, default: str = 'IMAGE') -> str:
    if dataset is None:
        warnings.warn(f'Dataset is not specified, will treat modality as {default}. ')
        return default
    for cls in DATASET_CLASSES:
        if dataset in cls.supported_datasets():
            if hasattr(cls, 'MODALITY'):
                return cls.MODALITY
    # Have to add a specific routine to handle ConcatDataset
    if dataset in ConcatDataset.DATASET_SETS:
        dataset_list = ConcatDataset.DATASET_SETS[dataset]
        MODALITIES = [DATASET_MODALITY(dname) for dname in dataset_list]
        assert np.all([x == MODALITIES[0] for x in MODALITIES]), (dataset_list, MODALITIES)
        return MODALITIES[0]

    # Compare against lower-case tokens, since `dataset.lower()` is all lower-case
    if 'video' in dataset.lower():
        return 'VIDEO'
    elif 'image' in dataset.lower():
        return 'IMAGE'
    warnings.warn(f'Dataset {dataset} is a custom one, will treat modality as {default}. ')
    return default


def build_dataset(dataset_name, **kwargs):
    # Video datasets registered in video_dataset_config take precedence
    if dataset_name in supported_video_datasets:
        return supported_video_datasets[dataset_name](**kwargs)
    for cls in DATASET_CLASSES:
        if dataset_name in cls.supported_datasets():
            return cls(dataset=dataset_name, **kwargs)

    warnings.warn(f'Dataset {dataset_name} is not officially supported. ')

    data_file = osp.join(LMUDataRoot(), f'{dataset_name}.tsv')
    if not osp.exists(data_file):
        warnings.warn(f'Data file {data_file} does not exist. Dataset building failed. ')
        return None

    data = load(data_file)
    if 'question' not in [x.lower() for x in data.columns]:
        warnings.warn(f'Data file {data_file} does not have a `question` column. Dataset building failed. ')
        return None

    if 'A' in data and 'B' in data:
        if 'image' in data or 'image_path' in data:
            warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom MCQ dataset. ')
            return CustomMCQDataset(dataset=dataset_name, **kwargs)
        else:
            warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom Text MCQ dataset. ')
            return CustomTextMCQDataset(dataset=dataset_name, **kwargs)
    else:
        warnings.warn(f'Will assume unsupported dataset {dataset_name} as a Custom VQA dataset. ')
        return CustomVQADataset(dataset=dataset_name, **kwargs)


__all__ = [
    'build_dataset', 'img_root_map', 'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
] + [cls.__name__ for cls in DATASET_CLASSES]
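A minimal sketch of the entry points defined above; `'MMStar'` is just an example name (any member of `SUPPORTED_DATASETS` behaves the same way), and the first call will download the dataset TSV into the LMUData root.

```python
# Sketch: resolving a dataset name to a dataset object and inspecting it.
from vlmeval.dataset import build_dataset, DATASET_TYPE, DATASET_MODALITY

dataset = build_dataset('MMStar')                   # example name from SUPPORTED_DATASETS
print(DATASET_TYPE('MMStar'))                       # e.g. 'MCQ'
print(DATASET_MODALITY('MMStar'))                   # e.g. 'IMAGE'
msgs = dataset.build_prompt(dataset.data.iloc[0])   # interleaved image/text message list
```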
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference.py
ADDED
@@ -0,0 +1,188 @@
import torch
import torch.distributed as dist
from vlmeval.config import supported_VLM
from vlmeval.utils import track_progress_rich
from vlmeval.smp import *

FAIL_MSG = 'Failed to obtain answer via API.'


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, nargs='+', required=True)
    parser.add_argument('--model', type=str, nargs='+', required=True)
    parser.add_argument('--nproc', type=int, default=4, required=True)
    parser.add_argument('--verbose', action='store_true')
    args = parser.parse_args()
    return args


# Only API models are accepted
def infer_data_api(model, work_dir, model_name, dataset, index_set=None, api_nproc=4, ignore_failed=False):
    rank, world_size = get_rank_and_world_size()
    assert rank == 0 and world_size == 1
    dataset_name = dataset.dataset_name
    data = dataset.data
    if index_set is not None:
        data = data[data['index'].isin(index_set)]

    model = supported_VLM[model_name]() if isinstance(model, str) else model
    assert getattr(model, 'is_api', False)
    if hasattr(model, 'set_dump_image'):
        model.set_dump_image(dataset.dump_image)

    lt, indices = len(data), list(data['index'])

    structs = []
    for i in range(lt):
        item = data.iloc[i]
        if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
            assert hasattr(model, 'build_prompt')
            struct = model.build_prompt(item, dataset=dataset_name)
        else:
            struct = dataset.build_prompt(item)
        structs.append(struct)

    # structs = [dataset.build_prompt(data.iloc[i]) for i in range(lt)]

    out_file = f'{work_dir}/{model_name}_{dataset_name}_supp.pkl'
    res = {}
    if osp.exists(out_file):
        res = load(out_file)
        if ignore_failed:
            res = {k: v for k, v in res.items() if FAIL_MSG not in v}

    structs = [s for i, s in zip(indices, structs) if i not in res]
    indices = [i for i in indices if i not in res]

    gen_func = model.generate
    structs = [dict(message=struct, dataset=dataset_name) for struct in structs]

    if len(structs):
        track_progress_rich(gen_func, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)

    res = load(out_file)
    if index_set is not None:
        res = {k: v for k, v in res.items() if k in index_set}
    os.remove(out_file)
    return res


def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
    dataset_name = dataset.dataset_name
    prev_file = f'{work_dir}/{model_name}_{dataset_name}_PREV.pkl'
    res = load(prev_file) if osp.exists(prev_file) else {}
    if osp.exists(out_file):
        res.update(load(out_file))

    rank, world_size = get_rank_and_world_size()
    sheet_indices = list(range(rank, len(dataset), world_size))
    lt = len(sheet_indices)
    data = dataset.data.iloc[sheet_indices]
    data_indices = [i for i in data['index']]

    # If finished, will exit without building the model
    all_finished = True
    for i in range(lt):
        idx = data.iloc[i]['index']
        if idx not in res:
            all_finished = False
    if all_finished:
        res = {k: res[k] for k in data_indices}
        dump(res, out_file)
        return

    # Data that still needs to be inferred
    data = data[~data['index'].isin(res)]
    lt = len(data)

    model = supported_VLM[model_name]() if isinstance(model, str) else model

    is_api = getattr(model, 'is_api', False)
    if is_api:
        lt, indices = len(data), list(data['index'])
        supp = infer_data_api(
            model=model,
            work_dir=work_dir,
            model_name=model_name,
            dataset=dataset,
            index_set=set(indices),
            api_nproc=api_nproc)
        for idx in indices:
            assert idx in supp
        res.update(supp)
        res = {k: res[k] for k in data_indices}
        dump(res, out_file)
        return model
    else:
        model.set_dump_image(dataset.dump_image)

    for i in tqdm(range(lt)):
        idx = data.iloc[i]['index']
        if idx in res:
            continue

        if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
            struct = model.build_prompt(data.iloc[i], dataset=dataset_name)
        else:
            struct = dataset.build_prompt(data.iloc[i])

        response = model.generate(message=struct, dataset=dataset_name)
        torch.cuda.empty_cache()

        if verbose:
            print(response, flush=True)

        res[idx] = response
        if (i + 1) % 10 == 0:
            dump(res, out_file)

    res = {k: res[k] for k in data_indices}
    dump(res, out_file)
    return model


# A wrapper for infer_data that does the pre- & post-processing
def infer_data_job(model, work_dir, model_name, dataset, verbose=False, api_nproc=4, ignore_failed=False):
    rank, world_size = get_rank_and_world_size()
    dataset_name = dataset.dataset_name
    result_file = osp.join(work_dir, f'{model_name}_{dataset_name}.xlsx')

    prev_file = f'{work_dir}/{model_name}_{dataset_name}_PREV.pkl'
    if osp.exists(result_file):
        if rank == 0:
            data = load(result_file)
            results = {k: v for k, v in zip(data['index'], data['prediction'])}
            if not ignore_failed:
                results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
            dump(results, prev_file)
        if world_size > 1:
            dist.barrier()

    tmpl = osp.join(work_dir, '{}' + f'{world_size}_{dataset_name}.pkl')
    out_file = tmpl.format(rank)

    model = infer_data(
        model=model, work_dir=work_dir, model_name=model_name, dataset=dataset,
        out_file=out_file, verbose=verbose, api_nproc=api_nproc)
    if world_size > 1:
        dist.barrier()

    if rank == 0:
        data_all = {}
        for i in range(world_size):
            data_all.update(load(tmpl.format(i)))

        data = dataset.data
        for x in data['index']:
            assert x in data_all
        data['prediction'] = [str(data_all[x]) for x in data['index']]
        if 'image' in data:
            data.pop('image')

        dump(data, result_file)
        for i in range(world_size):
            os.remove(tmpl.format(i))
    if world_size > 1:
        dist.barrier()
    return model
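A small self-contained sketch of the sharding scheme used above: with `world_size` ranks launched via torchrun, rank `r` handles every `world_size`-th row of the dataset sheet and writes its partial results to `'{rank}{world_size}_{dataset}.pkl'`, which rank 0 later merges.

```python
# Sketch: round-robin row sharding, as in infer_data above.
world_size, n_rows = 4, 10
for rank in range(world_size):
    sheet_indices = list(range(rank, n_rows, world_size))
    print(rank, sheet_indices)   # rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], ...
```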
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_mt.py
ADDED
@@ -0,0 +1,182 @@
import torch
import torch.distributed as dist
from vlmeval.config import supported_VLM
from vlmeval.utils import track_progress_rich
from vlmeval.smp import *

FAIL_MSG = 'Failed to obtain answer via API.'


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, nargs='+', required=True)
    parser.add_argument('--model', type=str, nargs='+', required=True)
    parser.add_argument('--nproc', type=int, default=4, required=True)
    parser.add_argument('--verbose', action='store_true')
    args = parser.parse_args()
    return args


def chat_mt(model, messages, dataset_name):
    assert len(messages) % 2 == 0
    nturn = len(messages) // 2
    utter_stack = []
    predictions = []

    for i in range(nturn):
        utter = messages[2 * i]
        utter_stack.append(utter)
        try:
            resp = model.chat(utter_stack, dataset=dataset_name)
            utter_stack.append(dict(role='assistant', content=resp))
        except Exception as e:
            resp = FAIL_MSG + str(e)
            utter_stack.append(dict(role='assistant', content=resp))
        predictions.append(resp)
    return predictions


# Only API models are accepted
def infer_data_api(model, work_dir, model_name, dataset, index_set=None, api_nproc=4, ignore_failed=False):
    rank, world_size = get_rank_and_world_size()
    assert rank == 0 and world_size == 1
    dataset_name = dataset.dataset_name
    data = dataset.data
    if index_set is not None:
        data = data[data['index'].isin(index_set)]

    model = supported_VLM[model_name]() if isinstance(model, str) else model
    assert getattr(model, 'is_api', False)
    assert hasattr(model, 'chat_inner')

    lt, indices = len(data), list(data['index'])
    structs = [dataset.build_prompt(data.iloc[i]) for i in range(lt)]

    out_file = f'{work_dir}/{model_name}_{dataset_name}_supp.pkl'
    res = {}
    if osp.exists(out_file):
        res = load(out_file)
        if ignore_failed:
            res = {k: v for k, v in res.items() if FAIL_MSG not in v}

    structs = [s for i, s in zip(indices, structs) if i not in res]
    indices = [i for i in indices if i not in res]

    structs = [dict(model=model, messages=struct, dataset_name=dataset_name) for struct in structs]

    if len(structs):
        track_progress_rich(chat_mt, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)

    res = load(out_file)
    if index_set is not None:
        res = {k: v for k, v in res.items() if k in index_set}
    os.remove(out_file)
    return res


def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
    dataset_name = dataset.dataset_name
    res = {}
    if osp.exists(out_file):
        res.update(load(out_file))

    rank, world_size = get_rank_and_world_size()
    sheet_indices = list(range(rank, len(dataset), world_size))
    lt = len(sheet_indices)
    data = dataset.data.iloc[sheet_indices]
    data_indices = [i for i in data['index']]

    # If finished, will exit without building the model
    all_finished = True
    for i in range(lt):
        idx = data.iloc[i]['index']
        if idx not in res:
            all_finished = False
    if all_finished:
        res = {k: res[k] for k in data_indices}
        dump(res, out_file)
        return

    # Data that still needs to be inferred
    data = data[~data['index'].isin(res)]
    lt = len(data)

    model = supported_VLM[model_name]() if isinstance(model, str) else model
    assert hasattr(model, 'chat_inner')

    is_api = getattr(model, 'is_api', False)
    if is_api:
        lt, indices = len(data), list(data['index'])
        supp = infer_data_api(
            model=model,
            work_dir=work_dir,
            model_name=model_name,
            dataset=dataset,
            index_set=set(indices),
            api_nproc=api_nproc)
        for idx in indices:
            assert idx in supp
        res.update(supp)
        res = {k: res[k] for k in data_indices}
        dump(res, out_file)
        return model
    else:
        model.set_dump_image(dataset.dump_image)

    for i in tqdm(range(lt)):
        idx = data.iloc[i]['index']
        if idx in res:
            continue

        if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
            struct = model.build_prompt(data.iloc[i], dataset=dataset_name)
        else:
            struct = dataset.build_prompt(data.iloc[i])

        response = chat_mt(model, struct, dataset_name)
        torch.cuda.empty_cache()

        if verbose:
            print(response, flush=True)

        res[idx] = response
        if (i + 1) % 20 == 0:
            dump(res, out_file)

    res = {k: res[k] for k in data_indices}
    dump(res, out_file)
    return model


# A wrapper for infer_data that does the pre- & post-processing
def infer_data_job_mt(model, work_dir, model_name, dataset, verbose=False, api_nproc=4, ignore_failed=False):
    rank, world_size = get_rank_and_world_size()
    dataset_name = dataset.dataset_name
    result_file = osp.join(work_dir, f'{model_name}_{dataset_name}.tsv')

    tmpl = osp.join(work_dir, '{}' + f'{world_size}_{dataset_name}.pkl')
    out_file = tmpl.format(rank)

    model = infer_data(
        model=model, model_name=model_name, work_dir=work_dir, dataset=dataset,
        out_file=out_file, verbose=verbose, api_nproc=api_nproc)
    if world_size > 1:
        dist.barrier()

    if rank == 0:
        data_all = {}
        for i in range(world_size):
            data_all.update(load(tmpl.format(i)))

        data = dataset.data
        for x in data['index']:
            assert x in data_all

        data['prediction'] = [data_all[x] for x in data['index']]
        if 'image' in data:
            data.pop('image')

        dump(data, result_file)
        for i in range(world_size):
            os.remove(tmpl.format(i))
    return model
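For reference, a sketch (not part of the upload) of the multi-turn structure `chat_mt` expects: an even-length list in which the even positions are user turns; the reference assistant turns at odd positions are skipped, and the model's own replies are pushed onto `utter_stack` in their place.

```python
# Sketch: message layout consumed by chat_mt (even length, user turns at even indices).
messages = [
    dict(role='user', content='What is shown in the image?'),
    dict(role='assistant', content='<reference answer, ignored by chat_mt>'),
    dict(role='user', content='How many objects are there?'),
    dict(role='assistant', content='<reference answer, ignored by chat_mt>'),
]
assert len(messages) % 2 == 0
print(len(messages) // 2, 'turns')   # 2 turns -> 2 model predictions
```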
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/inference_video.py
ADDED
@@ -0,0 +1,183 @@
import torch
import torch.distributed as dist
from vlmeval.config import supported_VLM
from vlmeval.utils import track_progress_rich
from vlmeval.smp import *

FAIL_MSG = 'Failed to obtain answer via API.'


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', type=str, nargs='+', required=True)
    parser.add_argument('--model', type=str, nargs='+', required=True)
    parser.add_argument('--nproc', type=int, default=4, required=True)
    parser.add_argument('--verbose', action='store_true')
    args = parser.parse_args()
    return args


# Only API models are accepted
def infer_data_api(model, work_dir, model_name, dataset, samples_dict={}, api_nproc=4):
    rank, world_size = get_rank_and_world_size()
    assert rank == 0 and world_size == 1
    dataset_name = dataset.dataset_name
    model = supported_VLM[model_name]() if isinstance(model, str) else model
    assert getattr(model, 'is_api', False)

    indices = list(samples_dict.keys())
    structs = [dataset.build_prompt(samples_dict[idx], video_llm=getattr(model, 'VIDEO_LLM', False)) for idx in indices]

    packstr = 'pack' if getattr(dataset, 'pack', False) else 'nopack'
    if dataset.nframe > 0:
        out_file = f'{work_dir}/{model_name}_{dataset_name}_{dataset.nframe}frame_{packstr}_supp.pkl'
    else:
        out_file = f'{work_dir}/{model_name}_{dataset_name}_{dataset.fps}fps_{packstr}_supp.pkl'
    res = load(out_file) if osp.exists(out_file) else {}

    structs = [s for i, s in zip(indices, structs) if i not in res or res[i] == FAIL_MSG]
    indices = [i for i in indices if i not in res or res[i] == FAIL_MSG]

    gen_func = model.generate
    structs = [dict(message=struct, dataset=dataset_name) for struct in structs]

    if len(structs):
        track_progress_rich(gen_func, structs, nproc=api_nproc, chunksize=api_nproc, save=out_file, keys=indices)

    res = load(out_file)
    return res


def infer_data(model, model_name, work_dir, dataset, out_file, verbose=False, api_nproc=4):
    res = load(out_file) if osp.exists(out_file) else {}
    rank, world_size = get_rank_and_world_size()
    dataset_name = dataset.dataset_name

    sample_indices = list(dataset.videos) if getattr(dataset, 'pack', False) else list(dataset.data['index'])
    samples = list(dataset.videos) if getattr(dataset, 'pack', False) else list(range(len(dataset.data)))
    sample_map = {i: s for i, s in zip(sample_indices, samples)}

    sample_indices_sub = sample_indices[rank::world_size]
    if np.all([idx in res for idx in sample_indices_sub]):
        return model
    sample_indices_subrem = [x for x in sample_indices_sub if x not in res]

    model = supported_VLM[model_name]() if isinstance(model, str) else model

    is_api = getattr(model, 'is_api', False)
    if is_api:
        assert world_size == 1
        supp = infer_data_api(
            model=model,
            work_dir=work_dir,
            model_name=model_name,
            dataset=dataset,
            samples_dict={k: sample_map[k] for k in sample_indices_subrem},
            api_nproc=api_nproc)
        for k in sample_indices_subrem:
            assert k in supp
        res.update(supp)
        dump(res, out_file)
        return model

    assert not getattr(dataset, 'pack', False), 'The current model does not support pack mode!'
    for i, idx in tqdm(enumerate(sample_indices_subrem)):
        if idx in res:
            continue
        if getattr(model, 'nframe', None) is not None and getattr(model, 'nframe', 0) > 0:
            if dataset.nframe > 0:
                if getattr(model, 'nframe', 0) != dataset.nframe:
                    print(f'{model_name} is a video-llm model, nframe is set to {dataset.nframe}, not using default')
                    setattr(model, 'nframe', dataset.nframe)
            elif getattr(model, 'fps', 0) == 0:
                raise ValueError(f'fps is not suitable for {model_name}')
            else:
                setattr(model, 'nframe', None)
        if getattr(model, 'fps', None) is not None and getattr(model, 'fps', 0) > 0:
            if dataset.fps > 0:
                if getattr(model, 'fps', 0) != dataset.fps:
                    print(f'{model_name} is a video-llm model, fps is set to {dataset.fps}, not using default')
                    setattr(model, 'fps', dataset.fps)
            elif getattr(model, 'nframe', 0) == 0:
                raise ValueError(f'nframe is not suitable for {model_name}')
            else:
                setattr(model, 'fps', None)
        if 'SUB_DATASET' in dataset.data.iloc[sample_map[idx]]:
            dataset_name = dataset.data.iloc[sample_map[idx]]['SUB_DATASET']
        if hasattr(model, 'use_custom_prompt') and model.use_custom_prompt(dataset_name):
            if dataset.nframe == 0:
                raise ValueError(f'nframe must be set for custom prompt, fps is not suitable for {model_name}')
            struct = model.build_prompt(
                dataset.data.iloc[sample_map[idx]], dataset=dataset, video_llm=getattr(model, 'VIDEO_LLM', False)
            )
        else:
            struct = dataset.build_prompt(
                sample_map[idx], video_llm=getattr(model, 'VIDEO_LLM', False)
            )
        response = model.generate(message=struct, dataset=dataset_name)
        torch.cuda.empty_cache()

        if verbose:
            print(response, flush=True)

        res[idx] = response
        if (i + 1) % 20 == 0:
            dump(res, out_file)

    res = {k: res[k] for k in sample_indices_sub}
    dump(res, out_file)
    return model


# A wrapper for infer_data that does the pre- & post-processing
def infer_data_job_video(
        model,
        work_dir,
        model_name,
        dataset,
        result_file_name,
        verbose=False,
        api_nproc=4):

    dataset_name = dataset.dataset_name
    rank, world_size = get_rank_and_world_size()
    result_file = osp.join(work_dir, result_file_name)
    # Skip inference entirely if the result file already exists
    if osp.exists(result_file):
        return model

    tmpl = osp.join(work_dir, '{}' + f'{world_size}_{osp.splitext(result_file_name)[0]}.pkl')
    out_file = tmpl.format(rank)

    model = infer_data(
        model=model,
        model_name=model_name,
        work_dir=work_dir,
        dataset=dataset,
        out_file=out_file,
        verbose=verbose,
        api_nproc=api_nproc)

    if world_size > 1:
        dist.barrier()

    if rank == 0:
        data_all = {}
        for i in range(world_size):
            data_all.update(load(tmpl.format(i)))

        meta = dataset.data
        if dataset_name == 'MMBench-Video' and getattr(dataset, 'pack', False):
            meta, vstats = dataset.load_pack_answers(data_all)
            print(f'Statistics of Pack Video Inference: {vstats}')
        else:
            for x in meta['index']:
                assert x in data_all
            meta['prediction'] = [str(data_all[x]) for x in meta['index']]
            if 'image' in meta:
                meta.pop('image')

        dump(meta, result_file)
        for i in range(world_size):
            os.remove(tmpl.format(i))
    return model
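A small sketch of the two mutually exclusive sampling controls reconciled above: a video dataset fixes either a frame count (`nframe > 0`) or a sampling rate (`fps > 0`), `infer_data` copies whichever one the dataset defines onto the model, and the same value lands in the `*_supp.pkl` file name.

```python
# Sketch: nframe vs. fps, as used for the supplementary pickle file name.
class DummyVideoDataset:
    nframe, fps = 8, 0   # fixed 8-frame sampling (rate-based sampling would be nframe=0, fps>0)

d = DummyVideoDataset
suffix = f'{d.nframe}frame' if d.nframe > 0 else f'{d.fps}fps'
print(suffix)            # '8frame'
```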
r1-a/response_generation/minicpm/MiniCPM-o/eval_mm/vlmevalkit/vlmeval/tools.py
ADDED
|
@@ -0,0 +1,468 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import sys
|
| 2 |
+
from vlmeval.dataset import SUPPORTED_DATASETS
|
| 3 |
+
from vlmeval.config import *
|
| 4 |
+
from vlmeval.smp import *
|
| 5 |
+
|
| 6 |
+
# Define valid modes
|
| 7 |
+
MODES = ('dlist', 'mlist', 'missing', 'circular', 'localize', 'check', 'run', 'eval', 'merge_pkl')
|
| 8 |
+
|
| 9 |
+
CLI_HELP_MSG = \
|
| 10 |
+
f"""
|
| 11 |
+
Arguments received: {str(['vlmutil'] + sys.argv[1:])}. vlmutil commands use the following syntax:
|
| 12 |
+
|
| 13 |
+
vlmutil MODE MODE_ARGS
|
| 14 |
+
|
| 15 |
+
Where MODE (required) is one of {MODES}
|
| 16 |
+
MODE_ARG (optional) is the argument for specific mode
|
| 17 |
+
|
| 18 |
+
Some usages for xtuner commands: (See more by using -h for specific command!)
|
| 19 |
+
|
| 20 |
+
1. List all the dataset by levels: l1, l2, l3, etc.:
|
| 21 |
+
vlmutil dlist [l1/l2/l3/...]
|
| 22 |
+
2. List all the models by categories: 4.33.0, 4.37.0, api, etc.:
|
| 23 |
+
vlmutil mlist 4.33.0 [all/small/large]
|
| 24 |
+
3. Report missing results:
|
| 25 |
+
vlmutil missing [l1/l2/l3/...]
|
| 26 |
+
4. Create circular questions (only for multiple-choice questions with no more than 4 choices):
|
| 27 |
+
vlmutil circular input.tsv
|
| 28 |
+
5. Create a localized version of the dataset (for very large tsv files):
|
| 29 |
+
vlmutil localize input.tsv
|
| 30 |
+
6. Check the validity of a model:
|
| 31 |
+
vlmutil check [model_name/model_series]
|
| 32 |
+
7. Run evaluation for missing results:
|
| 33 |
+
vlmutil run l2 hf
|
| 34 |
+
8. Evaluate data file:
|
| 35 |
+
vlmutil eval [dataset_name] [prediction_file]
|
| 36 |
+
9. Merge pkl files:
|
| 37 |
+
vlmutil merge_pkl [pkl_dir] [world_size]
|
| 38 |
+
|
| 39 |
+
GitHub: https://github.com/open-compass/VLMEvalKit
|
| 40 |
+
""" # noqa: E501
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
dataset_levels = {
|
| 44 |
+
'l1': [
|
| 45 |
+
('MMVet', 'gpt-4-turbo_score.csv'), ('MMMU_DEV_VAL', 'acc.csv'),
|
| 46 |
+
('MathVista_MINI', 'gpt-4-turbo_score.csv'), ('HallusionBench', 'score.csv'),
|
| 47 |
+
('OCRBench', 'score.json'), ('AI2D_TEST', 'acc.csv'), ('MMStar', 'acc.csv'),
|
| 48 |
+
('MMBench_V11', 'acc.csv'), ('MMBench_CN_V11', 'acc.csv')
|
| 49 |
+
],
|
| 50 |
+
'l2': [
|
| 51 |
+
('MME', 'score.csv'), ('LLaVABench', 'score.csv'), ('RealWorldQA', 'acc.csv'),
|
| 52 |
+
('MMBench', 'acc.csv'), ('MMBench_CN', 'acc.csv'), ('CCBench', 'acc.csv'),
|
| 53 |
+
('SEEDBench_IMG', 'acc.csv'), ('COCO_VAL', 'score.json'), ('POPE', 'score.csv'),
|
| 54 |
+
('ScienceQA_VAL', 'acc.csv'), ('ScienceQA_TEST', 'acc.csv'), ('MMT-Bench_VAL', 'acc.csv'),
|
| 55 |
+
('SEEDBench2_Plus', 'acc.csv'), ('BLINK', 'acc.csv'), ('MTVQA_TEST', 'acc.json'),
|
| 56 |
+
('Q-Bench1_VAL', 'acc.csv'), ('A-Bench_VAL', 'acc.csv'), ('R-Bench-Dis', 'acc.csv'),
|
| 57 |
+
('MathVision', 'score.csv'), ('MathVerse_MINI_Vision_Only', 'score.csv'), ('DynaMath', 'score.csv'),
|
| 58 |
+
],
|
| 59 |
+
'l3': [
|
| 60 |
+
('OCRVQA_TESTCORE', 'acc.csv'), ('TextVQA_VAL', 'acc.csv'),
|
| 61 |
+
('ChartQA_TEST', 'acc.csv'), ('DocVQA_VAL', 'acc.csv'), ('InfoVQA_VAL', 'acc.csv'),
|
| 62 |
+
('SEEDBench2', 'acc.csv')
|
| 63 |
+
]
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
dataset_levels['l12'] = dataset_levels['l1'] + dataset_levels['l2']
|
| 67 |
+
dataset_levels['l23'] = dataset_levels['l2'] + dataset_levels['l3']
|
| 68 |
+
dataset_levels['l123'] = dataset_levels['l12'] + dataset_levels['l3']
|
| 69 |
+
|
| 70 |
+
models = {
|
| 71 |
+
'4.37.0': ['MiniCPM-V', 'MiniCPM-V-2'],
|
| 72 |
+
'4.40.0': ['MiniCPM-Llama3-V-2_5'],
|
| 73 |
+
'latest': ['MiniCPM-V-2_6']
|
| 74 |
+
}
|
| 75 |
+
|
| 76 |
+
# SKIP_MODELS will be skipped in report_missing and run APIs
|
| 77 |
+
SKIP_MODELS = ['MiniCPM-V']
|
| 78 |
+
|
| 79 |
+
def completed(m, d, suf):
|
| 80 |
+
score_file = f'outputs/{m}/{m}_{d}_{suf}'
|
| 81 |
+
if osp.exists(score_file):
|
| 82 |
+
return True
|
| 83 |
+
if d == 'MMBench':
|
| 84 |
+
s1, s2 = f'outputs/{m}/{m}_MMBench_DEV_EN_{suf}', f'outputs/{m}/{m}_MMBench_TEST_EN_{suf}'
|
| 85 |
+
return osp.exists(s1) and osp.exists(s2)
|
| 86 |
+
elif d == 'MMBench_CN':
|
| 87 |
+
s1, s2 = f'outputs/{m}/{m}_MMBench_DEV_CN_{suf}', f'outputs/{m}/{m}_MMBench_TEST_CN_{suf}'
|
| 88 |
+
return osp.exists(s1) and osp.exists(s2)
|
| 89 |
+
return False
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def DLIST(lvl):
|
| 93 |
+
if lvl in dataset_levels.keys():
|
| 94 |
+
return [x[0] for x in dataset_levels[lvl]]
|
| 95 |
+
else:
|
| 96 |
+
from vlmeval.dataset import SUPPORTED_DATASETS
|
| 97 |
+
return SUPPORTED_DATASETS
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def MLIST(lvl, size='all'):
|
| 101 |
+
if lvl == 'all':
|
| 102 |
+
from vlmeval.config import supported_VLM
|
| 103 |
+
return [x for x in supported_VLM]
|
| 104 |
+
|
| 105 |
+
model_list = models[lvl]
|
| 106 |
+
if size == 'small':
|
| 107 |
+
model_list = [m for m in model_list if m not in LARGE_MODELS]
|
| 108 |
+
elif size == 'large':
|
| 109 |
+
model_list = [m for m in model_list if m in LARGE_MODELS]
|
| 110 |
+
return [x[0] for x in model_list]
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def MISSING(lvl):
|
| 114 |
+
from vlmeval.config import supported_VLM
|
| 115 |
+
models = list(supported_VLM)
|
| 116 |
+
models = [m for m in models if m not in SKIP_MODELS and osp.exists(osp.join('outputs', m))]
|
| 117 |
+
if lvl in dataset_levels.keys():
|
| 118 |
+
data_list = dataset_levels[lvl]
|
| 119 |
+
else:
|
| 120 |
+
data_list = [(D, suff) for (D, suff) in dataset_levels['l123'] if D == lvl]
|
| 121 |
+
missing_list = []
|
| 122 |
+
for f in models:
|
| 123 |
+
for D, suff in data_list:
|
| 124 |
+
if not completed(f, D, suff):
|
| 125 |
+
missing_list.append((f, D))
|
| 126 |
+
return missing_list
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
def CIRCULAR(inp):
|
| 130 |
+
assert inp.endswith('.tsv')
|
| 131 |
+
data = load(inp)
|
| 132 |
+
OFFSET = 1e6
|
| 133 |
+
while max(data['index']) >= OFFSET:
|
| 134 |
+
OFFSET *= 10
|
| 135 |
+
|
| 136 |
+
assert 'E' not in data, 'Currently build_circular only works for up to 4-choice questions'
|
| 137 |
+
data_2c = data[pd.isna(data['C'])]
|
| 138 |
+
data_3c = data[~pd.isna(data['C']) & pd.isna(data['D'])]
|
| 139 |
+
data_4c = data[~pd.isna(data['D'])]
|
| 140 |
+
map_2c = [('AB', 'BA')]
|
| 141 |
+
map_3c = [('ABC', 'BCA'), ('ABC', 'CAB')]
|
| 142 |
+
map_4c = [('ABCD', 'BCDA'), ('ABCD', 'CDAB'), ('ABCD', 'DABC')]
|
| 143 |
+
|
| 144 |
+
def okn(o, n=4):
|
| 145 |
+
ostr = o.replace(',', ' ')
|
| 146 |
+
osplits = ostr.split()
|
| 147 |
+
if sum([c in osplits for c in string.ascii_uppercase[:n - 1]]) == n - 1:
|
| 148 |
+
return False
|
| 149 |
+
olower = o.lower()
|
| 150 |
+
olower = olower.replace(',', ' ')
|
| 151 |
+
olower_splits = olower.split()
|
| 152 |
+
if 'all' in olower_splits or 'none' in olower_splits:
|
| 153 |
+
return False
|
| 154 |
+
return True
|
| 155 |
+
|
| 156 |
+
yay4, nay4 = [], []
|
| 157 |
+
lt4 = len(data_4c)
|
| 158 |
+
for i in range(lt4):
|
| 159 |
+
if okn(data_4c.iloc[i]['D'], 4):
|
| 160 |
+
yay4.append(i)
|
| 161 |
+
else:
|
| 162 |
+
nay4.append(i)
|
| 163 |
+
data_4c_y = data_4c.iloc[yay4]
|
| 164 |
+
data_4c_n = data_4c.iloc[nay4]
|
| 165 |
+
data_3c = pd.concat([data_4c_n, data_3c])
|
| 166 |
+
|
| 167 |
+
yay3, nay3 = [], []
|
| 168 |
+
lt3 = len(data_3c)
|
| 169 |
+
for i in range(lt3):
|
| 170 |
+
if okn(data_3c.iloc[i]['C'], 3):
|
| 171 |
+
yay3.append(i)
|
| 172 |
+
else:
|
| 173 |
+
nay3.append(i)
|
| 174 |
+
data_3c_y = data_3c.iloc[yay3]
|
| 175 |
+
data_3c_n = data_3c.iloc[nay3]
|
| 176 |
+
data_2c = pd.concat([data_3c_n, data_2c])
|
| 177 |
+
|
| 178 |
+
def remap(data_in, tup, off):
|
| 179 |
+
off = int(off)
|
| 180 |
+
data = data_in.copy()
|
| 181 |
+
char_map = {k: v for k, v in zip(*tup)}
|
| 182 |
+
idx = data.pop('index')
|
| 183 |
+
answer = data.pop('answer')
|
| 184 |
+
answer_new = [char_map[x] if x in char_map else x for x in answer]
|
| 185 |
+
data['answer'] = answer_new
|
| 186 |
+
options = {}
|
| 187 |
+
for c in char_map:
|
| 188 |
+
options[char_map[c]] = data.pop(c)
|
| 189 |
+
for c in options:
|
| 190 |
+
data[c] = options[c]
|
| 191 |
+
data.pop('image')
|
| 192 |
+
data['image'] = idx
|
| 193 |
+
idx = [x + off for x in idx]
|
| 194 |
+
data['index'] = idx
|
| 195 |
+
return data
|
| 196 |
+
|
| 197 |
+
data_all = pd.concat([
|
| 198 |
+
data_2c,
|
| 199 |
+
data_3c_y,
|
| 200 |
+
data_4c_y,
|
| 201 |
+
remap(data_2c, map_2c[0], OFFSET),
|
| 202 |
+
remap(data_3c_y, map_3c[0], OFFSET),
|
| 203 |
+
remap(data_4c_y, map_4c[0], OFFSET),
|
| 204 |
+
remap(data_3c_y, map_3c[1], OFFSET * 2),
|
| 205 |
+
remap(data_4c_y, map_4c[1], OFFSET * 2),
|
| 206 |
+
remap(data_4c_y, map_4c[2], OFFSET * 3),
|
| 207 |
+
])
|
| 208 |
+
|
| 209 |
+
tgt_file = inp.replace('.tsv', '_CIRC.tsv')
|
| 210 |
+
dump(data_all, tgt_file)
|
| 211 |
+
print(f'The circularized data is saved to {tgt_file}')
|
| 212 |
+
assert osp.exists(tgt_file)
|
| 213 |
+
print(f'The MD5 for the circularized data is {md5(tgt_file)}')
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
PTH = osp.realpath(__file__)
|
| 217 |
+
IMAGE_PTH = osp.join(osp.dirname(PTH), '../assets/apple.jpg')
|
| 218 |
+
|
| 219 |
+
msg1 = [
|
| 220 |
+
IMAGE_PTH,
|
| 221 |
+
'What is in this image?'
|
| 222 |
+
]
|
| 223 |
+
msg2 = [
|
| 224 |
+
dict(type='image', value=IMAGE_PTH),
|
| 225 |
+
dict(type='text', value='What is in this image?')
|
| 226 |
+
]
|
| 227 |
+
msg3 = [
|
| 228 |
+
IMAGE_PTH,
|
| 229 |
+
IMAGE_PTH,
|
| 230 |
+
'How many apples are there in these images?'
|
| 231 |
+
]
|
| 232 |
+
msg4 = [
|
| 233 |
+
dict(type='image', value=IMAGE_PTH),
|
| 234 |
+
dict(type='image', value=IMAGE_PTH),
|
| 235 |
+
dict(type='text', value='How many apples are there in these images?')
|
| 236 |
+
]
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def CHECK(val):
    if val in supported_VLM:
        model = supported_VLM[val]()
        print(f'Model: {val}')
        for i, msg in enumerate([msg1, msg2, msg3, msg4]):
            if i > 1 and not model.INTERLEAVE:
                continue
            res = model.generate(msg)
            print(f'Test {i + 1}: {res}')
    elif val in models:
        model_list = models[val]
        for m in model_list:
            CHECK(m)

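# Example usage (hypothetical model / group names; they must be registered in
# `supported_VLM` or `models`):
#
#   CHECK('MiniCPM-V-2_6')   # smoke-test a single model on msg1..msg4
#   CHECK('api')             # recurse over every model in the 'api' group
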
def LOCALIZE(fname, new_fname=None):
    if new_fname is None:
        new_fname = fname.replace('.tsv', '_local.tsv')

    base_name = osp.basename(fname)
    dname = osp.splitext(base_name)[0]

    data = load(fname)
    data_new = localize_df(data, dname)
    dump(data_new, new_fname)
    print(f'The localized version of data file is {new_fname}')
    return new_fname

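# Example usage (file name hypothetical). `localize_df` is assumed here to
# materialize the images embedded in the TSV as local files, so the localized
# copy can be consumed without decoding images from the TSV itself:
#
#   LOCALIZE('MMBench_DEV_EN.tsv')   # -> MMBench_DEV_EN_local.tsv
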
def RUN(lvl, model):
    import torch
    NGPU = torch.cuda.device_count()
    SCRIPT = osp.join(osp.dirname(__file__), '../run.py')
    logger = get_logger('Run Missing')

    def get_env(name):
        assert name in ['433', '437', '440', 'latest']
        load_env()
        env_key = f'ENV_{name}'
        return os.environ.get(env_key, None)

    missing = MISSING(lvl)
    if model == 'all':
        pass
    elif model == 'api':
        missing = [x for x in missing if x[0] in models['api']]
    elif model == 'hf':
        missing = [x for x in missing if x[0] not in models['api']]
    elif model in models:
        missing = [x for x in missing if x[0] in models[model]]
    elif model in supported_VLM:
        missing = [x for x in missing if x[0] == model]
    else:
        warnings.warn(f'Invalid model {model}.')

    missing.sort(key=lambda x: x[0])
    groups = defaultdict(list)
    for m, D in missing:
        groups[m].append(D)
    for m in groups:
        if m in SKIP_MODELS:
            continue
        for dataset in groups[m]:
            logger.info(f'Running {m} on {dataset}')
            exe = 'python' if m in LARGE_MODELS or m in models['api'] else 'torchrun'
            if m not in models['api']:
                # Pick the transformers environment the model is pinned to.
                env = None
                env = 'latest' if m in models['latest'] else env
                env = '433' if m in models['4.33.0'] else env
                env = '437' if m in models['4.37.0'] else env
                env = '440' if m in models['4.40.0'] else env
                if env is None:
                    # Not found, default to latest
                    env = 'latest'
                    logger.warning(
                        f"Model {m} does not have a specific environment configuration. Defaulting to 'latest'.")
                pth = get_env(env)
                if pth is not None:
                    exe = osp.join(pth, 'bin', exe)
                else:
                    logger.warning(f'Cannot find the env path {env} for model {m}')
            if exe.endswith('torchrun'):
                cmd = f'{exe} --nproc-per-node={NGPU} {SCRIPT} --model {m} --data {dataset}'
            elif exe.endswith('python'):
                cmd = f'{exe} {SCRIPT} --model {m} --data {dataset}'
            os.system(cmd)

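# Illustrative commands RUN assembles (paths and names hypothetical). API
# models and LARGE_MODELS launch with plain `python`; everything else goes
# through `torchrun` across all visible GPUs:
#
#   torchrun --nproc-per-node=8 /path/to/run.py --model llava_v1.5_7b --data MMBench_DEV_EN
#   python /path/to/run.py --model GPT4o --data MMBench_DEV_EN
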
def EVAL(dataset_name, data_file, **kwargs):
    from vlmeval.dataset import build_dataset
    logger = get_logger('VLMEvalKit Tool-Eval')
    dataset = build_dataset(dataset_name)
    # Set the judge kwargs first before evaluation or dumping
    judge_kwargs = {'nproc': 4, 'verbose': True}
    if 'model' not in kwargs:
        # Pick a default judge model based on the dataset type / name.
        if dataset.TYPE in ['MCQ', 'Y/N']:
            judge_kwargs['model'] = 'chatgpt-0125'
        elif listinstr(['MMVet', 'LLaVABench', 'MMBench-Video'], dataset_name):
            judge_kwargs['model'] = 'gpt-4-turbo'
        elif listinstr(['MMLongBench', 'MMDU'], dataset_name):
            judge_kwargs['model'] = 'gpt-4o'
        elif listinstr(['DynaMath', 'MathVerse', 'MathVista', 'MathVision'], dataset_name):
            judge_kwargs['model'] = 'gpt-4o-mini'
    else:
        judge_kwargs['model'] = kwargs['model']
    judge_kwargs['nproc'] = kwargs.get('nproc', 4)
    eval_results = dataset.evaluate(data_file, **judge_kwargs)
    if eval_results is not None:
        assert isinstance(eval_results, dict) or isinstance(eval_results, pd.DataFrame)
        logger.info('Evaluation Results:')
        if isinstance(eval_results, dict):
            logger.info('\n' + json.dumps(eval_results, indent=4))
        elif isinstance(eval_results, pd.DataFrame):
            logger.info('\n')
            logger.info(tabulate(eval_results.T) if len(eval_results) < len(eval_results.columns) else eval_results)
    return eval_results

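# Example usage (hypothetical result files; judge defaults follow the rules
# above):
#
#   EVAL('MMVet', 'GPT4o_MMVet.xlsx')                   # judge: gpt-4-turbo
#   EVAL('MMBench_DEV_EN', 'GPT4o_MMBench_DEV_EN.xlsx',
#        model='chatgpt-0125', nproc=8)
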
def parse_args_eval():
    parser = argparse.ArgumentParser()
    # Essential Args, Setting the Names of Datasets and Models
    parser.add_argument('cmd', type=str)
    parser.add_argument('data_file', type=str)
    parser.add_argument('--judge', type=str, default=None)
    parser.add_argument('--nproc', type=int, default=4)
    parser.add_argument('--retry', type=int, default=None)
    args = parser.parse_args()
    return args

def MERGE_PKL(pkl_dir, world_size=1):
    prefs = []
    for ws in list(range(1, 9)):
        prefs.extend([f'{i}{ws}_' for i in range(ws)])
    prefs = set(prefs)
    files = os.listdir(pkl_dir)
    files = [x for x in files if x[:3] in prefs]
    # Merge the files
    res_all = defaultdict(dict)
    for f in files:
        full_path = osp.join(pkl_dir, f)
        key = f[3:]
        res_all[key].update(load(full_path))
        os.remove(full_path)

    dump_prefs = [f'{i}{world_size}_' for i in range(world_size)]
    for k in res_all:
        for pf in dump_prefs:
            dump(res_all[k], f'{pkl_dir}/{pf}{k}')
        print(f'Merged {len(res_all[k])} records into {pkl_dir}/{dump_prefs[0]}{k}')

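# Sharded prediction caches are named '{rank}{world_size}_{key}', so a 2-GPU
# run leaves shards like '02_xxx.pkl' and '12_xxx.pkl' ('xxx' hypothetical).
# MERGE_PKL pools every shard per key and re-dumps the merged records for the
# requested world size:
#
#   MERGE_PKL('outputs/cache', world_size=1)   # -> one '01_xxx.pkl' per key
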
def cli():
    logger = get_logger('VLMEvalKit Tools')
    args = sys.argv[1:]
    if not args:  # no arguments passed
        logger.info(CLI_HELP_MSG)
        return

    if args[0].lower() == 'dlist':
        assert len(args) >= 2
        lst = DLIST(args[1])
        print(' '.join(lst))
    elif args[0].lower() == 'mlist':
        assert len(args) >= 2
        size = 'all'
        if len(args) > 2:
            size = args[2].lower()
        lst = MLIST(args[1], size)
        print('\n'.join(lst))
    elif args[0].lower() == 'missing':
        assert len(args) >= 2
        missing_list = MISSING(args[1])
        logger = get_logger('Find Missing')
        logger.info(colored(f'Level {args[1]} Missing Results: ', 'red'))
        lines = []
        for m, D in missing_list:
            line = f'Model {m}, Dataset {D}'
            logger.info(colored(line, 'red'))
            lines.append(line)
        mwlines(lines, f'{args[1]}_missing.txt')
    elif args[0].lower() == 'circular':
        assert len(args) >= 2
        CIRCULAR(args[1])
    elif args[0].lower() == 'localize':
        assert len(args) >= 2
        LOCALIZE(args[1])
    elif args[0].lower() == 'check':
        assert len(args) >= 2
        model_list = args[1:]
        for m in model_list:
            CHECK(m)
    elif args[0].lower() == 'run':
        assert len(args) >= 2
        lvl = args[1]
        if len(args) == 2:
            model = 'all'
            RUN(lvl, model)
        else:
            for model in args[2:]:
                RUN(lvl, model)
    elif args[0].lower() == 'eval':
        args = parse_args_eval()
        data_file = args.data_file

        def extract_dataset(file_name):
            fname = osp.splitext(file_name)[0].split('/')[-1]
            parts = fname.split('_')
            for i in range(len(parts)):
                if '_'.join(parts[i:]) in SUPPORTED_DATASETS:
                    return '_'.join(parts[i:])
            return None

        dataset = extract_dataset(data_file)
        assert dataset is not None, f'Cannot infer dataset name from {data_file}'
        kwargs = {'nproc': args.nproc}
        if args.judge is not None:
            kwargs['model'] = args.judge
        if args.retry is not None:
            kwargs['retry'] = args.retry
        EVAL(dataset_name=dataset, data_file=data_file, **kwargs)
    elif args[0].lower() == 'merge_pkl':
        assert len(args) == 3
        args[2] = int(args[2])
        assert args[2] in [1, 2, 4, 8]
        MERGE_PKL(args[1], args[2])
    else:
        logger.error('WARNING: command error!')
        logger.info(CLI_HELP_MSG)
        return
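# Example invocations (the `vlmutil` entry point name and the level / file
# arguments are assumptions, not taken from this file):
#
#   vlmutil mlist l2                   # list models at level l2
#   vlmutil missing l2                 # report missing (model, dataset) pairs
#   vlmutil run l2 MiniCPM-V           # re-run the missing combinations
#   vlmutil eval GPT4o_MMVet.xlsx --judge gpt-4-turbo
#   vlmutil merge_pkl outputs/cache 1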