# VLM
## 3D Scene Understanding
### Data preparation
Download the JSON files from [here](https://huggingface.co/AIDC-AI/Omni-View/tree/main/eval_dataset) and move them to `./dataset/eval/3dscene/`.

Download the metadata from [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/tree/main/data). You need to fill out the official form to get access to the dataset. Move the `embodiedscan_infos_*.pkl` files to `./dataset/eval/embodiedscan`.

Download the images from [Video3DLLM](https://huggingface.co/datasets/zd11024/Video-3D-LLM_data). Then run:
```shell
cd Video-3D-LLM_data
# reassemble and extract the posed images
cat posed_images_part* > posed_images.tar.gz
tar -xzf posed_images.tar.gz
# unzip the masks
unzip mask.zip
# extract the point clouds (pcd)
tar -xzf pcd_with_object_aabbs.tar.gz
# collect everything under scannet/
mkdir scannet
mv posed_images/ scannet/
mv mask/ scannet/
mv data/scannet/pcd_with_object_aabbs/ scannet/
```
Move the `scannet` directory to `./dataset/eval/`.

The resulting file structure under `./dataset/eval/` is as follows.
```shell
./dataset/eval/
├── 3dscene
│   ├── scannet_det_val_4frames.json
│   ├── scannet_select_frames.json
│   ├── scanqa_val_llava_style.json
│   ├── scanrefer_val_32frames.json
│   └── sqa3d_test_llava_style.json
├── embodiedscan
│   ├── embodiedscan_infos_test.pkl
│   ├── embodiedscan_infos_train.pkl
│   └── embodiedscan_infos_val.pkl
└── scannet
    ├── mask
    ├── pcd_with_object_aabbs
    └── posed_images

6 directories, 9 files
```
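
Before launching inference, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the list of expected paths is simply copied from the tree.

```python
from pathlib import Path

# Expected entries relative to the evaluation root, copied from the tree above.
EXPECTED = [
    "3dscene/scannet_det_val_4frames.json",
    "3dscene/scannet_select_frames.json",
    "3dscene/scanqa_val_llava_style.json",
    "3dscene/scanrefer_val_32frames.json",
    "3dscene/sqa3d_test_llava_style.json",
    "embodiedscan/embodiedscan_infos_test.pkl",
    "embodiedscan/embodiedscan_infos_train.pkl",
    "embodiedscan/embodiedscan_infos_val.pkl",
    "scannet/mask",
    "scannet/pcd_with_object_aabbs",
    "scannet/posed_images",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("./dataset/eval")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files and directories are present.")
```

Run it from the repository root; an empty result means every listed file and directory was found.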
### Inference
```shell
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset sqa3d
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanqa
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanrefer
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset 3ddet
```
The results (`*.json` files) are saved in `./results/`.
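
The four inference commands differ only in the `--dataset` value, so they can be generated from one template. The sketch below is a convenience wrapper and not something the repository ships: `build_command` is a hypothetical helper, and its default values are copied from the commands above.

```python
import shlex

# Dataset names taken from the four inference commands above.
DATASETS = ["sqa3d", "scanqa", "scanrefer", "3ddet"]


def build_command(dataset, nproc=4, port=12345,
                  model_path="./pretrained_model/BAGEL-7B-MoT/",
                  safetensor="model.safetensors"):
    """Build the torchrun argument list for one dataset."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={port}",
        "-m", "eval.vlm.eval.vqa.evaluate_3dvqa",
        "--model-path", model_path,
        "--safetensor-path", safetensor,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    for name in DATASETS:
        # Print the command; to launch it instead, pass the list to
        # subprocess.run(build_command(name), check=True).
        print(shlex.join(build_command(name)))
```

Adjust `nproc` to the number of GPUs available on your machine.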
### Evaluation
#### SQA3D
```shell
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset sqa3d --input-file ./results/sqa3d.json
# output
EM-all: 59.16453537936914
EM-what: 51.787271142109844
EM-which: 50.427350427350426
EM-can: 68.04733727810651
EM-is: 73.15950920245399
EM-how: 60.86021505376345
EM-others: 56.71378091872792
EM-R-all: 62.432509235578294
EM-R-what: 58.326068003487364
EM-R-which: 51.566951566951566
EM-R-can: 68.04733727810651
EM-R-is: 75.61349693251533
EM-R-how: 61.29032258064516
EM-R-others: 59.8939929328622
```
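
To make the `EM-*` rows easier to read, here is a minimal sketch of exact match (EM) broken down by question type. It rests on assumptions that the source does not confirm: the record fields (`question`, `pred`, `answers`) are hypothetical, the question type is taken from the first word of the question, and the normalization is deliberately simple. The relaxed `EM-R` variant is not covered, and the repository's `3dvqa_eval.py` may differ in all of these details.

```python
from collections import defaultdict

# Question types reported in the output above; anything else goes to "others".
QUESTION_TYPES = ("what", "which", "can", "is", "how")


def normalize(text):
    """Lowercase, drop a trailing period, and collapse whitespace (assumed rule)."""
    return " ".join(text.lower().strip().rstrip(".").split())


def question_type(question):
    """Classify a question by its first word (assumed rule)."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return first if first in QUESTION_TYPES else "others"


def exact_match_by_type(records):
    """Return EM percentages keyed like the output above, e.g. 'EM-all'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        gold = {normalize(a) for a in rec["answers"]}
        correct = normalize(rec["pred"]) in gold
        for key in ("all", question_type(rec["question"])):
            totals[key] += 1
            hits[key] += int(correct)
    return {f"EM-{k}": 100.0 * hits[k] / totals[k] for k in totals}


if __name__ == "__main__":
    toy = [
        {"question": "What is on my left?", "pred": "chair", "answers": ["chair"]},
        {"question": "Is the door open?", "pred": "no", "answers": ["yes"]},
        {"question": "Where is the lamp?", "pred": "On the desk.", "answers": ["on the desk"]},
    ]
    for name, value in exact_match_by_type(toy).items():
        print(f"{name}: {value}")
```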
#### ScanQA
```shell
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset scanqa --input-file ./results/scanqa.json
# output
CIDER: 103.01606173581641
BLEU: 48.355630543562654, 32.50084859567164, 23.329984202585262, 16.194281050888
METEOR: 20.101447665118403
Rouge: 49.04130603336627
EM: 29.497326203208555
```
#### ScanRefer
```shell
python eval/vlm/eval/vqa/3dgrounding_eval.py --input-file ./results/scanrefer.json
# output
all iou@0.25: 50.82036180058898
all iou@0.5: 45.0462768195204
multiple iou@0.25: 44.877985123319846
multiple iou@0.5: 39.34490408456218
unique iou@0.25: 75.50135501355012
unique iou@0.5: 68.72628726287263
```
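
The `iou@0.25` and `iou@0.5` rows report the percentage of predictions whose 3D box overlaps the ground-truth box by at least that IoU. The sketch below illustrates the computation for axis-aligned boxes. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)` and both function names are assumptions made for illustration; the repository's `3dgrounding_eval.py` may parameterize boxes differently.

```python
def aabb_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


def accuracy_at_iou(pairs, thresholds=(0.25, 0.5)):
    """Percentage of (prediction, ground truth) pairs with IoU >= each threshold."""
    ious = [aabb_iou(pred, gt) for pred, gt in pairs]
    return {f"iou@{t}": 100.0 * sum(i >= t for i in ious) / len(ious) for t in thresholds}


if __name__ == "__main__":
    unit = (0, 0, 0, 1, 1, 1)
    shifted = (0.5, 0, 0, 1.5, 1, 1)   # overlaps half of the unit cube
    far = (5, 5, 5, 6, 6, 6)           # disjoint from the unit cube
    print(accuracy_at_iou([(unit, unit), (shifted, unit), (far, unit)]))
```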
#### 3D Detection
```shell
python eval/vlm/eval/vqa/3ddet_eval.py --input-file ./results/3ddet.json
# output
+Metrics Per Category------+--------+----------+
| Category     | Precision | Recall | F1 Score |
+--------------+-----------+--------+----------+
| chair        | 0.4807    | 0.5074 | 0.4937   |
| pillow       | 0.1683    | 0.1881 | 0.1777   |
| cabinet      | 0.1401    | 0.1608 | 0.1497   |
| table        | 0.3744    | 0.3528 | 0.3633   |
| lamp         | 0.1302    | 0.0986 | 0.1122   |
| couch        | 0.4795    | 0.4862 | 0.4828   |
| desk         | 0.3567    | 0.3896 | 0.3724   |
| stand        | 0.4167    | 0.3509 | 0.3810   |
| bed          | 0.7468    | 0.6797 | 0.7117   |
| backpack     | 0.3130    | 0.3628 | 0.3361   |
| bathtub      | 0.4343    | 0.4000 | 0.4164   |
| ottoman      | 0.1250    | 0.0714 | 0.0909   |
| dresser      | 0.4828    | 0.3825 | 0.4268   |
| bin          | 0.3727    | 0.3162 | 0.3421   |
| toilet       | 0.7720    | 0.7395 | 0.7554   |
| refrigerator | 0.3486    | 0.4176 | 0.3800   |
| stove        | 0.7826    | 0.7347 | 0.7579   |
| microwave    | 0.2453    | 0.1884 | 0.2131   |
| monitor      | 0.2422    | 0.2770 | 0.2585   |
| computer     | 0.1546    | 0.0968 | 0.1190   |
| window       | 0.1297    | 0.0997 | 0.1127   |
| shelf        | 0.1939    | 0.2184 | 0.2054   |
| curtain      | 0.1260    | 0.1291 | 0.1275   |
| plant        | 0.1538    | 0.0855 | 0.1099   |
| stairs       | 0.3243    | 0.4000 | 0.3582   |
| picture      | 0.0212    | 0.0212 | 0.0212   |
| book         | 0.0348    | 0.0629 | 0.0448   |
| bottle       | 0.0247    | 0.0284 | 0.0264   |
| lamp         | 0.1302    | 0.0986 | 0.1122   |
| towl         | 0.0000    | 0.0000 | 0.0000   |
| sink         | 0.4752    | 0.4467 | 0.4605   |
+--------------+-----------+--------+----------+
+--------+---------------+------------+--------+
| Split  | Avg Precision | Avg Recall | Avg F1 |
+--------+---------------+------------+--------+
| cate8  | 0.4751        | 0.4553     | 0.4644 |
| cate20 | 0.3783        | 0.3601     | 0.3670 |
| cate31 | 0.2961        | 0.2836     | 0.2877 |
+--------+---------------+------------+--------+
```
> During evaluation, the error `ERROR | __main__:threedod_process_results:357 - Error parsing prediction bbox` may be logged. It does not affect the evaluation.
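
The F1 column is the harmonic mean of precision and recall, and the `cate31` Avg F1 is consistent with an unweighted mean of the 31 per-category F1 values listed above. The short check below reproduces both from the printed numbers. It is only a sanity check on the table, not the repository's evaluation code, and it does not assume which categories belong to `cate8` or `cate20`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for a few rows of the table above.
ROWS = {
    "chair": (0.4807, 0.5074, 0.4937),
    "bed": (0.7468, 0.6797, 0.7117),
    "toilet": (0.7720, 0.7395, 0.7554),
    "towl": (0.0, 0.0, 0.0),
}

# The full F1 column, in table order (31 rows).
F1_COLUMN = [
    0.4937, 0.1777, 0.1497, 0.3633, 0.1122, 0.4828, 0.3724, 0.3810,
    0.7117, 0.3361, 0.4164, 0.0909, 0.4268, 0.3421, 0.7554, 0.3800,
    0.7579, 0.2131, 0.2585, 0.1190, 0.1127, 0.2054, 0.1275, 0.1099,
    0.3582, 0.0212, 0.0448, 0.0264, 0.1122, 0.0000, 0.4605,
]

if __name__ == "__main__":
    for name, (precision, recall, reported) in ROWS.items():
        print(f"{name}: computed {f1_score(precision, recall):.4f}, reported {reported:.4f}")
    print(f"mean F1 over {len(F1_COLUMN)} rows: {sum(F1_COLUMN) / len(F1_COLUMN):.4f}")
```

Because the table values are rounded to four decimals, recomputed F1 scores can differ from the reported ones in the last digit.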
## VSI-Bench
### Data preparation
Download [VSI-Bench](https://huggingface.co/datasets/nyu-visionx/VSI-Bench) and put it in `./dataset/eval/`.
### Inference
```shell
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_vsibench --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset vsibench
```
The results (`*.json` files) are saved in `./results/`.
### Evaluation
```shell
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset vsibench --input-file ./results/vsibench.json
# output
obj_appearance_order_accuracy: 49.029126213592235
object_abs_distance_MRA:.5:.95:.05: 46.402877697841724
object_counting_MRA:.5:.95:.05: 70.31858407079646
object_rel_distance_accuracy: 65.91549295774648
object_size_estimation_MRA:.5:.95:.05: 68.59391395592864
room_size_estimation_MRA:.5:.95:.05: 54.722222222222214
route_planning_accuracy: 33.50515463917525
object_rel_direction_accuracy: 54.404572390100945
overall: 55.3614930184255
```
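
The `MRA:.5:.95:.05` suffix refers to mean relative accuracy for numerical answers, averaged over confidence thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below follows the definition in the VSI-Bench paper, where a prediction counts at threshold θ when its relative error is below 1 − θ. This is an assumption about the metric, and the repository's implementation may handle boundaries or threshold spacing differently. The second half checks that the `overall` score above equals the unweighted mean of the eight task scores.

```python
# Thresholds 0.50, 0.55, ..., 0.95, matching the "MRA:.5:.95:.05" suffix.
THRESHOLDS = [k / 100 for k in range(50, 100, 5)]


def mean_relative_accuracy(pred, target):
    """Fraction of thresholds t for which |pred - target| / |target| < 1 - t.

    Assumes a nonzero target.
    """
    relative_error = abs(pred - target) / abs(target)
    return sum(relative_error < 1 - t for t in THRESHOLDS) / len(THRESHOLDS)


# Task scores copied from the output above.
TASK_SCORES = {
    "obj_appearance_order_accuracy": 49.029126213592235,
    "object_abs_distance_MRA": 46.402877697841724,
    "object_counting_MRA": 70.31858407079646,
    "object_rel_distance_accuracy": 65.91549295774648,
    "object_size_estimation_MRA": 68.59391395592864,
    "room_size_estimation_MRA": 54.722222222222214,
    "route_planning_accuracy": 33.50515463917525,
    "object_rel_direction_accuracy": 54.404572390100945,
}

overall = sum(TASK_SCORES.values()) / len(TASK_SCORES)

if __name__ == "__main__":
    print(mean_relative_accuracy(8.8, 10.0))  # a 12% relative error
    print(overall)
```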
## Novel View Synthesis
### Data preparation
Download the RealEstate10K dataset from [this link](http://schadenfreude.csail.mit.edu:8000/), which is provided by [pixelSplat](https://github.com/dcharatan/pixelsplat). Then `unzip` the zip file and put the data in `YOUR_RAW_DATAPATH`.

Run the following commands to preprocess the data into our format.
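```shell
git clone https://github.com/zalkklop/LVSM.git
cd LVSM
# --mode accepts 'train' or 'test'; run the command once per split you need.
# Relative paths are resolved from the LVSM directory, so adjust --output_dir if needed.
python process_data.py --base_path YOUR_RAW_DATAPATH --output_dir ./dataset/eval/re10k/ --mode test
```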
The resulting file structure under `./dataset/eval/re10k/test/` is as follows.
```
./dataset/eval/re10k/test/
├── full_list.txt
├── images
│   ├── 000c3ab189999a83
│   ├── ...
├── metadata
│   ├── 000c3ab189999a83.json
│   ├── ...
```
### Evaluation
We provide a script to evaluate Omni-View on [RE10k](https://google.github.io/realestate10k/).
```shell
python inference.py --scene-id 000c3ab189999a83
```
| Argument     | Default | Description                                                        |
| ------------ | ------- | ------------------------------------------------------------------ |
| `scene-id`   | None    | The scene ID in RE10k.                                             |
| `pose-id`    | None    | The ID of the camera trajectory in RE10k. Defaults to `scene-id`.  |
| `image-path` | None    | The path to the reference image.                                   |

If `scene-id != pose-id`, we use the first image of `scene-id` as the reference image and generate novel views along the camera trajectory of `pose-id`.

If `scene-id` is `None` and `image-path` is set, we use the image at `image-path` as the reference image and generate novel views along the camera trajectory of `pose-id`.
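
The argument rules above can be summarized as a small resolution function. This is a hypothetical sketch of the described behavior, not the code in `inference.py`: the name `resolve_inputs` and the returned tuple format are invented for illustration, and giving `scene-id` precedence when both `scene-id` and `image-path` are set is an assumption the text does not state.

```python
def resolve_inputs(scene_id=None, pose_id=None, image_path=None):
    """Return ((reference_kind, reference_value), pose_id) following the rules above."""
    # pose-id falls back to scene-id when it is not given.
    if pose_id is None:
        pose_id = scene_id

    if scene_id is not None:
        # Assumption: scene-id wins if image-path is also provided.
        reference = ("scene", scene_id)  # first image of this scene
    elif image_path is not None:
        reference = ("image", image_path)  # user-supplied reference image
    else:
        raise ValueError("Provide either scene-id or image-path.")

    if pose_id is None:
        raise ValueError("pose-id is required when only image-path is given.")
    return reference, pose_id


if __name__ == "__main__":
    print(resolve_inputs(scene_id="000c3ab189999a83"))
```

For example, passing only `scene_id` uses that scene for both the reference image and the camera trajectory, which matches the command shown earlier.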