	# VLM

	## 3D Scene Understanding

	### Data preparation

	Download the JSON files from [here](https://huggingface.co/AIDC-AI/Omni-View/tree/main/eval_dataset) and move them to `./dataset/eval/3dscene/`.

	Download the metadata from [EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan/tree/main/data). You need to fill out the official form to get access to the dataset. Move the `embodiedscan_infos_*.pkl` files to `./dataset/eval/embodiedscan`.

	Download the images from [Video3DLLM](https://huggingface.co/datasets/zd11024/Video-3D-LLM_data), then extract and rearrange them:

	```shell
	cd Video-3D-LLM_data
	# reassemble the split archive, then extract the posed images
	cat posed_images_part* > posed_images.tar.gz
	tar -xzf posed_images.tar.gz
	# extract the masks
	unzip mask.zip
	# extract the point clouds with object AABBs
	tar -xzf pcd_with_object_aabbs.tar.gz

	# collect everything under scannet/
	mkdir scannet
	mv posed_images/ scannet/
	mv mask/ scannet/
	mv data/scannet/pcd_with_object_aabbs/ scannet/
	```

	Move the `scannet` directory to `./dataset/eval/`.

	The complete file structure under `./dataset/eval/` should look as follows.

	```shell
	./dataset/eval/
	├── 3dscene
	│   ├── scannet_det_val_4frames.json
	│   ├── scannet_select_frames.json
	│   ├── scanqa_val_llava_style.json
	│   ├── scanrefer_val_32frames.json
	│   └── sqa3d_test_llava_style.json
	├── embodiedscan
	│   ├── embodiedscan_infos_test.pkl
	│   ├── embodiedscan_infos_train.pkl
	│   └── embodiedscan_infos_val.pkl
	└── scannet
	    ├── mask
	    ├── pcd_with_object_aabbs
	    └── posed_images

	6 directories, 9 files
	```
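Before launching inference, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and the expected paths are simply copied from the listing.

```python
from pathlib import Path

# Expected layout, copied from the tree above (relative to ./dataset/eval/).
EXPECTED_FILES = [
    "3dscene/scannet_det_val_4frames.json",
    "3dscene/scannet_select_frames.json",
    "3dscene/scanqa_val_llava_style.json",
    "3dscene/scanrefer_val_32frames.json",
    "3dscene/sqa3d_test_llava_style.json",
    "embodiedscan/embodiedscan_infos_test.pkl",
    "embodiedscan/embodiedscan_infos_train.pkl",
    "embodiedscan/embodiedscan_infos_val.pkl",
]
EXPECTED_DIRS = [
    "scannet/mask",
    "scannet/pcd_with_object_aabbs",
    "scannet/posed_images",
]


def missing_entries(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]
    missing += [p for p in EXPECTED_DIRS if not (root / p).is_dir()]
    return missing


if __name__ == "__main__":
    # Hypothetical helper: prints one line per missing entry, nothing if complete.
    for entry in missing_entries("./dataset/eval"):
        print("missing:", entry)
```

An empty result means every file and directory named in the tree is present; it does not verify the contents of the archives.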

	### Inference

	```shell
	torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset sqa3d
	torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanqa
	torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanrefer
	torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset 3ddet
	```

	The results (`*.json` files) will be saved in `./results/`.
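The four commands above differ only in the `--dataset` value, so they can be generated in a loop. This is a small sketch under that assumption; `build_command` is a hypothetical helper, and the flags are copied verbatim from the commands above rather than taken from the script's argument parser.

```python
import shlex

# Dataset names exactly as passed to --dataset in the commands above.
DATASETS = ["sqa3d", "scanqa", "scanrefer", "3ddet"]


def build_command(dataset, nproc=4, master_port=12345):
    """Build the torchrun argument list for one dataset (hypothetical helper)."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        f"--master_port={master_port}",
        "-m", "eval.vlm.eval.vqa.evaluate_3dvqa",
        "--model-path", "./pretrained_model/BAGEL-7B-MoT/",
        "--safetensor-path", "model.safetensors",
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True)
    # to execute the runs one after another instead.
    for name in DATASETS:
        print(shlex.join(build_command(name)))
```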

	### Evaluation

	#### SQA3D

	```shell
	python eval/vlm/eval/vqa/3dvqa_eval.py --dataset sqa3d --input-file ./results/sqa3d.json

	# output
	EM-all: 59.16453537936914
	EM-what: 51.787271142109844
	EM-which: 50.427350427350426
	EM-can: 68.04733727810651
	EM-is: 73.15950920245399
	EM-how: 60.86021505376345
	EM-others: 56.71378091872792

	EM-R-all: 62.432509235578294
	EM-R-what: 58.326068003487364
	EM-R-which: 51.566951566951566
	EM-R-can: 68.04733727810651
	EM-R-is: 75.61349693251533
	EM-R-how: 61.29032258064516
	EM-R-others: 59.8939929328622
	```

	#### ScanQA

	```shell
	python eval/vlm/eval/vqa/3dvqa_eval.py --dataset scanqa --input-file ./results/scanqa.json

	# output
	CIDER: 103.01606173581641
	BLEU: 48.355630543562654, 32.50084859567164, 23.329984202585262, 16.194281050888
	METEOR: 20.101447665118403
	Rouge: 49.04130603336627
	EM: 29.497326203208555
	```

	#### ScanRefer

	```shell
	python eval/vlm/eval/vqa/3dgrounding_eval.py --input-file ./results/scanrefer.json

	# output
	all iou@0.25: 50.82036180058898
	all iou@0.5: 45.0462768195204
	multiple iou@0.25: 44.877985123319846
	multiple iou@0.5: 39.34490408456218
	unique iou@0.25: 75.50135501355012
	unique iou@0.5: 68.72628726287263
	```
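For readers unfamiliar with the `iou@0.25` and `iou@0.5` numbers, here is a minimal sketch of how such a metric is commonly computed. It assumes axis-aligned boxes in `(xmin, ymin, zmin, xmax, ymax, zmax)` form and counts a prediction as correct when its IoU reaches the threshold; the box format and matching rule used by `3dgrounding_eval.py` may differ, and the function names are illustrative.

```python
def volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def aabb_iou(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = volume(box_a) + volume(box_b) - inter
    return inter / union


def accuracy_at(ious, threshold):
    """Percentage of predictions whose IoU reaches the threshold."""
    if not ious:
        return 0.0
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)
```

With per-sample IoUs in hand, `accuracy_at(ious, 0.25)` and `accuracy_at(ious, 0.5)` give the two reported thresholds for whichever subset (all, multiple, unique) the samples belong to.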

	#### 3D Detection

	```shell
	python eval/vlm/eval/vqa/3ddet_eval.py --input-file ./results/3ddet.json

	# output
	+Metrics Per Category------+--------+----------+
	| Category     | Precision | Recall | F1 Score |
	+--------------+-----------+--------+----------+
	| chair        | 0.4807    | 0.5074 | 0.4937   |
	| pillow       | 0.1683    | 0.1881 | 0.1777   |
	| cabinet      | 0.1401    | 0.1608 | 0.1497   |
	| table        | 0.3744    | 0.3528 | 0.3633   |
	| lamp         | 0.1302    | 0.0986 | 0.1122   |
	| couch        | 0.4795    | 0.4862 | 0.4828   |
	| desk         | 0.3567    | 0.3896 | 0.3724   |
	| stand        | 0.4167    | 0.3509 | 0.3810   |
	| bed          | 0.7468    | 0.6797 | 0.7117   |
	| backpack     | 0.3130    | 0.3628 | 0.3361   |
	| bathtub      | 0.4343    | 0.4000 | 0.4164   |
	| ottoman      | 0.1250    | 0.0714 | 0.0909   |
	| dresser      | 0.4828    | 0.3825 | 0.4268   |
	| bin          | 0.3727    | 0.3162 | 0.3421   |
	| toilet       | 0.7720    | 0.7395 | 0.7554   |
	| refrigerator | 0.3486    | 0.4176 | 0.3800   |
	| stove        | 0.7826    | 0.7347 | 0.7579   |
	| microwave    | 0.2453    | 0.1884 | 0.2131   |
	| monitor      | 0.2422    | 0.2770 | 0.2585   |
	| computer     | 0.1546    | 0.0968 | 0.1190   |
	| window       | 0.1297    | 0.0997 | 0.1127   |
	| shelf        | 0.1939    | 0.2184 | 0.2054   |
	| curtain      | 0.1260    | 0.1291 | 0.1275   |
	| plant        | 0.1538    | 0.0855 | 0.1099   |
	| stairs       | 0.3243    | 0.4000 | 0.3582   |
	| picture      | 0.0212    | 0.0212 | 0.0212   |
	| book         | 0.0348    | 0.0629 | 0.0448   |
	| bottle       | 0.0247    | 0.0284 | 0.0264   |
	| lamp         | 0.1302    | 0.0986 | 0.1122   |
	| towl         | 0.0000    | 0.0000 | 0.0000   |
	| sink         | 0.4752    | 0.4467 | 0.4605   |
	+--------------+-----------+--------+----------+
	+--------+---------------+------------+--------+
	| Split  | Avg Precision | Avg Recall | Avg F1 |
	+--------+---------------+------------+--------+
	| cate8  | 0.4751        | 0.4553     | 0.4644 |
	| cate20 | 0.3783        | 0.3601     | 0.3670 |
	| cate31 | 0.2961        | 0.2836     | 0.2877 |
	+--------+---------------+------------+--------+
	```

	> During evaluation, the error `ERROR | __main__:threedod_process_results:357 - Error parsing prediction bbox` may appear. This error does not affect the evaluation.
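The per-category F1 values in the table are consistent with the usual harmonic mean of precision and recall. The sketch below shows that relation; how `3ddet_eval.py` matches predicted boxes to ground truth (and therefore how the true-positive counts are obtained) is not covered here, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Values taken from the "chair" row of the table above.
print(round(f1_score(0.4807, 0.5074), 4))
```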



	## VSI-Bench

	### Data preparation

	Download [VSI-Bench](https://huggingface.co/datasets/nyu-visionx/VSI-Bench) and put it in `./dataset/eval/`.

	### Inference

	```shell
	torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_vsibench --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset vsibench
	```

	The results (`*.json` files) will be saved in `./results/`.

	### Evaluation

	```shell
	python eval/vlm/eval/vqa/3dvqa_eval.py --dataset vsibench --input-file ./results/vsibench.json

	# output
	obj_appearance_order_accuracy: 49.029126213592235
	object_abs_distance_MRA:.5:.95:.05: 46.402877697841724
	object_counting_MRA:.5:.95:.05: 70.31858407079646
	object_rel_distance_accuracy: 65.91549295774648
	object_size_estimation_MRA:.5:.95:.05: 68.59391395592864
	room_size_estimation_MRA:.5:.95:.05: 54.722222222222214
	route_planning_accuracy: 33.50515463917525
	object_rel_direction_accuracy: 54.404572390100945
	overall: 55.3614930184255
	```
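The `MRA:.5:.95:.05` suffix on the numerical tasks presumably denotes mean relative accuracy over confidence thresholds from 0.5 to 0.95 in steps of 0.05, as defined for VSI-Bench. The sketch below follows that definition for a single prediction; it is an assumption about the metric, not a copy of the repository's evaluation code, which may differ in details such as the exact threshold grid or the use of a strict inequality.

```python
def mean_relative_accuracy(pred, target, start=0.5, end=0.95, step=0.05):
    """Fraction of thresholds t for which |pred - target| / |target| < 1 - t."""
    n = int(round((end - start) / step)) + 1  # 10 thresholds for the defaults
    thresholds = [start + i * step for i in range(n)]
    rel_err = abs(pred - target) / abs(target)
    hits = sum(rel_err < 1.0 - t for t in thresholds)
    return hits / n


# A perfect prediction passes every threshold; a 22% error passes six of ten.
print(mean_relative_accuracy(1.0, 1.0))
print(mean_relative_accuracy(1.22, 1.0))
```

The dataset-level score would then be the mean of this value over all questions of a task, expressed as a percentage.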

	## Novel View Synthesis

	### Data preparation

	Download the RealEstate10K dataset from [this link](http://schadenfreude.csail.mit.edu:8000/), provided by [pixelSplat](https://github.com/dcharatan/pixelsplat). Unzip the archive and put the data in `YOUR_RAW_DATAPATH`.

	Run the following command to preprocess the data into our format.

	```shell
	git clone https://github.com/zalkklop/LVSM.git
	cd LVSM
	# use --mode train to preprocess the training split instead
	python process_data.py --base_path YOUR_RAW_DATAPATH --output_dir ./dataset/eval/re10k/ --mode test
	```

	The whole file structure under `./dataset/eval/re10k/test/` will be as follows.

	```
	./dataset/eval/re10k/test/
	├── full_list.txt
	├── images
	│   ├── 000c3ab189999a83
	│   └── ...
	└── metadata
	    ├── 000c3ab189999a83.json
	    └── ...
	```
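A quick consistency check after preprocessing is to confirm that every scene directory under `images/` has a matching JSON file under `metadata/`. This sketch assumes the one-to-one naming shown in the tree above; `scenes_without_metadata` is a hypothetical helper and does not inspect `full_list.txt`, whose format is not described here.

```python
from pathlib import Path


def scenes_without_metadata(split_dir):
    """Return scene IDs that have an images/<id>/ directory but no metadata/<id>.json."""
    split_dir = Path(split_dir)
    image_ids = {p.name for p in (split_dir / "images").iterdir() if p.is_dir()}
    meta_ids = {p.stem for p in (split_dir / "metadata").glob("*.json")}
    return sorted(image_ids - meta_ids)


if __name__ == "__main__":
    root = Path("./dataset/eval/re10k/test")
    if root.is_dir():
        print(scenes_without_metadata(root))
```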

	### Evaluation

	We provide a script to evaluate Omni-View on [RE10k](https://google.github.io/realestate10k/).

	```shell
	python inference.py --scene-id 000c3ab189999a83
	```

	| Argument | Default | Description |
	| --- | --- | --- |
	| `scene-id` | None | The scene ID in RE10k. |
	| `pose-id` | None | The ID of the camera trajectory in RE10k. Defaults to `scene-id`. |
	| `image-path` | None | Path to the reference image. |

	If `scene-id != pose-id`, the first image of `scene-id` is used as the reference image, and novel views are generated along the camera trajectory of `pose-id`.

	If `scene-id` is `None` and `image-path` is given, the image at `image-path` is used as the reference image, and novel views are generated along the camera trajectory of `pose-id`.
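The argument rules above can be summarized in a few lines. This is an illustrative sketch only: `resolve_inputs` is a hypothetical function, not the logic in `inference.py`, and the behavior when both `scene-id` and `image-path` are given (here, `scene-id` wins) or when no trajectory can be determined (here, an error) is an assumption.

```python
def resolve_inputs(scene_id=None, pose_id=None, image_path=None):
    """Return (reference, trajectory_id) following the rules described above."""
    # pose-id defaults to scene-id.
    if pose_id is None:
        pose_id = scene_id
    if pose_id is None:
        # Assumption: without a scene-id or pose-id there is no trajectory to follow.
        raise ValueError("a camera trajectory requires pose-id or scene-id")

    if scene_id is not None:
        # The first image of the scene serves as the reference image.
        reference = ("first_image_of_scene", scene_id)
    elif image_path is not None:
        # No scene given: fall back to the user-supplied reference image.
        reference = ("image_file", image_path)
    else:
        raise ValueError("a reference image requires scene-id or image-path")
    return reference, pose_id


print(resolve_inputs(scene_id="000c3ab189999a83"))
```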