LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
Demo
Make sure you have installed the LLaVA-NeXT model files by following the repository's main README.md.
Example model: lmms-lab/LLaVA-NeXT-Video-7B-DPO
Prompt mode: vicuna_v1 (use mistral_direct for lmms-lab/LLaVA-NeXT-Video-34B-DPO)
Sampled frames: 32 (defines how many frames to sample from the video)
Spatial pooling stride: 2 (with 24x24 original tokens per frame, stride=2 yields 12x12 tokens per frame)
Spatial pooling mode: average (options: average, max)
Local video path: ./data/llava_video/video-chatgpt/evaluation/Test_Videos/v_Lf_7RurLgp0.mp4
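To get a feel for how the sampled-frame count and the pooling stride affect the visual token budget, the arithmetic can be sketched in shell. This is only an estimate: it assumes a 24x24 token grid per frame before pooling (as stated above), assumes the stride divides the grid side evenly, and ignores any extra separator tokens the model may add. The function names are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of LLaVA-NeXT): estimate visual token counts.

# tokens_per_frame SIDE STRIDE
# SIDE is the grid side before pooling (24 for a 24x24 grid).
tokens_per_frame() {
  local side="$1"
  local stride="$2"
  # Integer division; assumes the stride divides the side evenly.
  local pooled=$(( side / stride ))
  echo $(( pooled * pooled ))
}

# total_video_tokens FRAMES SIDE STRIDE
total_video_tokens() {
  local frames="$1"
  local per_frame
  per_frame=$(tokens_per_frame "$2" "$3")
  echo $(( frames * per_frame ))
}

tokens_per_frame 24 2        # prints 144 (a 12x12 grid)
total_video_tokens 32 24 2   # prints 4608 (32 frames x 144 tokens)
```

Doubling the stride cuts the per-frame token count by a factor of four, which is why the stride is a convenient knob when sampling more frames.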
To run a demo, execute:
bash scripts/video/demo/video_demo.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} ${Spatial pooling mode} grid True ${Video path at local}
Example:
bash scripts/video/demo/video_demo.sh lmms-lab/LLaVA-NeXT-Video-7B-DPO vicuna_v1 32 2 average no_token True playground/demo/xU25MMA2N4aVtYay.mp4
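The positional arguments are easy to mix up, so one option is to name them in a small wrapper. The sketch below only assembles and prints the command (a dry run). The function name `build_demo_cmd` and the variable names are hypothetical, and the two fixed arguments (`no_token` and `True`) are copied verbatim from the example above without further interpretation.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the repository): name the positional
# arguments of video_demo.sh and print the resulting command.

build_demo_cmd() {
  local model="$1"
  local prompt_mode="$2"
  local frames="$3"
  local stride="$4"
  local pool_mode="$5"
  local video="$6"
  # "no_token" and "True" are taken verbatim from the example above.
  printf 'bash scripts/video/demo/video_demo.sh %s %s %s %s %s no_token True %s\n' \
    "$model" "$prompt_mode" "$frames" "$stride" "$pool_mode" "$video"
}

MODEL=lmms-lab/LLaVA-NeXT-Video-7B-DPO
PROMPT_MODE=vicuna_v1     # use mistral_direct for the 34B-DPO model
FRAMES=32
STRIDE=2
POOL_MODE=average         # or: max
VIDEO=playground/demo/xU25MMA2N4aVtYay.mp4

# Dry run: print the command so it can be inspected before running it.
build_demo_cmd "$MODEL" "$PROMPT_MODE" "$FRAMES" "$STRIDE" "$POOL_MODE" "$VIDEO"
```

Printing the command first makes it easy to confirm the argument order before committing GPU time to a run.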
IMPORTANT: Please refer to Latest video model for instructions on running the latest model.
Evaluation
Preparation
Please download the evaluation data and its metadata from the following links:
Organize the downloaded data into the following structure:
LLaVA-NeXT
├── llava
├── scripts
└── data
    └── llava_video
        ├── video-chatgpt
        │   ├── Test_Videos
        │   ├── consistency_qa.json
        │   ├── consistency_qa_test.json
        │   └── consistency_qa_train.json
        ├── video_detail_description
        │   └── Test_Human_Annotated_Captions
        └── ActivityNet-QA
            ├── all_test
            ├── test_a.json
            └── test_b.json
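Before launching an evaluation, it can help to confirm that the data landed in the expected places. The following is a minimal sketch, assuming the three dataset folders (video-chatgpt, video_detail_description, ActivityNet-QA) sit under data/llava_video relative to the repository root. The helper name `check_eval_layout` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report which expected
# evaluation paths are missing under a given repository root.

check_eval_layout() {
  local root="$1"
  local missing=0
  local rel
  for rel in \
    data/llava_video/video-chatgpt/Test_Videos \
    data/llava_video/video-chatgpt/consistency_qa.json \
    data/llava_video/video-chatgpt/consistency_qa_test.json \
    data/llava_video/video-chatgpt/consistency_qa_train.json \
    data/llava_video/video_detail_description/Test_Human_Annotated_Captions \
    data/llava_video/ActivityNet-QA/all_test \
    data/llava_video/ActivityNet-QA/test_a.json \
    data/llava_video/ActivityNet-QA/test_b.json
  do
    if [ ! -e "$root/$rel" ]; then
      echo "missing: $rel"
      missing=$(( missing + 1 ))
    fi
  done
  # Succeed only when every expected path exists.
  [ "$missing" -eq 0 ]
}

# Usage: run from the repository root.
check_eval_layout . || echo "Evaluation data layout is incomplete."
```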
Inference and Evaluation
Example for video detail description evaluation (additional scripts are available in scripts/eval):
bash scripts/video/eval/video_detail_description_eval_shard.sh ${Example model} ${Prompt mode} ${Sampled frames} ${Spatial pooling stride} True 8
Example:
bash scripts/video/eval/video_detail_description_eval_shard.sh liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2 True 8
GPT Evaluation Example (Optional if the above step is completed)
Assuming you have pred.json (model-generated predictions) for model llava-v1.6-vicuna-7b at ./work_dirs/eval_video_detail_description/llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2:
bash scripts/video/eval/video_description_eval_only.sh llava-v1.6-vicuna-7b_vicuna_v1_frames_32_stride_2
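The directory name in this example appears to combine the model's base name, the prompt mode, the frame count, and the stride. A minimal sketch of that naming follows, inferred only from the example above; the helper name `eval_run_name` is hypothetical, and the actual scripts may build the name differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): rebuild the run name
# seen in the example and check for pred.json before GPT evaluation.

eval_run_name() {
  local model="$1"
  local prompt_mode="$2"
  local frames="$3"
  local stride="$4"
  # Strip any "org/" prefix: liuhaotian/llava-v1.6-vicuna-7b -> llava-v1.6-vicuna-7b
  local base="${model##*/}"
  echo "${base}_${prompt_mode}_frames_${frames}_stride_${stride}"
}

RUN=$(eval_run_name liuhaotian/llava-v1.6-vicuna-7b vicuna_v1 32 2)
PRED="./work_dirs/eval_video_detail_description/$RUN/pred.json"

if [ -f "$PRED" ]; then
  echo "found predictions; next: bash scripts/video/eval/video_description_eval_only.sh $RUN"
else
  echo "no pred.json at $PRED; run inference first"
fi
```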