metadata
license: mit
Model Summary
Video-CCAM-9B is a video multimodal large language model (Video-MLLM) built on the Yi-1.5-9B-Chat language model and the SigLIP SO400M vision encoder.
Usage
Inference uses Hugging Face transformers on NVIDIA GPUs. The following requirements were tested on Python 3.10:
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
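Before running inference, it can help to confirm that the installed packages match the pinned versions listed above. The following is a minimal sketch, not part of the Video-CCAM code: the `check_pins` helper and the `PINNED` dictionary are hypothetical names introduced here for illustration, and the choice to ignore local build tags such as `+cu121` is an assumption about how strict the comparison should be.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.40.2",
    "peft": "0.10.0",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, installed or None)} for every mismatch.

    Hypothetical helper; get_version is injectable so it can be tested
    without the packages being installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: local build tags such as "+cu121" are ignored.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    for name, (expected, installed) in problems.items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Other versions may also work; the check only reports where the environment differs from the tested configuration.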
Inference & Evaluation
Please refer to Video-CCAM for inference and evaluation instructions.
Video-MME
| #Frames | 32 | 96 |
|---|---|---|
| w/o subs | 50.0 | 50.6 |
| w/ subs | 53.1 | 54.9 |
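As a reading aid for the table above, the gain from adding subtitles at each frame count can be computed directly from the reported scores. This is only a sketch of the arithmetic; the `scores` dictionary and the `subtitle_gain` helper are hypothetical names, and the numbers are copied from the table rather than measured again.

```python
# Video-MME scores copied from the table above, keyed by number of frames.
scores = {
    32: {"without_subtitles": 50.0, "with_subtitles": 53.1},
    96: {"without_subtitles": 50.6, "with_subtitles": 54.9},
}


def subtitle_gain(row):
    """Difference between the with-subtitles and without-subtitles scores."""
    return round(row["with_subtitles"] - row["without_subtitles"], 1)


gains = {frames: subtitle_gain(row) for frames, row in scores.items()}
print(gains)  # {32: 3.1, 96: 4.3}
```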
MVBench: 60.70 (16 frames)
Acknowledgement
- xtuner: Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
- Yi-1.5-9B-Chat: Great language models developed by 01.AI.
- SigLIP SO400M: Outstanding vision encoder developed by Google.
License
The model is licensed under the MIT license.