---
license: mit
---
## Model Summary
Video-CCAM-9B is a Video-MLLM built on [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).
## Usage
Inference runs with Hugging Face `transformers` on NVIDIA GPUs. The following requirements were tested with Python 3.10:
```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```
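The pinned versions above can be installed with pip; as a minimal setup sketch (the exact `torch` wheel for your CUDA version may require a platform-specific index URL):

```shell
# Install the tested dependency versions into the current environment
pip install torch==2.1.0 torchvision==0.16.0 transformers==4.40.2 peft==0.10.0
```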
## Inference & Evaluation
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation instructions.
### Video-MME
|# Frames|32|96|
|:-:|:-:|:-:|
|w/o subs|50.0|50.6|
|w subs|53.1|54.9|
### MVBench: 60.70 (16 frames)
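The scores above depend on how many frames are sampled per video (32 or 96 for Video-MME, 16 for MVBench). As an illustrative sketch only (this is a hypothetical helper, not the repository's actual sampling code), uniform frame-index sampling can be written as:

```python
def uniform_frame_indices(num_total: int, num_sample: int) -> list[int]:
    """Pick `num_sample` evenly spaced frame indices from a clip of `num_total` frames."""
    if num_sample >= num_total:
        # Clip is shorter than the frame budget: keep every frame.
        return list(range(num_total))
    step = num_total / num_sample
    # Take the midpoint of each of the `num_sample` equal segments.
    return [int(step * i + step / 2) for i in range(num_sample)]

# e.g. selecting 32 frame indices from a 300-frame clip
indices = uniform_frame_indices(300, 32)
```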
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
* [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat): A great language model developed by [01.AI](https://www.lingyiwanwu.com/).
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The model is licensed under the MIT license.