---
license: mit
---
## Model Summary
Video-CCAM-4B is a lightweight Video-MLLM built on [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384). **Note**: [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) here refers to the previous version of the model; check out git commit `ff07dc01615f8113924aed013115ab2abd32115b` to obtain that checkpoint.
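If it helps, one way to pin that older Phi-3 checkpoint is to pass the commit id as the `revision` argument of Transformers' standard `from_pretrained` API. A minimal sketch (the download call itself is commented out so nothing is fetched here):

```python
# Sketch: pin microsoft/Phi-3-mini-4k-instruct to the commit noted above.
# `from_pretrained` accepts a git commit id via its `revision` keyword.
PHI3_COMMIT = "ff07dc01615f8113924aed013115ab2abd32115b"

load_kwargs = {
    "pretrained_model_name_or_path": "microsoft/Phi-3-mini-4k-instruct",
    "revision": PHI3_COMMIT,  # previous version of the checkpoint
}

# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(**load_kwargs)
```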
## Usage
Inference uses Hugging Face Transformers on NVIDIA GPUs. The following requirements were tested with Python 3.10:
```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```
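As a quick sanity check (a small helper sketch, not part of the repo), the installed versions can be compared against these pins using only the standard library:

```python
# Compare installed package versions against the pins listed above.
import importlib.metadata as md

PINS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.40.2",
    "peft": "0.10.0",
}

for pkg, wanted in PINS.items():
    try:
        installed = md.version(pkg)
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed (tested with {wanted})")
        continue
    status = "ok" if installed == wanted else f"tested with {wanted}"
    print(f"{pkg}: {installed} ({status})")
```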
## Inference & Evaluation
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation instructions.
### Video-MME
|#Frames|32|96|
|:-:|:-:|:-:|
|w/o subs|48.2|49.6|
|w subs|51.7|53.0|
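Reading the table, subtitles add roughly 3.4–3.5 points at both frame counts; a quick check:

```python
# Video-MME scores from the table above (with and without subtitles).
scores = {
    32: {"wo_subs": 48.2, "w_subs": 51.7},
    96: {"wo_subs": 49.6, "w_subs": 53.0},
}

# Per-frame-count gain from adding subtitles.
gains = {n: round(s["w_subs"] - s["wo_subs"], 1) for n, s in scores.items()}
print(gains)  # {32: 3.5, 96: 3.4}
```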
### MVBench: 57.78 (16 frames)
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-4B is trained with the xtuner framework. Thanks for their excellent work!
* [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct): A powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The model is licensed under the MIT license.