---
license: mit
---


## Model Summary

Video-CCAM-9B is a Video-MLLM built on [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).

## Usage

Inference runs on NVIDIA GPUs using Hugging Face Transformers. The following requirements were tested with Python 3.10:
```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```

## Inference & Evaluation

Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation instructions.

### Video-MME

|#Frames|32|96|
|:-:|:-:|:-:|
|w/o subtitles|50.0|50.6|
|w/ subtitles|53.1|54.9|

### MVBench: 60.70 (16 frames)
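The benchmark results above are reported for fixed frame budgets (32 or 96 frames on Video-MME, 16 on MVBench). As a minimal sketch of how such a budget is typically applied (a generic helper written for illustration, not the repository's actual sampling code), uniformly spaced frame indices can be selected like this:

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples evenly spaced frame indices from [0, total_frames).

    If the video has fewer frames than the budget, all frames are kept.
    """
    if total_frames <= num_samples:
        return list(range(total_frames))
    # Place each sampled index at the center of its segment of the video.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Example: a 1000-frame video evaluated with a 32-frame budget.
indices = uniform_frame_indices(1000, 32)
```

The selected indices would then be used to decode only those frames before passing them to the vision encoder.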

## Acknowledgement

* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
* [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat): A great language model developed by [01.AI](https://www.lingyiwanwu.com/).
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.

## License
The model is licensed under the MIT license.