---

license: mit
---


## Model Summary

Video-CCAM-9B-v1.1 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team.

## Usage

Inference runs with Hugging Face transformers on NVIDIA GPUs. The requirements were tested on Python 3.9/3.10.

```
pip install -U pip torch transformers peft decord pysubs2 imageio
```

## Inference

```
import os

import torch
from PIL import Image
from transformers import AutoModel

# load_decord comes from the `eval` module (presumably eval.py in the Video-CCAM repository)
from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)

# One conversation per visual input: the first is about an image, the second about a video
messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

# Visual inputs, listed in the same order as the conversations above
images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)

print(response)
```

Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
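
The inference example samples 32 frames with `sample_type='uniform'`. As a rough illustration of what uniform sampling usually means, the sketch below picks evenly spaced frame indices from a clip. `uniform_frame_indices` is a hypothetical helper written for this README, not the actual `load_decord` implementation, which may choose indices differently.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames.

    Hypothetical helper for illustration; not taken from the Video-CCAM code.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError('total_frames and num_frames must be positive')
    # A clip shorter than the requested count simply contributes every frame.
    if num_frames >= total_frames:
        return list(range(total_frames))
    # Split the clip into equal segments and take the midpoint of each one.
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


# A 320-frame clip sampled at 32 frames yields one index per 10-frame segment.
print(uniform_frame_indices(320, 32)[:4])  # [5, 15, 25, 35]
```

Taking segment midpoints rather than segment starts keeps the samples away from the very first and last frames, which is one common choice among several.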

## Benchmarks

|Benchmark|Video-CCAM-9B|Video-CCAM-9B-v1.1|
|:-:|:-:|:-:|
|MVBench (32 frames)|61.08|64.60|
|MSVD-QA (32 frames)|76.9/4.1|77.9/4.2|
|MSRVTT-QA (32 frames)|58.7/3.5|65.9/3.8|
|ActivityNet-QA (32 frames)|56.2/3.6|58.7/3.8|
|TGIF-QA (32 frames)|83.9/4.4|84.0/4.5|
|Video-MME (w/o sub, 96 frames)|49.4|50.3|
|Video-MME (w sub, 96 frames)|55.2|52.6|
|MLVU (M-Avg, 96 frames)|59.4|58.5|
|MLVU (G-Avg, 96 frames)|3.91|3.98|
|VideoVista (96 frames)|64.39|69.00|

* For MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA, each cell reports accuracy/score, both evaluated by `gpt-3.5-turbo-0125`.
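
Some rows in the table pack two numbers into one cell (for example `76.9/4.1`). To compare the two model versions programmatically, a small sketch like the following can split such cells and compute the per-metric change. `parse_cell` and `delta` are hypothetical helper names, and the values are copied from the table above.

```python
def parse_cell(cell: str) -> tuple[float, ...]:
    """Split a table cell such as '76.9/4.1' or '61.08' into floats."""
    return tuple(float(part) for part in cell.split('/'))


def delta(old: str, new: str) -> tuple[float, ...]:
    """Per-metric change from the old cell to the new cell, rounded to 2 places."""
    return tuple(round(n - o, 2) for o, n in zip(parse_cell(old), parse_cell(new)))


# (Video-CCAM-9B, Video-CCAM-9B-v1.1) cells copied from the benchmark table
rows = {
    'MVBench': ('61.08', '64.60'),
    'MSRVTT-QA': ('58.7/3.5', '65.9/3.8'),
    'Video-MME (w sub)': ('55.2', '52.6'),
}

for name, (old, new) in rows.items():
    print(name, delta(old, new))
```

A negative value, as in the Video-MME (w sub) row, means v1.1 scored lower than the earlier model on that benchmark.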

## Acknowledgement

* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
* [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat): Great language models developed by [01.AI](https://www.lingyiwanwu.com/).
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): Outstanding vision encoder developed by Google.

## License
The model is licensed under the MIT license.