---
license: apache-2.0
---
## Model Summary
Video-CCAM-7B-v1.2 is a lightweight video multimodal large language model (Video-MLLM) developed by the TencentQQ Multimedia Research Team, built upon [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384). Compared with previous versions, it achieves better performance on public benchmarks and supports responses in Chinese.
## Usage
Inference uses Hugging Face Transformers on NVIDIA GPUs. The requirements below have been tested with Python 3.9 and 3.10.
```
pip install -U pip torch transformers accelerate peft decord pysubs2 imageio
# optional: FlashAttention-2 support
pip install flash-attn --no-build-isolation
```
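Before loading the model, it can help to confirm the environment is set up correctly. The following sanity check is a minimal sketch using only standard `torch` APIs; the fallback to `'sdpa'` when `flash_attn` is missing is an assumption, not part of the official instructions.
```
import torch

# Verify a CUDA-capable GPU is visible; the model is loaded onto 'cuda:0' below.
assert torch.cuda.is_available(), 'No CUDA device found'
print('GPU:', torch.cuda.get_device_name(0))

# The model is loaded in bfloat16; most Ampere (or newer) GPUs support it.
print('bfloat16 supported:', torch.cuda.is_bf16_supported())

# flash-attn is optional here; fall back to PyTorch's built-in attention if missing.
try:
    import flash_attn  # noqa: F401
    attn_implementation = 'flash_attention_2'
except ImportError:
    attn_implementation = 'sdpa'
print('Attention implementation:', attn_implementation)
```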
## Inference
```
import os
import torch
from huggingface_hub import snapshot_download
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# load_decord is provided by eval.py, which ships with the Video-CCAM repository
from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# if you have already downloaded this model, just replace the following line with your local path
model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-7B-v1.2')

videoccam = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
    attn_implementation='flash_attention_2'
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            # Chinese prompt: "Please describe this video in detail."
            'content': '<video>\n请仔细描述这个视频。'
        }
    ]
]

images = [
    [Image.open('assets/example_image.jpg').convert('RGB')],
    # sample 32 frames uniformly from the video
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, tokenizer, image_processor, max_new_tokens=512, do_sample=False)
print(response)
```
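The `load_decord` helper above comes from `eval.py` in the Video-CCAM repository. If you prefer not to depend on it, a minimal uniform sampler along the following lines produces the same kind of input (a list of RGB `PIL.Image` frames). This is a sketch assuming `sample_type='uniform'` simply picks `num_frames` evenly spaced frames; the official helper's logic may differ.
```
import numpy as np
from decord import VideoReader
from PIL import Image

def sample_frames_uniform(video_path, num_frames=32):
    """Pick `num_frames` evenly spaced frames and return them as RGB PIL images.

    A minimal stand-in for `load_decord(..., sample_type='uniform')`;
    the exact sampling logic of the official helper may differ.
    """
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8
    return [Image.fromarray(frame) for frame in frames]

# Drop-in replacement for the load_decord call above:
# images[1] = sample_frames_uniform('assets/example_video.mp4', num_frames=32)
```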
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
### Benchmarks
|Benchmark|Video-CCAM-9B|Video-CCAM-9B-v1.1|Video-CCAM-7B-v1.2|
|:-:|:-:|:-:|:-:|
|MVBench (32 frames)|61.08|64.60|69.23|
|Video-MME (w/o sub, 96 frames)|49.4|50.3|53.0|
|Video-MME (w sub, 96 frames)|55.2|52.6|56.1|
|MLVU (M-Avg, 96 frames)|59.4|58.5|61.4|
|VideoVista (96 frames)|64.39|69.00|70.48|
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM is trained using the xtuner framework. Thanks for their excellent work!
* [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct): An excellent language model developed by Alibaba Cloud.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The project is licensed under the Apache 2.0 License and is restricted to uses that comply with the license agreements of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).