| | ---
|
| | license: mit
|
| | ---
|
| |
|
| | ## Model Summary
|
| |
|
| | Video-CCAM-9B-v1.1 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team.
|
| |
|
| | ## Usage
|
| |
|
| | Inference runs with Hugging Face transformers on NVIDIA GPUs. The requirements below were tested on Python 3.9 and 3.10.
|
| | ```
|
| | pip install -U pip torch transformers peft decord pysubs2 imageio
|
| | ```
|
| |
|
| | ## Inference
|
| |
|
| | ```
|
| | import os
|
| | import torch
|
| | from PIL import Image
|
| | from transformers import AutoModel
|
| |
|
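| | # load_decord is imported from a local `eval` module; see the Video-CCAM repository linked below
|
| | from eval import load_decord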
|
| |
|
| | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
|
| |
|
| | videoccam = AutoModel.from_pretrained(
|
| | '<your_local_path_1>',
|
| | trust_remote_code=True,
|
| | torch_dtype=torch.bfloat16,
|
| | device_map='auto',
|
| | _attn_implementation='flash_attention_2',
|
| | # llm_name_or_path='<your_local_llm_path>',
|
| | # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
|
| | )
|
| |
|
| |
|
| | messages = [
|
| | [
|
| | {
|
| | 'role': 'user',
|
| | 'content': '<image>\nDescribe this image in detail.'
|
| | }
|
| | ], [
|
| | {
|
| | 'role': 'user',
|
| | 'content': '<video>\nDescribe this video in detail.'
|
| | }
|
| | ]
|
| | ]
|
| |
|
| | images = [
|
| | Image.open('assets/example_image.jpg').convert('RGB'),
|
| | load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
|
| | ]
|
| |
|
| | response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)
|
| |
|
| | print(response)
|
| | ```
|
| |
|
| | Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
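
The inference call above passes `sample_type='uniform'` and `num_frames=32` to `load_decord`. As a rough illustration of what uniform sampling typically means, the sketch below picks evenly spaced frame indices from a video. The function name `uniform_frame_indices` and the midpoint-of-segment rule are assumptions made for this example; the actual `load_decord` implementation in the Video-CCAM repository may choose indices differently.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Return frame indices spread evenly across a video.

    Hypothetical helper for illustration only; not part of Video-CCAM.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Short videos: keep every frame rather than repeating indices.
    if num_frames >= total_frames:
        return list(range(total_frames))
    # Split the video into num_frames equal segments and take the
    # frame at the midpoint of each segment.
    segment = total_frames / num_frames
    return [int(segment * (i + 0.5)) for i in range(num_frames)]


if __name__ == "__main__":
    # A 320-frame clip sampled down to 32 frames.
    indices = uniform_frame_indices(320, 32)
    print(len(indices), indices[:3], indices[-1])
```

Sampling by index before decoding keeps memory use proportional to `num_frames` rather than to the full video length, which is why the benchmarks below report the frame count alongside each score.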
|
| |
|
| | ## Benchmarks
|
| |
|
| | |Benchmark|Video-CCAM-9B|Video-CCAM-9B-v1.1|
|
| | |:-:|:-:|:-:|
|
| | |MVBench (32 frames)|61.08|64.60|
|
| | |MSVD-QA (32 frames)|76.9/4.1|77.9/4.2|
|
| | |MSRVTT-QA (32 frames)|58.7/3.5|65.9/3.8|
|
| | |ActivityNet-QA (32 frames)|56.2/3.6|58.7/3.8|
|
| | |TGIF-QA (32 frames)|83.9/4.4|84.0/4.5|
|
| | |Video-MME (w/o sub, 96 frames)|49.4|50.3|
|
| | |Video-MME (w sub, 96 frames)|55.2|52.6|
|
| | |MLVU (M-Avg, 96 frames)|59.4|58.5|
|
| | |MLVU (G-Avg, 96 frames)|3.91|3.98|
|
| | |VideoVista (96 frames)|64.39|69.00|
|
| |
|
| | * The accuracy/score pairs for MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA are evaluated by `gpt-3.5-turbo-0125`.
|
| |
|
| | ## Acknowledgement
|
| |
|
| | * [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
|
| | * [Yi-1.5-9B-Chat](https://huggingface.co/01-ai/Yi-1.5-9B-Chat): Great language model developed by [01.AI](https://www.lingyiwanwu.com/).
|
| | * [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): Outstanding vision encoder developed by Google.
|
| |
|
| | ## License
|
| | The model is licensed under the MIT license.
|
| |
|