---
license: apache-2.0
---

## Model Summary

Video-CCAM-4B-v1.2 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team, built upon [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384). Compared to previous versions, it achieves better performance on public benchmarks and supports Chinese responses.

## Usage

Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.9/3.10:

```
pip install -U pip torch transformers accelerate peft decord pysubs2 imageio
# flash attention support
pip install flash-attn --no-build-isolation
```

## Inference

```
import os

import torch
from huggingface_hub import snapshot_download
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# if you have downloaded this model, just replace the following line with your local path
model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-4B-v1.2')

videoccam = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='cuda:0',
    attn_implementation='flash_attention_2'
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ],
    [
        {
            'role': 'user',
            'content': '