	---
	license: apache-2.0
	---

	## Model Summary

	Video-CCAM-4B-v1.2 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team, built upon [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384). Compared to previous versions, it performs better on public benchmarks and supports responses in Chinese.

	## Usage

	Inference uses Hugging Face transformers on NVIDIA GPUs. The requirements below were tested on Python 3.9 and 3.10.
	```
	pip install -U pip torch transformers accelerate peft decord pysubs2 imageio
	# flash attention support
	pip install flash-attn --no-build-isolation
	```
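Since flash attention is an optional extra install, it can help to check whether the package is present before requesting it at load time. The helper below is a minimal sketch, not part of Video-CCAM: the function name `pick_attn_implementation` is hypothetical, and falling back to `'eager'` assumes the model's remote code accepts that value, which this model card does not confirm.

```python
import importlib.util


def pick_attn_implementation(module_name: str = "flash_attn") -> str:
    """Return 'flash_attention_2' if `module_name` is importable, else 'eager'.

    Hypothetical helper: the fallback value 'eager' is an assumption and has
    not been verified against this model's remote code.
    """
    # find_spec returns None when a top-level module cannot be located
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "eager"


print(pick_attn_implementation())
```

The returned string could then be passed as `attn_implementation` to `AutoModel.from_pretrained`.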

	## Inference

	```
	import os

	import torch
	from huggingface_hub import snapshot_download
	from PIL import Image
	from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

	# load_decord is imported from eval.py, which is expected to come from the
	# Video-CCAM GitHub repository; run this script from a checkout of that repo
	from eval import load_decord

	os.environ['TOKENIZERS_PARALLELISM'] = 'false'

	# if you have already downloaded this model, replace the following line with your local path
	model_path = snapshot_download(repo_id='JaronTHU/Video-CCAM-4B-v1.2')

	# trust_remote_code=True is required because the model class is defined in the model repo
	videoccam = AutoModel.from_pretrained(
	    model_path,
	    trust_remote_code=True,
	    torch_dtype=torch.bfloat16,
	    device_map='cuda:0',
	    attn_implementation='flash_attention_2'
	)

	tokenizer = AutoTokenizer.from_pretrained(model_path)

	image_processor = AutoImageProcessor.from_pretrained(model_path)

	# one conversation per sample: the first is about an image, the second about a video
	messages = [
	    [
	        {
	            'role': 'user',
	            'content': '<image>\nDescribe this image in detail.'
	        }
	    ], [
	        {
	            'role': 'user',
	            # Chinese prompt: "Please describe this video in detail."
	            'content': '<video>\n请仔细描述这个视频。'
	        }
	    ]
	]

	# visual inputs in the same order as messages: a single-image list, then 32 sampled video frames
	images = [
	    [Image.open('assets/example_image.jpg').convert('RGB')],
	    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
	]

	response = videoccam.chat(messages, images, tokenizer, image_processor, max_new_tokens=512, do_sample=False)

	print(response)
	```

	Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
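The inference example calls `load_decord` with `sample_type='uniform'` and `num_frames=32`, but the helper's internals are not shown in this model card. As a rough illustration of what uniform sampling usually means, the sketch below picks the middle frame of each of `num_frames` equal-length segments. The function name `uniform_frame_indices` and the exact index formula are assumptions for illustration and may differ from the actual Video-CCAM implementation.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Pick `num_frames` evenly spaced frame indices from a video.

    Illustrative sketch only: the video is split into `num_frames` equal
    segments and the middle frame of each segment is selected.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # A short video has fewer frames than requested: keep every frame
    if total_frames <= num_frames:
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int(segment * i + segment / 2) for i in range(num_frames)]


print(uniform_frame_indices(8, 4))  # [1, 3, 5, 7]
```

With a video decoder such as decord, indices like these would typically be used to fetch only the selected frames rather than decoding the whole video.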

	## Benchmarks

	|Benchmark|Video-CCAM-4B|Video-CCAM-4B-v1.1|Video-CCAM-4B-v1.2|
	|:-:|:-:|:-:|:-:|
	|MVBench (32 frames)|57.43|62.80|66.28|
	|Video-MME (w/o sub, 96 frames)|49.7|50.1|51.5|
	|Video-MME (w sub, 96 frames)|52.8|51.2|54.5|
	|MLVU (M-Avg, 96 frames)|57.3|56.5|61.0|
	|VideoVista (96 frames)|68.09|70.82|73.44|
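To make the version-to-version change easier to read, the gain of v1.2 over v1.1 can be computed directly from the table. The snippet below is a small convenience sketch: the scores are copied from the table above, and the function name `gains_over_previous` is hypothetical.

```python
# Scores copied from the benchmark table: (Video-CCAM-4B, v1.1, v1.2)
scores = {
    "MVBench (32 frames)": (57.43, 62.80, 66.28),
    "Video-MME (w/o sub, 96 frames)": (49.7, 50.1, 51.5),
    "Video-MME (w sub, 96 frames)": (52.8, 51.2, 54.5),
    "MLVU (M-Avg, 96 frames)": (57.3, 56.5, 61.0),
    "VideoVista (96 frames)": (68.09, 70.82, 73.44),
}


def gains_over_previous(table: dict) -> dict:
    """Return the v1.2 minus v1.1 score difference for each benchmark."""
    return {name: round(v[2] - v[1], 2) for name, v in table.items()}


gains = gains_over_previous(scores)
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

On these numbers, v1.2 improves over v1.1 on every listed benchmark, with gains ranging from 1.40 points (Video-MME without subtitles) to 4.50 points (MLVU).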

	## Acknowledgement

	* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM is trained using the xtuner framework. Thanks for their excellent work!
	* [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct): A powerful language model developed by Microsoft.
	* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.

	## License

	The project is licensed under the Apache 2.0 License and is restricted to uses that comply with the license agreements of [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384).