---
license: mit
---
## Model Summary
Video-CCAM-4B is a lightweight Video-MLLM built on [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) and [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384). **Note**: [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) here refers to the previous version of the model; check out git commit `ff07dc01615f8113924aed013115ab2abd32115b` to obtain that checkpoint.
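If it helps, one way to pin that older Phi-3 checkpoint is to pass the commit id as the `revision` argument of Transformers' standard `from_pretrained` API. A minimal sketch (the download call itself is commented out so nothing is fetched here):

```python
# Sketch: pin microsoft/Phi-3-mini-4k-instruct to the commit noted above.
# `from_pretrained` accepts a git commit id via its `revision` keyword.
PHI3_COMMIT = "ff07dc01615f8113924aed013115ab2abd32115b"

load_kwargs = {
    "pretrained_model_name_or_path": "microsoft/Phi-3-mini-4k-instruct",
    "revision": PHI3_COMMIT,  # previous version of the checkpoint
}

# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(**load_kwargs)
```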
## Usage
Inference uses Hugging Face Transformers on NVIDIA GPUs. The following requirements were tested with Python 3.10:
```
torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
```
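As a quick sanity check (a small helper sketch, not part of the repo), the installed versions can be compared against these pins using only the standard library:

```python
# Compare installed package versions against the pins listed above.
import importlib.metadata as md

PINS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.40.2",
    "peft": "0.10.0",
}

for pkg, wanted in PINS.items():
    try:
        installed = md.version(pkg)
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed (tested with {wanted})")
        continue
    status = "ok" if installed == wanted else f"tested with {wanted}"
    print(f"{pkg}: {installed} ({status})")
```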
## Inference & Evaluation
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for inference and evaluation instructions.
### Video-MME
|#Frames|32|96|
|:-:|:-:|:-:|
|w/o subs|48.2|49.6|
|w subs|51.7|53.0|
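Reading the table, subtitles add roughly 3.4–3.5 points at both frame counts; a quick check:

```python
# Video-MME scores from the table above (with and without subtitles).
scores = {
    32: {"wo_subs": 48.2, "w_subs": 51.7},
    96: {"wo_subs": 49.6, "w_subs": 53.0},
}

# Per-frame-count gain from adding subtitles.
gains = {n: round(s["w_subs"] - s["wo_subs"], 1) for n, s in scores.items()}
print(gains)  # {32: 3.5, 96: 3.4}
```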
### MVBench: 57.78 (16 frames)
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-4B is trained with the xtuner framework. Thanks for their excellent work!
* [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct): A powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The model is licensed under the MIT license.