---
license: mit
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
pipeline_tag: visual-question-answering
---
# Model Card for llava-phi-2-3b
This is a multimodal implementation of the [Phi2](https://huggingface.co/microsoft/phi-2) model, inspired by [LLaVA-Phi](https://github.com/zhuyiche/llava-phi).
## Model Details
1. LLM Backbone: [Phi2](https://huggingface.co/microsoft/phi-2)
2. Vision Tower: [clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
3. Pretraining Dataset: [LAION-CC-SBU dataset with BLIP captions (200K samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
4. Finetuning Dataset: [Instruct 150K dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
5. Finetuned Model: [marianna13/llava-phi-2-3b](https://huggingface.co/marianna13/llava-phi-2-3b)
### Model Sources
- **Original Repository:** [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
- **Paper:** [LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model](https://arxiv.org/pdf/2401.02330)
- **Demo:** [MultiModal-Phi2 on Hugging Face Spaces](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)
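## Usage
A minimal usage sketch is below. The prompt template and the `transformers` loading path (`AutoProcessor` / `LlavaForConditionalGeneration`) are assumptions not confirmed by this card; the original LLaVA-Phi repository's loading code may be required for this checkpoint.

```python
# Hypothetical usage sketch for marianna13/llava-phi-2-3b.
# The LLaVA-style prompt template and the transformers classes used
# here are assumptions; see the original llava-phi repository for
# the canonical loading and inference code.

def build_prompt(question: str) -> str:
    """Compose a single-turn LLaVA-style prompt with an image placeholder."""
    system = (
        "A chat between a curious user and an artificial intelligence assistant. "
        "The assistant gives helpful, detailed answers to the user's questions."
    )
    return f"{system} USER: <image>\n{question} ASSISTANT:"


def answer(image_path: str, question: str,
           model_id: str = "marianna13/llava-phi-2-3b") -> str:
    """Run one visual-question-answering turn.

    Requires torch, transformers, and Pillow; assumes the checkpoint is
    compatible with LlavaForConditionalGeneration.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    image = Image.open(image_path)
    inputs = processor(
        text=build_prompt(question), images=image, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

For example, `answer("photo.jpg", "What objects are on the table?")` would return the decoded prompt plus the model's continuation; the response text follows the final `ASSISTANT:` marker.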