---
library_name: pytorch
---
# LLaVA-OneVision-Qwen2-7B

LLaVA-OneVision is a multimodal vision-language model that integrates a pretrained Qwen-2 language model with a visual encoder, enabling instruction-tuned understanding and reasoning across text and images.

Original paper: [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)

This model uses LLaVA-OneVision with Qwen-2 as the language backbone, allowing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information.
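For quick experimentation on a GPU host (separate from the Cooper deployment package below), a minimal inference sketch using the Hugging Face `transformers` LLaVA-OneVision integration is shown here. The checkpoint id `llava-hf/llava-onevision-qwen2-7b-ov-hf` (a transformers-format conversion of the original weights) and the example image URL are assumptions, not part of this release.

```python
# Minimal inference sketch with the transformers LLaVA-OneVision integration.
# Assumption: a transformers-format conversion of the original weights, e.g.
# "llava-hf/llava-onevision-qwen2-7b-ov-hf". The Ambarella .tar package in
# this repo is a compiled device binary and is NOT loaded this way.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a chat-style prompt with one image placeholder and one question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Any RGB image works; this URL is only an example.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```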
Model Configuration:
- Reference implementation: [LLaVA_OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT)
- Original Weights: [llava-onevision-qwen2-7b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov-chat)
- Vision Encoder: SigLIP SO400M
- Language Model: Qwen2-7B
- Input Resolution: 3x384x384 (C x H x W)
- Supported Cooper versions:
  - Cooper SDK: 2.5.2
  - Cooper Foundry: 2.2
| Model | Device | Model Link |
| :-----: | :-----: | :-----: |
| LLaVA-OneVision | N1-655 | [Model_Link](https://huggingface.co/Ambarella/LLaVA-OneVision/blob/main/n1-655_llava_onevision_7B_1NVP.tar) |
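To fetch the precompiled N1-655 package listed above programmatically, a short sketch using `huggingface_hub` is given below. The local extraction directory is an arbitrary choice for this example; deploying the unpacked artifacts onto the device follows the Cooper SDK documentation rather than this snippet.

```python
# Download the precompiled N1-655 package from the Hugging Face Hub and unpack it.
import tarfile
from huggingface_hub import hf_hub_download

archive_path = hf_hub_download(
    repo_id="Ambarella/LLaVA-OneVision",
    filename="n1-655_llava_onevision_7B_1NVP.tar",
)

with tarfile.open(archive_path) as tar:
    tar.extractall(path="./llava_onevision_n1-655")  # assumed local directory
```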