---
library_name: pytorch
---
# LLaVA-OneVision-Qwen2-7B

LLaVA-OneVision is a multimodal vision-language model that integrates a pretrained Qwen-2 language model with a visual encoder, enabling instruction-tuned understanding and reasoning across text and images.

Original paper: [LLaVA-OneVision: Easy Visual Task Transfer](https://arxiv.org/abs/2408.03326)

This model uses LLaVA-OneVision with Qwen-2 as the language backbone, allowing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information.
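For quick experimentation on a GPU host (separate from the Cooper deployment package below), a minimal inference sketch using the Hugging Face `transformers` LLaVA-OneVision integration is shown here. The checkpoint id `llava-hf/llava-onevision-qwen2-7b-ov-hf` (a transformers-format conversion of the original weights) and the example image URL are assumptions, not part of this release.

```python
# Minimal inference sketch with the transformers LLaVA-OneVision integration.
# Assumption: a transformers-format conversion of the original weights, e.g.
# "llava-hf/llava-onevision-qwen2-7b-ov-hf". The Ambarella .tar package in
# this repo is a compiled device binary and is NOT loaded this way.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build a chat-style prompt with one image placeholder and one question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Any RGB image works; this URL is only an example.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```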
Model Configuration:
- Reference implementation: [LLaVA_OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT)
- Original Weights: [llava-onevision-qwen2-7b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov-chat)
- Vision Encoder: SigLIP SO400M
- Language Model: Qwen2-7B
- Input Resolution: 3x384x384 (C x H x W)
- Supported Cooper versions:
  - Cooper SDK: 2.5.2
  - Cooper Foundry: 2.2
| Model | Device | Model Link |
| :-----: | :-----: | :-----: |
| LLaVA-OneVision | N1-655 | [Model_Link](https://huggingface.co/Ambarella/LLaVA-OneVision/blob/main/n1-655_llava_onevision_7B_1NVP.tar) |
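To fetch the precompiled N1-655 package listed above programmatically, a short sketch using `huggingface_hub` is given below. The local extraction directory is an arbitrary choice for this example; deploying the unpacked artifacts onto the device follows the Cooper SDK documentation rather than this snippet.

```python
# Download the precompiled N1-655 package from the Hugging Face Hub and unpack it.
import tarfile
from huggingface_hub import hf_hub_download

archive_path = hf_hub_download(
    repo_id="Ambarella/LLaVA-OneVision",
    filename="n1-655_llava_onevision_7B_1NVP.tar",
)

with tarfile.open(archive_path) as tar:
    tar.extractall(path="./llava_onevision_n1-655")  # assumed local directory
```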