LLaVA-OneVision is a multimodal vision-language model that integrates a pretrained Qwen2 language model with a SigLIP vision encoder, enabling instruction-tuned understanding and reasoning across text and images.
Original paper: *LLaVA-OneVision: Easy Visual Task Transfer*
LLaVA-OneVision-Qwen2-7B
This model uses LLaVA-OneVision with Qwen2 as the language backbone, providing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information.
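For orientation, below is a minimal inference sketch using the Hugging Face `transformers` port of this model; the checkpoint id `llava-hf/llava-onevision-qwen2-7b-ov-chat-hf`, the sample image URL, and the generation settings are illustrative assumptions, not part of this card.

```python
# Minimal image-grounded QA sketch (assumes transformers >= 4.45 and the
# community checkpoint "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf").
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Sample image (standard COCO test image used in the transformers docs).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt carrying one image and one question.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The chat template inserts the image placeholder tokens the Qwen2 backbone expects, so the same pattern extends directly to multi-turn multimodal dialogue.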
Model Configuration:
- Reference implementation: LLaVA_OneVision
- Original weights: llava-onevision-qwen2-7b-ov-chat
- Vision Encoder: SigLIP SO400M
- Language Model: Qwen2-7B
- Resolution: 3×384×384 (C×H×W per image tile; see the shape check after this list)
- Supported Cooper versions:
  - Cooper SDK: 2.5.2
  - Cooper Foundry: 2.2
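As a quick sanity check on the resolution entry, the hedged snippet below (same assumed `transformers` checkpoint as above) shows that the processor tiles an arbitrarily sized input image into 3×384×384 patches for the SigLIP SO400M encoder.

```python
# Shape check: each AnyRes tile fed to the vision encoder is 3x384x384.
# Assumes the same community checkpoint as the sketch above.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-chat-hf")
image = Image.new("RGB", (1024, 768))  # dummy image of arbitrary size
inputs = processor(images=image, text="<image>", return_tensors="pt")

# pixel_values is (batch, num_tiles, channels, height, width) under AnyRes tiling.
print(inputs["pixel_values"].shape[-3:])  # expected: torch.Size([3, 384, 384])
```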
| Model | Device | Model Link |
|---|---|---|
| LLaVA-OneVision | N1-655 | Model_Link |