llava_onevison_logo

LLaVA-OneVision is a multimodal vision-language model that integrates a pretrained Qwen-2 language model with a visual encoder, enabling instruction-tuned understanding and reasoning across text and images.

Original paper: LLaVA-OneVision: Easy Visual Task Transfer

LLaVA-OneVision-Qwen2-7B

This model uses LLaVA-OneVision with Qwen-2 as the language backbone, allowing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information.

Model Configuration:

Model Device Model Link
LLaVA-OneVision N1-655 Model_Link
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support