LLaVA-OneVision is a multimodal vision-language model that integrates a pretrained Qwen2 language model with a SigLIP vision encoder, enabling instruction-tuned understanding and reasoning across text and images.
Original paper: *LLaVA-OneVision: Easy Visual Task Transfer*
LLaVA-OneVision-Qwen2-7B
This model uses LLaVA-OneVision with Qwen2 as the language backbone, providing rich multimodal reasoning and generation capabilities. It is well suited for applications such as image-grounded question answering, multimodal dialogue, and tasks requiring aligned understanding of visual and textual information.
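For orientation, below is a minimal inference sketch using the Hugging Face `transformers` port of this model; the checkpoint id `llava-hf/llava-onevision-qwen2-7b-ov-chat-hf`, the sample image URL, and the generation settings are illustrative assumptions, not part of this card.

```python
# Minimal image-grounded QA sketch (assumes transformers >= 4.45 and the
# community checkpoint "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf").
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-chat-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Sample image (standard COCO test image used in the transformers docs).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-formatted prompt carrying one image and one question.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The chat template inserts the image placeholder tokens the Qwen2 backbone expects, so the same pattern extends directly to multi-turn multimodal dialogue.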
Model Configuration:
- Reference implementation: LLaVA_OneVision
- Original weights: llava-onevision-qwen2-7b-ov-chat
- Vision Encoder: SigLIP SO400M
- Language Model: Qwen2-7B
- Resolution: 3×384×384 (C×H×W per image tile; see the shape check after this list)
- Supported Cooper versions:
  - Cooper SDK: 2.5.2
  - Cooper Foundry: 2.2
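As a quick sanity check on the resolution entry, the hedged snippet below (same assumed `transformers` checkpoint as above) shows that the processor tiles an arbitrarily sized input image into 3×384×384 patches for the SigLIP SO400M encoder.

```python
# Shape check: each AnyRes tile fed to the vision encoder is 3x384x384.
# Assumes the same community checkpoint as the sketch above.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-chat-hf")
image = Image.new("RGB", (1024, 768))  # dummy image of arbitrary size
inputs = processor(images=image, text="<image>", return_tensors="pt")

# pixel_values is (batch, num_tiles, channels, height, width) under AnyRes tiling.
print(inputs["pixel_values"].shape[-3:])  # expected: torch.Size([3, 384, 384])
```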
| Model | Device | Model Link |
|---|---|---|
| LLaVA-OneVision | N1-655 | Model_Link |