---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---

# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2
Open-source UniMVU release checkpoints for instruction-aware multimodal video understanding. This release covers audio-video QA, 3D QA, and unified multi-task adapters built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.

Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin` for the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow.
## Release Contents
| Folder | Scale | Type | Task(s) | Base model |
|---|---|---|---|---|
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_7B_avqa` | 7B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` |
The default upload manifest publishes only the final release files:
- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`

Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.
## Requirements
Use these checkpoints with the open-source UniMVU GitHub repository and install the dependencies from that repo:
```bash
git clone <UniMVU GitHub repo>
cd UniMVU
pip install -r requirements.txt
pip install huggingface_hub peft
```
Download the checkpoint folder you need from this repository, then point the UniMVU evaluation scripts to it with `--model-path`.
## Usage
These checkpoints are intended to be used together with the UniMVU GitHub repository.
- Clone the UniMVU repository and install its dependencies.
- Download the checkpoint subfolder you want from this Hugging Face repo.
- Set the downloaded folder as `--model-path` in the UniMVU evaluation scripts.
- Run the appropriate UniMVU evaluation entry point for your task.
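The download step above can be sketched with `huggingface_hub`'s `snapshot_download`, restricting the fetch to a single checkpoint subfolder via `allow_patterns`. The repo id below is a placeholder for this Hugging Face repository; the helper and its defaults are illustrative, not part of the UniMVU codebase.

```python
def adapter_patterns(folder: str) -> list[str]:
    """Glob patterns that select exactly one checkpoint subfolder."""
    return [f"{folder}/*"]

def download_adapter(repo_id: str, folder: str, local_dir: str = "checkpoints") -> str:
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # Fetch only the requested adapter folder, e.g. "unimvu_7B_avqa".
    return snapshot_download(repo_id=repo_id,
                             allow_patterns=adapter_patterns(folder),
                             local_dir=local_dir)
```

After the download, `<local_dir>/<folder>` is the path to pass as `--model-path`.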
## Loader Mapping
| Release family | `model_type` | `model_arg_name` | `model_base` |
|---|---|---|---|
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
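Since the release folders follow a fixed naming scheme, the table above can be resolved programmatically from a folder name. This is a minimal sketch (the lookup helper is not part of the UniMVU codebase) that derives `(model_type, model_arg_name, model_base)` for any release folder:

```python
# Loader settings keyed by (unified?, scale), mirroring the mapping table.
LOADER_MAP = {
    (False, "0.5B"): ("unimvu", "VideoFeatModelArgumentsUniMVU",
                      "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    (False, "7B"):   ("unimvu", "VideoFeatModelArgumentsUniMVU_7B",
                      "lmms-lab/llava-onevision-qwen2-7b-ov"),
    (True, "0.5B"):  ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni",
                      "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    (True, "7B"):    ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni_7B",
                      "lmms-lab/llava-onevision-qwen2-7b-ov"),
}

def loader_settings(folder: str) -> tuple[str, str, str]:
    """Derive (model_type, model_arg_name, model_base) from a release folder name."""
    unified = folder.startswith("unimvu_uni")
    scale = "0.5B" if "0.5B" in folder else "7B"
    return LOADER_MAP[(unified, scale)]
```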
## Evaluation Entry Points
- Use `scripts/*_eval_*.sh` and `unified_eval.py` in the UniMVU repository for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` in the UniMVU repository for MVBench-style evaluation.
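Putting the pieces together, an invocation might be assembled as below. Only `--model-path` is documented in this card; the `--model-base` flag is an assumption mirroring the `model_base` column of the loader mapping, so check the UniMVU repository's scripts for the actual argument names.

```python
import shlex

def eval_command(script: str, model_path: str, model_base: str) -> list[str]:
    # --model-path is documented above; --model-base is an assumed flag
    # corresponding to the model_base column of the loader mapping table.
    return ["python", script,
            "--model-path", model_path,
            "--model-base", model_base]

cmd = eval_command("unified_eval.py",
                   "checkpoints/unimvu_7B_avqa",
                   "lmms-lab/llava-onevision-qwen2-7b-ov")
print(shlex.join(cmd))
```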
## License
The released adapters depend on third-party base models and should be used in compliance with the licenses of:
- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`
Please also follow the usage terms of the downstream datasets and features used in evaluation.
## Citation
If you use UniMVU in your work, please cite:
```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
## Acknowledgements
UniMVU builds on the open-source ecosystem around PAVE, Qwen2, LLaVA-OneVision, LMMS-Eval, PEFT, and Transformers.