| --- |
| license: other |
| library_name: peft |
| base_model: |
| - lmms-lab/llava-onevision-qwen2-0.5b-ov |
| - lmms-lab/llava-onevision-qwen2-7b-ov |
| pipeline_tag: image-text-to-text |
| tags: |
| - multimodal |
| - video |
| - audio |
| - 3d |
| - peft |
| - lora |
| - safetensors |
| - llava-onevision |
| - qwen2 |
| language: |
| - en |
| --- |
| |
| # UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2 |
|
|
Open-source release of UniMVU checkpoints for instruction-aware multimodal video understanding. The release covers audio-video QA, 3D QA, and unified multi-task adapters built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.
|
|
| Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin` for the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow. |
|
|
| [arXiv](#) |
|
|
| ## Release Contents |
|
|
| | Folder | Scale | Type | Task(s) | Base model | |
| | --- | --- | --- | --- | --- | |
| | `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_7B_avqa` | 7B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
|
|
| The default upload manifest publishes only the final release files: |
|
|
| - `adapter_config.json` |
| - `adapter_model.safetensors` |
| - `config.json` |
| - `non_lora_trainables.bin` |
|
|
| Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload. |
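
After downloading a checkpoint folder, it can help to confirm that all four release files listed above are present before handing the path to the loader. The sketch below is illustrative (the helper name is ours, not part of the UniMVU codebase); the file list is transcribed from this card:

```python
from pathlib import Path

# Files expected in each released checkpoint folder, per the list above.
RELEASE_FILES = (
    "adapter_config.json",
    "adapter_model.safetensors",
    "config.json",
    "non_lora_trainables.bin",
)

def missing_release_files(checkpoint_dir):
    """Return the names of expected release files absent from the folder."""
    root = Path(checkpoint_dir)
    return [name for name in RELEASE_FILES if not (root / name).is_file()]
```

A non-empty return value means the folder is incomplete (for example, a partial download or a training snapshot rather than a final release).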
|
|
| ## Requirements |
|
|
| Use these checkpoints with the open-source [UniMVU GitHub repository](#) and install the dependencies from that repo: |
|
|
| ```bash |
| git clone <UniMVU GitHub repo> |
| cd UniMVU |
| pip install -r requirements.txt |
| pip install huggingface_hub peft |
| ``` |
|
|
| Download the checkpoint folder you need from this repository, then point the UniMVU evaluation scripts to it with `--model-path`. |
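
One way to fetch a single checkpoint subfolder is `huggingface_hub.snapshot_download` with an `allow_patterns` filter, so only that folder's files are transferred. This is a minimal sketch; the repo id is a placeholder for this model repo, and the helper names are ours:

```python
def subfolder_patterns(folder):
    """Glob patterns restricting a snapshot download to one checkpoint folder."""
    return [f"{folder}/*"]

def download_checkpoint(repo_id, folder):
    """Fetch one checkpoint subfolder; returns the local snapshot root.

    Requires `huggingface_hub` (installed in the step above).
    """
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id, allow_patterns=subfolder_patterns(folder))

# Usage (substitute this repo's actual id for the placeholder):
# root = download_checkpoint("<this-repo-id>", "unimvu_0.5B_avqa")
# then pass f"{root}/unimvu_0.5B_avqa" to the UniMVU scripts via --model-path
```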
|
|
| ## Usage |
|
|
| These checkpoints are intended to be used together with the [UniMVU GitHub repository](#). |
|
|
| 1. Clone the UniMVU repository and install its dependencies. |
| 2. Download the checkpoint subfolder you want from this Hugging Face repo. |
| 3. Set the downloaded folder as `--model-path` in the UniMVU evaluation scripts. |
| 4. Run the appropriate UniMVU evaluation entry point for your task. |
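
Steps 3 and 4 above can be sketched as a command-line invocation. `unified_eval.py` and `--model-path` come from this card; any further flags depend on the UniMVU repository and are passed through `extra_args` here:

```python
import shlex

def build_eval_command(model_path, extra_args=None):
    """Compose the evaluation invocation for a downloaded checkpoint folder."""
    argv = ["python", "unified_eval.py", "--model-path", model_path]
    argv += extra_args or []
    return shlex.join(argv)

# e.g. build_eval_command("ckpts/unimvu_7B_avqa")
# -> "python unified_eval.py --model-path ckpts/unimvu_7B_avqa"
```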
|
|
| ## Loader Mapping |
|
|
| | Release family | `model_type` | `model_arg_name` | `model_base` | |
| | --- | --- | --- | --- | |
| | Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
| | Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | |
| | Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` | |
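
The mapping above can be expressed as a lookup table. The `model_type`, argument-class, and base-model names are transcribed from the table; the folder-name heuristic in `loader_settings` is our assumption based on the release folder naming scheme:

```python
# (model_type, model_arg_name, model_base) per release family, from the table above.
LOADER_MAP = {
    "single_task_0.5B": ("unimvu", "VideoFeatModelArgumentsUniMVU",
                         "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    "single_task_7B": ("unimvu", "VideoFeatModelArgumentsUniMVU_7B",
                       "lmms-lab/llava-onevision-qwen2-7b-ov"),
    "unified_0.5B": ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni",
                     "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    "unified_7B": ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni_7B",
                   "lmms-lab/llava-onevision-qwen2-7b-ov"),
}

def loader_settings(folder):
    """Infer (model_type, model_arg_name, model_base) from a release folder name."""
    unified = folder.startswith("unimvu_uni")
    scale = "7B" if "7B" in folder else "0.5B"
    return LOADER_MAP[f"{'unified' if unified else 'single_task'}_{scale}"]
```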
|
|
| ## Evaluation Entry Points |
|
|
| - Use `scripts/*_eval_*.sh` and `unified_eval.py` in the UniMVU repository for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D. |
| - Use `lmms_eval_start.py` in the UniMVU repository for MVBench-style evaluation. |
|
|
| ## License |
|
|
| The released adapters depend on third-party base models and should be used in compliance with the licenses of: |
|
|
| - `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| - `lmms-lab/llava-onevision-qwen2-7b-ov` |
|
|
Please also follow the usage terms of the downstream datasets and pre-extracted features used in evaluation.
|
|
| ## Citation |
|
|
| If you use UniMVU in your work, please cite: |
|
|
| ```bibtex |
| @inproceedings{ding2026unimvu, |
| title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos}, |
| author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz}, |
| booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, |
| year={2026} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| UniMVU builds on the open-source ecosystem around PAVE, Qwen2, LLaVA-OneVision, LMMS-Eval, PEFT, and Transformers. |
|
|