---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---
# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2
Open-source UniMVU release checkpoints for instruction-aware multimodal video understanding. This release provides single-task adapters for audio-video QA and 3D QA, as well as unified multi-task adapters, all built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.
Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin` for the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow.
[arXiv](#)
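As a rough sketch of what the UniMVU loader has to do beyond plain PEFT, assuming a LLaVA-style `load_pretrained_model` entry point (the import path `unimvu.model.builder` and the exact signature are assumptions; check the GitHub repo for the real one): the adapter is loaded against the base model, then the gating weights in `non_lora_trainables.bin` are applied on top with `strict=False`.

```python
def strip_lora_prefixes(state_dict):
    """Normalize keys saved as 'base_model.model.<name>' back to '<name>'."""
    prefix = "base_model.model."
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


def load_unimvu(model_path: str, model_base: str):
    """Assumed loading flow; the real entry point lives in the UniMVU repo."""
    import torch
    from unimvu.model.builder import load_pretrained_model  # assumed import path

    tokenizer, model, processor, ctx_len = load_pretrained_model(
        model_path=model_path, model_base=model_base, model_name="unimvu"
    )
    # The modality-gating modules are not part of the LoRA adapter itself:
    extra = torch.load(f"{model_path}/non_lora_trainables.bin", map_location="cpu")
    model.load_state_dict(strip_lora_prefixes(extra), strict=False)
    return tokenizer, model, processor
```

The `strict=False` load is the important detail: the extra state dict covers only the gating modules, not the full model.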
## Release Contents
| Folder | Scale | Type | Task(s) | Base model |
| --- | --- | --- | --- | --- |
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_7B_avqa` | 7B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` |
The default upload manifest publishes only the final release files:
- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`
Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.
## Requirements
Use these checkpoints with the open-source [UniMVU GitHub repository](#) and install the dependencies from that repo:
```bash
git clone <UniMVU GitHub repo>
cd UniMVU
pip install -r requirements.txt
pip install huggingface_hub peft
```
Download the checkpoint folder you need from this repository, then point the UniMVU evaluation scripts to it with `--model-path`.
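One way to fetch a single checkpoint folder without cloning the whole repository is `huggingface_hub.snapshot_download` with `allow_patterns`; the repo id below is a placeholder for this repository's id.

```python
def release_patterns(folder: str) -> list[str]:
    """Glob patterns for the final release files of one checkpoint folder."""
    files = (
        "adapter_config.json",
        "adapter_model.safetensors",
        "config.json",
        "non_lora_trainables.bin",
    )
    return [f"{folder}/{name}" for name in files]


def fetch_checkpoint(folder: str, repo_id: str, local_dir: str = "./checkpoints") -> str:
    """Download one checkpoint subfolder; returns the local snapshot path."""
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,  # placeholder: substitute this repository's id
        allow_patterns=release_patterns(folder),
        local_dir=local_dir,
    )
```

The resulting local folder (e.g. `./checkpoints/unimvu_7B_avqa`) is what you pass to `--model-path`.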
## Usage
These checkpoints are intended to be used together with the [UniMVU GitHub repository](#).
1. Clone the UniMVU repository and install its dependencies.
2. Download the checkpoint subfolder you want from this Hugging Face repo.
3. Set the downloaded folder as `--model-path` in the UniMVU evaluation scripts.
4. Run the appropriate UniMVU evaluation entry point for your task.
## Loader Mapping
| Release family | `model_type` | `model_arg_name` | `model_base` |
| --- | --- | --- | --- |
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
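The mapping above is mechanical, so the loader settings can be derived from a checkpoint folder name. The helper below is an illustrative convenience, not part of the UniMVU codebase:

```python
def loader_settings(folder: str) -> tuple[str, str, str]:
    """Map a checkpoint folder name to (model_type, model_arg_name, model_base)."""
    unified = folder.startswith("unimvu_uni")
    large = "7B" in folder  # "0.5B" folder names never contain the substring "7B"
    arg = "VideoFeatModelArgumentsUniMVU"
    if unified:
        arg += "_Uni"
    if large:
        arg += "_7B"
    base = (
        "lmms-lab/llava-onevision-qwen2-7b-ov"
        if large
        else "lmms-lab/llava-onevision-qwen2-0.5b-ov"
    )
    return ("unimvu_uni" if unified else "unimvu", arg, base)
```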
## Evaluation Entry Points
- Use `scripts/*_eval_*.sh` and `unified_eval.py` in the UniMVU repository for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` in the UniMVU repository for MVBench-style evaluation.
## License
The released adapters depend on third-party base models and should be used in compliance with the licenses of:
- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`
Please also follow the usage terms of the downstream datasets and features used in evaluation.
## Citation
If you use UniMVU in your work, please cite:
```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
## Acknowledgements
UniMVU builds on the open-source ecosystem around PAVE, Qwen2, LLaVA-OneVision, LMMS-Eval, PEFT, and Transformers.