---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---

# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2
Open-source UniMVU release checkpoints for instruction-aware multimodal video understanding. This release covers audio-video QA, 3D QA, and unified multi-task adapters built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.

Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin` for the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow.
## Release Contents
| Folder | Scale | Type | Task(s) | Base model |
|---|---|---|---|---|
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_7B_avqa` | 7B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` |
The default upload manifest publishes only the final release files:
- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`

Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.
## Requirements
Use these checkpoints with the open-source UniMVU GitHub repository and install the dependencies from that repo:
```bash
git clone <UniMVU GitHub repo>
cd UniMVU
pip install -r requirements.txt
pip install huggingface_hub peft
```
Download the checkpoint folder you need from this repository, then point the UniMVU evaluation scripts to it with `--model-path`.
## Usage
These checkpoints are intended to be used together with the UniMVU GitHub repository.
- Clone the UniMVU repository and install its dependencies.
- Download the checkpoint subfolder you want from this Hugging Face repo.
- Set the downloaded folder as `--model-path` in the UniMVU evaluation scripts.
- Run the appropriate UniMVU evaluation entry point for your task.
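The download step above can be sketched with `huggingface_hub`'s `snapshot_download`, restricting the fetch to a single checkpoint subfolder via `allow_patterns`. The repo id below is a placeholder for this Hugging Face repository; the helper and its defaults are illustrative, not part of the UniMVU codebase.

```python
def adapter_patterns(folder: str) -> list[str]:
    """Glob patterns that select exactly one checkpoint subfolder."""
    return [f"{folder}/*"]

def download_adapter(repo_id: str, folder: str, local_dir: str = "checkpoints") -> str:
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    # Fetch only the requested adapter folder, e.g. "unimvu_7B_avqa".
    return snapshot_download(repo_id=repo_id,
                             allow_patterns=adapter_patterns(folder),
                             local_dir=local_dir)
```

After the download, `<local_dir>/<folder>` is the path to pass as `--model-path`.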
## Loader Mapping
| Release family | `model_type` | `model_arg_name` | `model_base` |
|---|---|---|---|
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
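Since the release folders follow a fixed naming scheme, the table above can be resolved programmatically from a folder name. This is a minimal sketch (the lookup helper is not part of the UniMVU codebase) that derives `(model_type, model_arg_name, model_base)` for any release folder:

```python
# Loader settings keyed by (unified?, scale), mirroring the mapping table.
LOADER_MAP = {
    (False, "0.5B"): ("unimvu", "VideoFeatModelArgumentsUniMVU",
                      "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    (False, "7B"):   ("unimvu", "VideoFeatModelArgumentsUniMVU_7B",
                      "lmms-lab/llava-onevision-qwen2-7b-ov"),
    (True, "0.5B"):  ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni",
                      "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    (True, "7B"):    ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni_7B",
                      "lmms-lab/llava-onevision-qwen2-7b-ov"),
}

def loader_settings(folder: str) -> tuple[str, str, str]:
    """Derive (model_type, model_arg_name, model_base) from a release folder name."""
    unified = folder.startswith("unimvu_uni")
    scale = "0.5B" if "0.5B" in folder else "7B"
    return LOADER_MAP[(unified, scale)]
```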
## Evaluation Entry Points
- Use `scripts/*_eval_*.sh` and `unified_eval.py` in the UniMVU repository for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` in the UniMVU repository for MVBench-style evaluation.
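Putting the pieces together, an invocation might be assembled as below. Only `--model-path` is documented in this card; the `--model-base` flag is an assumption mirroring the `model_base` column of the loader mapping, so check the UniMVU repository's scripts for the actual argument names.

```python
import shlex

def eval_command(script: str, model_path: str, model_base: str) -> list[str]:
    # --model-path is documented above; --model-base is an assumed flag
    # corresponding to the model_base column of the loader mapping table.
    return ["python", script,
            "--model-path", model_path,
            "--model-base", model_base]

cmd = eval_command("unified_eval.py",
                   "checkpoints/unimvu_7B_avqa",
                   "lmms-lab/llava-onevision-qwen2-7b-ov")
print(shlex.join(cmd))
```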
## License
The released adapters depend on third-party base models and should be used in compliance with the licenses of:
- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`
Please also follow the usage terms of the downstream datasets and features used in evaluation.
## Citation
If you use UniMVU in your work, please cite:
```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
## Acknowledgements
UniMVU builds on the open-source ecosystem around PAVE, Qwen2, LLaVA-OneVision, LMMS-Eval, PEFT, and Transformers.