---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---
# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2
Open-source UniMVU release checkpoints for instruction-aware multimodal video understanding. This release provides single-task adapters for audio-video QA and 3D QA, as well as unified multi-task adapters, all built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.
Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin` for the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow.
[arXiv](#)
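As a rough sketch of what the UniMVU loader has to do beyond plain PEFT, assuming a LLaVA-style `load_pretrained_model` entry point (the import path `unimvu.model.builder` and the exact signature are assumptions; check the GitHub repo for the real one): the adapter is loaded against the base model, then the gating weights in `non_lora_trainables.bin` are applied on top with `strict=False`.

```python
def strip_lora_prefixes(state_dict):
    """Normalize keys saved as 'base_model.model.<name>' back to '<name>'."""
    prefix = "base_model.model."
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }


def load_unimvu(model_path: str, model_base: str):
    """Assumed loading flow; the real entry point lives in the UniMVU repo."""
    import torch
    from unimvu.model.builder import load_pretrained_model  # assumed import path

    tokenizer, model, processor, ctx_len = load_pretrained_model(
        model_path=model_path, model_base=model_base, model_name="unimvu"
    )
    # The modality-gating modules are not part of the LoRA adapter itself:
    extra = torch.load(f"{model_path}/non_lora_trainables.bin", map_location="cpu")
    model.load_state_dict(strip_lora_prefixes(extra), strict=False)
    return tokenizer, model, processor
```

The `strict=False` load is the important detail: the extra state dict covers only the gating modules, not the full model.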
## Release Contents
| Folder | Scale | Type | Task(s) | Base model |
| --- | --- | --- | --- | --- |
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_7B_avqa` | 7B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` |
The default upload manifest publishes only the final release files:
- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`
Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.
## Requirements
Use these checkpoints with the open-source [UniMVU GitHub repository](#) and install the dependencies from that repo:
```bash
git clone <UniMVU GitHub repo>
cd UniMVU
pip install -r requirements.txt
pip install huggingface_hub peft
```
Download the checkpoint folder you need from this repository, then point the UniMVU evaluation scripts to it with `--model-path`.
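One way to fetch a single checkpoint folder without cloning the whole repository is `huggingface_hub.snapshot_download` with `allow_patterns`; the repo id below is a placeholder for this repository's id.

```python
def release_patterns(folder: str) -> list[str]:
    """Glob patterns for the final release files of one checkpoint folder."""
    files = (
        "adapter_config.json",
        "adapter_model.safetensors",
        "config.json",
        "non_lora_trainables.bin",
    )
    return [f"{folder}/{name}" for name in files]


def fetch_checkpoint(folder: str, repo_id: str, local_dir: str = "./checkpoints") -> str:
    """Download one checkpoint subfolder; returns the local snapshot path."""
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,  # placeholder: substitute this repository's id
        allow_patterns=release_patterns(folder),
        local_dir=local_dir,
    )
```

The resulting local folder (e.g. `./checkpoints/unimvu_7B_avqa`) is what you pass to `--model-path`.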
## Usage
These checkpoints are intended to be used together with the [UniMVU GitHub repository](#).
1. Clone the UniMVU repository and install its dependencies.
2. Download the checkpoint subfolder you want from this Hugging Face repo.
3. Set the downloaded folder as `--model-path` in the UniMVU evaluation scripts.
4. Run the appropriate UniMVU evaluation entry point for your task.
## Loader Mapping
| Release family | `model_type` | `model_arg_name` | `model_base` |
| --- | --- | --- | --- |
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
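The mapping above is mechanical, so the loader settings can be derived from a checkpoint folder name. The helper below is an illustrative convenience, not part of the UniMVU codebase:

```python
def loader_settings(folder: str) -> tuple[str, str, str]:
    """Map a checkpoint folder name to (model_type, model_arg_name, model_base)."""
    unified = folder.startswith("unimvu_uni")
    large = "7B" in folder  # "0.5B" folder names never contain the substring "7B"
    arg = "VideoFeatModelArgumentsUniMVU"
    if unified:
        arg += "_Uni"
    if large:
        arg += "_7B"
    base = (
        "lmms-lab/llava-onevision-qwen2-7b-ov"
        if large
        else "lmms-lab/llava-onevision-qwen2-0.5b-ov"
    )
    return ("unimvu_uni" if unified else "unimvu", arg, base)
```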
## Evaluation Entry Points
- Use `scripts/*_eval_*.sh` and `unified_eval.py` in the UniMVU repository for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` in the UniMVU repository for MVBench-style evaluation.
## License
The released adapters depend on third-party base models and should be used in compliance with the licenses of:
- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`
Please also follow the usage terms of the downstream datasets and features used in evaluation.
## Citation
If you use UniMVU in your work, please cite:
```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```
## Acknowledgements
UniMVU builds on the open-source ecosystem around PAVE, Qwen2, LLaVA-OneVision, LMMS-Eval, PEFT, and Transformers.