Add UniMVU model card and paper

- .gitattributes: +1 -0
- README.md: +141 -0
- UniMVU_CVPR_2026__Camera_Ready_.pdf: +3 -0
.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+UniMVU_CVPR_2026__Camera_Ready_.pdf filter=lfs diff=lfs merge=lfs -text
README.md ADDED

@@ -0,0 +1,141 @@
---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---
# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2

Open-source UniMVU release checkpoints for instruction-aware multimodal video understanding. This release covers audio-video QA, 3D QA, and unified multi-task adapters built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.

Unlike plain LoRA releases, UniMVU checkpoints also include `non_lora_trainables.bin`, which holds the extra modality-gating modules. Use the UniMVU loader instead of a PEFT-only `PeftModel.from_pretrained(...)` workflow, as sketched below.
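For intuition, the sketch below shows roughly what such a loader must do beyond a plain PEFT load, assuming the common LLaVA-style LoRA checkpoint layout. The helper name `load_unimvu_checkpoint` and the `base_model.model.` key prefix are illustrative assumptions, not the UniMVU API; the supported path is `load_trained_model_for_eval`, shown in the Quick Start below.

```python
import os

import torch
from peft import PeftModel


def load_unimvu_checkpoint(base_model, model_path: str):
    """Sketch only: merge the LoRA adapter, then restore the gating weights."""
    # Attach the LoRA adapter to the base model and fold it into the weights.
    model = PeftModel.from_pretrained(base_model, model_path)
    model = model.merge_and_unload()

    # Restore the modality-gating modules, which PEFT knows nothing about.
    extra = torch.load(
        os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu"
    )
    # Strip the trainer's wrapper prefix -- a common LLaVA-LoRA convention,
    # assumed here rather than a documented UniMVU detail.
    extra = {k.removeprefix("base_model.model."): v for k, v in extra.items()}
    model.load_state_dict(extra, strict=False)
    return model
```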
[Paper PDF](./UniMVU_CVPR_2026__Camera_Ready_.pdf)

## Highlights

- Instruction-aware gating across video, audio, depth, and long-video evidence.
- Single-task adapters for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Unified multi-task adapters for the mixed-training UniMVU release.
- Gains of up to +13.5 CIDEr on AVSD over the reproduced PAVE baseline, as reported in the paper.
## Release Contents

| Folder | Scale | Type | Task(s) | Base model | Published size |
| --- | --- | --- | --- | --- | --- |
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` | 715.9 MB |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | 715.9 MB |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | 1.04 GB |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` | 1.04 GB |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 103.7 MB |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` | 745.3 MB |

The default upload manifest publishes only the final release files:

- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`

Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.
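To verify what a given folder actually publishes, you can list the repo contents with `huggingface_hub`; a small sanity check, assuming public read access to the repo id used in the Quick Start below:

```python
from huggingface_hub import list_repo_files

# List only the files under the unified 0.5B folder; per the note above,
# no intermediate checkpoint-* snapshots should appear.
files = [
    f for f in list_repo_files("BonanDing/UniMVU")
    if f.startswith("unimvu_uni_0.5B/")
]
print(files)
```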
## Requirements

Use these adapters with the open-source UniMVU codebase and its dependencies:

```bash
pip install -r requirements.txt
pip install huggingface_hub peft
```

If you only need one adapter, prefer `snapshot_download(...)` so you do not fetch the entire release repo.
## Quick Start

The example below downloads one subfolder from this repo and loads it through UniMVU's own evaluation loader, which merges the LoRA adapter and then restores `non_lora_trainables.bin`.

```python
import os

from huggingface_hub import snapshot_download

from unified_eval import load_trained_model_for_eval

REPO_ID = "BonanDing/UniMVU"
SUBFOLDER = "unimvu_uni_7B"

# Download only the chosen adapter folder instead of the whole repo.
local_root = snapshot_download(
    repo_id=REPO_ID,
    allow_patterns=[f"{SUBFOLDER}/*"],
)
model_path = os.path.join(local_root, SUBFOLDER)

# Merge the LoRA adapter into the base model, then restore the
# non-LoRA modality-gating weights from non_lora_trainables.bin.
tokenizer, model, image_processor, context_len = load_trained_model_for_eval(
    model_path=model_path,
    model_base="lmms-lab/llava-onevision-qwen2-7b-ov",
    model_arg_name="VideoFeatModelArgumentsUniMVU_Uni_7B",
    model_type="unimvu_uni",
    device="cuda",
)
model.eval()
```
## Loader Mapping

| Release family | `model_type` | `model_arg_name` | `model_base` |
| --- | --- | --- | --- |
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
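For example, combining the first mapping row with the `unimvu_0.5B_avsd` folder from the Release Contents table, the Quick Start call becomes:

```python
import os

from huggingface_hub import snapshot_download

from unified_eval import load_trained_model_for_eval

# Single-task 0.5B adapter: first row of the mapping table above.
local_root = snapshot_download(
    repo_id="BonanDing/UniMVU",
    allow_patterns=["unimvu_0.5B_avsd/*"],
)

tokenizer, model, image_processor, context_len = load_trained_model_for_eval(
    model_path=os.path.join(local_root, "unimvu_0.5B_avsd"),
    model_base="lmms-lab/llava-onevision-qwen2-0.5b-ov",
    model_arg_name="VideoFeatModelArgumentsUniMVU",
    model_type="unimvu",
    device="cuda",
)
model.eval()
```

Only the adapter folder, `model_base`, `model_arg_name`, and `model_type` change between release families; the rest of the workflow is identical.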
## Evaluation Entry Points

- Use `unified_eval.py` for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` for MVBench-style evaluation in the UniMVU codebase.
## License

The released adapters depend on third-party base models and should be used in compliance with the licenses of:

- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`

Please also follow the usage terms of the downstream datasets and features used in evaluation.
## Citation

If you use UniMVU in your work, please cite:

```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

## Acknowledgements

UniMVU builds on the open-source multimodal ecosystem around LLaVA-style training utilities, LMMS-Eval, PEFT, and Transformers.
UniMVU_CVPR_2026__Camera_Ready_.pdf ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c88438d085ac9f0c5f27e0dc5154dec69c03253b67c0553d8b55386c572fd7c9
+size 60618683