BonanDing committed on
Commit 4ee4d25 · verified · 1 Parent(s): d9326ab

Add UniMVU model card and paper

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +141 -0
  3. UniMVU_CVPR_2026__Camera_Ready_.pdf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+UniMVU_CVPR_2026__Camera_Ready_.pdf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,141 @@
---
license: other
library_name: peft
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-ov
- lmms-lab/llava-onevision-qwen2-7b-ov
pipeline_tag: image-text-to-text
tags:
- multimodal
- video
- audio
- 3d
- peft
- lora
- safetensors
- llava-onevision
- qwen2
language:
- en
---

# UniMVU - LoRA Adapters for LLaVA-OneVision Qwen2

Open-source UniMVU checkpoints for instruction-aware multimodal video understanding. The release covers audio-video QA, 3D QA, and unified multi-task adapters built on top of `lmms-lab/llava-onevision-qwen2-0.5b-ov` and `lmms-lab/llava-onevision-qwen2-7b-ov`.

Unlike plain LoRA releases, UniMVU checkpoints also ship `non_lora_trainables.bin`, which holds the extra modality-gating modules. Load them with the UniMVU loader rather than a PEFT-only `PeftModel.from_pretrained(...)` workflow, or the gating weights will be silently missing.

[Paper PDF](./UniMVU_CVPR_2026__Camera_Ready_.pdf)

## Highlights

- Instruction-aware gating across video, audio, depth, and long-video evidence.
- Single-task adapters for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Unified multi-task adapters for the mixed-training UniMVU release.
- Gains of up to +13.5 CIDEr on AVSD over the reproduced PAVE baseline, as reported in the paper.

## Release Contents

| Folder | Scale | Type | Task(s) | Base model | Published size |
| --- | --- | --- | --- | --- | --- |
| `unimvu_0.5B_avqa` | 0.5B | Single-task | AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_avsd` | 0.5B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_music_avqa` | 0.5B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_scanqa` | 0.5B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_0.5B_sqa3d` | 0.5B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 96.4 MB |
| `unimvu_7B_avsd` | 7B | Single-task | AVSD | `lmms-lab/llava-onevision-qwen2-7b-ov` | 715.9 MB |
| `unimvu_7B_music_avqa` | 7B | Single-task | Music-AVQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | 715.9 MB |
| `unimvu_7B_scanqa` | 7B | Single-task | ScanQA | `lmms-lab/llava-onevision-qwen2-7b-ov` | 1.04 GB |
| `unimvu_7B_sqa3d` | 7B | Single-task | SQA3D | `lmms-lab/llava-onevision-qwen2-7b-ov` | 1.04 GB |
| `unimvu_uni_0.5B` | 0.5B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-0.5b-ov` | 103.7 MB |
| `unimvu_uni_7B` | 7B | Unified | Mixed multi-task release | `lmms-lab/llava-onevision-qwen2-7b-ov` | 745.3 MB |

The default upload manifest publishes only the final release files:

- `adapter_config.json`
- `adapter_model.safetensors`
- `config.json`
- `non_lora_trainables.bin`

Intermediate `checkpoint-*` folders inside `unimvu_uni_0.5B` are training snapshots and are excluded from the default Hugging Face upload.

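Before downloading, you can check which of the four release files a subfolder actually publishes by filtering the repo's file listing. A minimal sketch follows; in practice the listing would come from `huggingface_hub.list_repo_files("BonanDing/UniMVU")`, and the sample list here is illustrative:

```python
RELEASE_FILES = {
    "adapter_config.json",
    "adapter_model.safetensors",
    "config.json",
    "non_lora_trainables.bin",
}

def published_files(repo_files, subfolder):
    """Return the release files present directly under `subfolder`,
    skipping intermediate checkpoint-* snapshots and nested paths."""
    found = []
    for path in repo_files:
        folder, _, name = path.partition("/")
        if folder == subfolder and name in RELEASE_FILES:
            found.append(name)
    return sorted(found)

# Illustrative sample; replace with list_repo_files("BonanDing/UniMVU").
sample = [
    "unimvu_uni_7B/adapter_config.json",
    "unimvu_uni_7B/adapter_model.safetensors",
    "unimvu_uni_0.5B/checkpoint-100/adapter_model.safetensors",
]
print(published_files(sample, "unimvu_uni_7B"))
# → ['adapter_config.json', 'adapter_model.safetensors']
```

Note that files nested under `checkpoint-*` directories are ignored, matching the upload manifest described above.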
## Requirements

Use these adapters with the open-source UniMVU codebase and its dependencies:

```bash
pip install -r requirements.txt
pip install huggingface_hub peft
```

If you only need one adapter, prefer `snapshot_download(...)` with `allow_patterns` so you do not fetch the entire release repo.

## Quick Start

The example below downloads one subfolder from this repo and loads it through UniMVU's own evaluation loader, which merges the LoRA adapter and then restores `non_lora_trainables.bin`.

```python
import os

from huggingface_hub import snapshot_download

from unified_eval import load_trained_model_for_eval

REPO_ID = "BonanDing/UniMVU"
SUBFOLDER = "unimvu_uni_7B"

# Fetch only the requested subfolder, not the whole release repo.
local_root = snapshot_download(
    repo_id=REPO_ID,
    allow_patterns=[f"{SUBFOLDER}/*"],
)
model_path = os.path.join(local_root, SUBFOLDER)

tokenizer, model, image_processor, context_len = load_trained_model_for_eval(
    model_path=model_path,
    model_base="lmms-lab/llava-onevision-qwen2-7b-ov",
    model_arg_name="VideoFeatModelArgumentsUniMVU_Uni_7B",
    model_type="unimvu_uni",
    device="cuda",
)
model.eval()
```

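For reference, the "restore" step amounts to a `strict=False` state-dict load of `non_lora_trainables.bin` on top of the merged model. The helper below is a hypothetical sketch of the key normalization typically involved (checkpoints saved through a PEFT/Trainer wrapper often prefix keys with `base_model.model.`); the actual loader in the UniMVU codebase may differ:

```python
def strip_wrapper_prefix(state_dict, prefix="base_model.model."):
    """Drop the wrapper prefix from checkpoint keys so they match the
    merged model's own parameter names (hypothetical helper)."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# In practice the dict comes from something like:
#   state = torch.load("non_lora_trainables.bin", map_location="cpu")
#   model.load_state_dict(strip_wrapper_prefix(state), strict=False)
raw = {
    "base_model.model.gating.proj.weight": "tensor-A",
    "lm_head.weight": "tensor-B",
}
print(sorted(strip_wrapper_prefix(raw)))
# → ['gating.proj.weight', 'lm_head.weight']
```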
## Loader Mapping

| Release family | `model_type` | `model_arg_name` | `model_base` |
| --- | --- | --- | --- |
| Single-task 0.5B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Single-task 7B adapters | `unimvu` | `VideoFeatModelArgumentsUniMVU_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |
| Unified 0.5B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni` | `lmms-lab/llava-onevision-qwen2-0.5b-ov` |
| Unified 7B adapter | `unimvu_uni` | `VideoFeatModelArgumentsUniMVU_Uni_7B` | `lmms-lab/llava-onevision-qwen2-7b-ov` |

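When scripting over several release folders, the mapping can be encoded once so the right loader arguments are resolved from the folder name alone. This is a convenience sketch, not part of the release; the `loader_args` helper and its prefix heuristic are our own:

```python
LOADER_ARGS = {
    # family prefix: (model_type, model_arg_name, model_base)
    "unimvu_0.5B": ("unimvu", "VideoFeatModelArgumentsUniMVU",
                    "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    "unimvu_7B": ("unimvu", "VideoFeatModelArgumentsUniMVU_7B",
                  "lmms-lab/llava-onevision-qwen2-7b-ov"),
    "unimvu_uni_0.5B": ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni",
                        "lmms-lab/llava-onevision-qwen2-0.5b-ov"),
    "unimvu_uni_7B": ("unimvu_uni", "VideoFeatModelArgumentsUniMVU_Uni_7B",
                      "lmms-lab/llava-onevision-qwen2-7b-ov"),
}

def loader_args(folder):
    """Resolve (model_type, model_arg_name, model_base) for a release folder."""
    # Exact unified names first, then single-task names like unimvu_7B_avsd.
    if folder in LOADER_ARGS:
        return LOADER_ARGS[folder]
    for prefix in ("unimvu_uni_0.5B", "unimvu_uni_7B",
                   "unimvu_0.5B", "unimvu_7B"):
        if folder.startswith(prefix + "_"):
            return LOADER_ARGS[prefix]
    raise ValueError(f"unknown release folder: {folder}")

print(loader_args("unimvu_7B_avsd")[1])
# → VideoFeatModelArgumentsUniMVU_7B
```

The resolved tuple plugs directly into `load_trained_model_for_eval(...)` as shown in the Quick Start.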
## Evaluation Entry Points

- Use `unified_eval.py` for AVQA, AVSD, Music-AVQA, ScanQA, and SQA3D.
- Use `lmms_eval_start.py` for MVBench-style evaluation in the UniMVU codebase.

## License

The released adapters depend on third-party base models and should be used in compliance with the licenses of:

- `lmms-lab/llava-onevision-qwen2-0.5b-ov`
- `lmms-lab/llava-onevision-qwen2-7b-ov`

Please also follow the usage terms of the downstream datasets and features used in evaluation.

## Citation

If you use UniMVU in your work, please cite:

```bibtex
@inproceedings{ding2026unimvu,
  title={Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos},
  author={Ding, Bonan and Nawaz, Umair and Khan, Ufaq and Shaker, Abdelrahman M. and Khan, Muhammad Haris and Cao, Jiale and Xie, Jin and Khan, Fahad Shahbaz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}
```

## Acknowledgements

UniMVU builds on the open-source multimodal ecosystem around LLaVA-style training utilities, LMMS-Eval, PEFT, and Transformers.
UniMVU_CVPR_2026__Camera_Ready_.pdf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c88438d085ac9f0c5f27e0dc5154dec69c03253b67c0553d8b55386c572fd7c9
size 60618683