---
library_name: transformers
license: apache-2.0
tags:
- omni-modal
- multimodal
- vision
- audio
- video
- llm
model-index:
- name: OmniVinci
  results:
  - task:
      type: image-to-text
      name: Image Understanding
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - name: MVBench Score
      type: accuracy
      value: 70.6
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
  - task:
      type: video-to-text
      name: Video Understanding
    dataset:
      name: Video-MME
      type: video-mme
    metrics:
    - name: Video-MME (w/o sub)
      type: accuracy
      value: 68.2
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
  - task:
      type: video-to-text
      name: Cross-Modal Understanding
    dataset:
      name: DailyOmni
      type: dailyomni
    metrics:
    - name: DailyOmni Score
      type: accuracy
      value: 66.5
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
  - task:
      type: audio-to-text
      name: Audio Understanding
    dataset:
      name: MMAR
      type: mmar
    metrics:
    - name: MMAR Score
      type: accuracy
      value: 58.4
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
  - task:
      type: audio-to-text
      name: Audio-Only Reasoning
    dataset:
      name: MMAU
      type: mmau
    metrics:
    - name: MMAU Score
      type: accuracy
      value: 71.6
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
  - task:
      type: video-to-text
      name: Multi-Modal Reasoning
    dataset:
      name: Worldsense
      type: worldsense
    metrics:
    - name: Worldsense Score
      type: accuracy
      value: 48.2
      source:
        name: OmniVinci Technical Report
        url: https://arxiv.org/abs/2510.15870
---
# <span style="background: linear-gradient(45deg, #667eea 0%, #764ba2 25%, #f093fb 50%, #f5576c 75%, #4facfe 100%); -webkit-background-clip: text; -webkit-text-fill-color: transparent; background-clip: text; font-weight: bold; font-size: 1.1em;">**OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**</span> <br />

[![Paper](https://img.shields.io/badge/ArXiv-Paper-brown)](https://arxiv.org/abs/2510.15870)
[![Code](https://img.shields.io/badge/GitHub-Link-blue)](https://github.com/NVlabs/OmniVinci)
[![Model](https://img.shields.io/badge/HuggingFace-Model-yellow)](https://huggingface.co/nvidia/omnivinci)
[![Website](https://img.shields.io/badge/Web-Page-orange)](https://nvlabs.github.io/OmniVinci)


## Introduction
OmniVinci is an NVIDIA research project focused on exploring omni-modal LLMs that can not only see and read but also listen, speak, and reason.

OmniVinci ranks among the best omni-modal understanding models. Check out its performance on some of the most popular omni-modality, audio, and vision benchmarks:
<p align="center">
    <img src="./asset/performance.png" width="80%"/>
</p>


## Quickstart

Below, we provide simple examples to show how to use our model with Transformers.

### Environment Setup

1. Download the model repository from Hugging Face and navigate into it:
```bash
huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False
cd ./omnivinci
```

2. Set up the Python environment (based on the NVILA codebase):
```bash
bash ./environment_setup.sh omnivinci
```

### 🤗 Transformers Usage

#### Video (with Audio) Inference Example
```python
from transformers import AutoProcessor, AutoModel, AutoConfig
import torch

# default: Load the model on the available device(s)
model_path = "./"
video_path = "xxx.mp4"
generation_kwargs = {"max_new_tokens": 1024, "max_length": 99999999}
load_audio_in_video = True
num_video_frames = 128
audio_length = "max_3600"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

model = AutoModel.from_pretrained(model_path,
                                  trust_remote_code=True,
                                  torch_dtype=torch.float16,
                                  device_map="auto")

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
generation_config = model.default_generation_config
generation_config.update(**generation_kwargs)

model.config.load_audio_in_video = load_audio_in_video
processor.config.load_audio_in_video = load_audio_in_video
if num_video_frames > 0:
    model.config.num_video_frames = num_video_frames
    processor.config.num_video_frames = num_video_frames
if audio_length != -1:
    model.config.audio_chunk_length = audio_length
    processor.config.audio_chunk_length = audio_length


conversation = [{
        "role": "user",
        "content": [
            {"type": "video", "video":video_path},
            {"type": "text", "text": "Assess the video and provide a detailed description of its visual and audio contents."}
        ]
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

inputs = processor([text])

output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, 'media', None),
    media_config=getattr(inputs, 'media_config', None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

- **For audio and image inference examples, please refer to `example_mini_audio.py` and `example_mini_image.py`.**
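Audio inference follows the same pattern as the video example above; only the conversation payload changes. The sketch below builds such a payload in plain Python. Note this is an assumption mirroring the video example's structure (the `"type": "audio"` content key and the `build_audio_conversation` helper are hypothetical); consult `example_mini_audio.py` for the exact schema.

```python
# Hypothetical helper that builds a chat-template conversation for audio-only
# input, mirroring the structure of the video example above. The "audio"
# content type is an assumption; see example_mini_audio.py for the real schema.
def build_audio_conversation(audio_path: str, prompt: str) -> list:
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path},
            {"type": "text", "text": prompt},
        ],
    }]

conversation = build_audio_conversation("sample.wav",
                                        "Describe this audio clip in detail.")
# The resulting list would then be passed to processor.apply_chat_template(...)
# and processor([...]) exactly as in the video example.
```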


## License / Terms of Use
The model is released under the [NVIDIA OneWay Noncommercial License](asset/NVIDIA_OneWay_Noncommercial_License.docx).

## Citation
Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@article{ye2025omnivinci,
  title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
  author={Ye, Hanrong and Yang, Chao-Han Huck and Goel, Arushi and Huang, Wei and Zhu, Ligeng and Su, Yuanhang and Lin, Sean and Cheng, An-Chieh and Wan, Zhen and Tian, Jinchuan and others},
  journal={arXiv preprint arXiv:2510.15870},
  year={2025}
}
```