---
license: other
library_name: transformers
---
# **OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**
[Paper (arXiv)](https://arxiv.org/abs/2510.15870)
[Code (GitHub)](https://github.com/NVlabs/OmniVinci)
[Model (Hugging Face)](https://huggingface.co/nvidia/omnivinci)
[Project Page](https://nvlabs.github.io/OmniVinci)
## Introduction
OmniVinci is an NVIDIA research project focused on exploring omni-modal LLMs that can not only see and read but also listen, speak, and reason.
OmniVinci ranks among the best omni-modality understanding models on some of the most popular omni-modality, audio, and vision benchmarks.
## Quickstart
Below, we provide simple examples to show how to use our model with Transformers.
### Environment Setup
1. Download and navigate to the HuggingFace repository:
```
huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False
cd ./omnivinci
```
2. Install the Python environment (based on the NVILA codebase):
```
bash ./environment_setup.sh omnivinci
```
### 🤗 Transformers Usage
#### Video (with Audio) Inference Example
```python
from transformers import AutoProcessor, AutoModel, AutoConfig, AutoModelForCausalLM
import torch
import os

# default: Load the model on the available device(s)
model_path = "./"
video_path = "xxx.mp4"
generation_kwargs = {"max_new_tokens": 1024, "max_length": 99999999}
load_audio_in_video = True
num_video_frames = 128
audio_length = "max_3600"

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype="torch.float16",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Start from the model's default generation config and apply the overrides above
generation_config = model.default_generation_config
generation_config.update(**generation_kwargs)

# Mirror the media settings on both the model config and the processor config
model.config.load_audio_in_video = load_audio_in_video
processor.config.load_audio_in_video = load_audio_in_video
# Only override the frame count when a positive value is given
if num_video_frames > 0:
    model.config.num_video_frames = num_video_frames
    processor.config.num_video_frames = num_video_frames
# A value of -1 leaves the audio chunk length unchanged
if audio_length != -1:
    model.config.audio_chunk_length = audio_length
    processor.config.audio_chunk_length = audio_length

# Single-turn conversation: one video plus one text prompt
conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text": "Assess the video, followed by a detailed description of its video and audio contents."}
    ]
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)

inputs = processor([text])

output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, 'media', None),
    media_config=getattr(inputs, 'media_config', None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

- **For audio and image inference examples, please refer to `example_mini_audio.py` and `example_mini_image.py`.**

## License / Terms of Use
The model is released under the [NVIDIA OneWay Noncommercial License](asset/NVIDIA_OneWay_Noncommercial_License.docx).

## Citation
Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@article{ye2025omnivinci,
  title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
  author={Ye, Hanrong and Yang, Chao-Han Huck and Goel, Arushi and Huang, Wei and Zhu, Ligeng and Su, Yuanhang and Lin, Sean and Cheng, An-Chieh and Wan, Zhen and Tian, Jinchuan and others},
  journal={arXiv preprint arXiv:2510.15870},
  year={2025}
}
```
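
## Appendix: Conversation Payload Helper
The video example in the Quickstart passes the processor a single-turn conversation made of one `video` entry and one `text` entry. When running several videos or prompts, it can be convenient to build that structure in one place. The sketch below is only an illustration: `build_video_conversation` is a hypothetical helper that is not part of the OmniVinci codebase, and it assumes nothing beyond the dictionary layout shown in the Quickstart.

```python
from typing import Any


def build_video_conversation(video_path: str, prompt: str) -> list[dict[str, Any]]:
    """Return a single-turn user conversation with one video and one text prompt.

    Hypothetical helper: it only mirrors the dictionary layout used in the
    Quickstart example and does not touch the model or the processor.
    """
    if not video_path:
        raise ValueError("video_path must be a non-empty string")
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": prompt},
        ],
    }]


# Same placeholder path and prompt as in the Quickstart example
conversation = build_video_conversation(
    "xxx.mp4",
    "Assess the video, followed by a detailed description of its video and audio contents.",
)
```

The resulting `conversation` can be passed to `processor.apply_chat_template` in the same way as the hand-written list in the Quickstart.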