Model Overview
Description:
NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.
Specifically, NVILA-HD-Video uses AutoGaze to prune redundant patches from a video before they reach the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos and achieve improved performance on benchmarks such as VideoMME, as well as state-of-the-art performance on HLVid, a high-resolution long-form video benchmark introduced alongside this model.
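As a rough illustration of the token budget involved, consider the dense (no-AutoGaze) case. The 14x14 ViT patch size below is an illustrative assumption; only the 392x392 tile size is stated in this card:

```python
# Back-of-the-envelope token arithmetic for the dense (no-AutoGaze) case.
# Assumes 392x392 tiles split into 14x14 ViT patches; the patch size is
# an illustrative assumption, not a value stated in this card.
tiles_per_video = 48                 # max_tiles_video from the quick start
patches_per_tile = (392 // 14) ** 2  # 28 x 28 = 784 patches per tile
dense_tokens = tiles_per_video * patches_per_tile

# With a 6% gazing ratio, AutoGaze keeps only a small fraction of these:
kept_tokens = int(dense_tokens * 0.06)
print(dense_tokens, kept_tokens)  # -> 37632 2257
```

Lower gazing ratios prune more aggressively, which is how reductions of up to 100x become possible.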
This model is for research and development only.
Quick Start:
Note: please first install AutoGaze.
```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = "nvidia/NVILA-8B-HD-Video"
video_path = "https://huggingface.co/datasets/bfshi/HLVid/resolve/main/example/clip_av_video_5_001.mp4"
prompt = (
    "Question: What does the white text on the green road sign say?\n"
    "A. Hampden St\n"
    "B. Hampden Ave\n"
    "C. Hampden Blvd\n"
    "D. Hampden Rd\n"
    "Please answer directly with the letter of the correct answer."
)

# ----- Video processing args -----
num_video_frames = 128           # Total sampled frames for tiles
num_video_frames_thumbnail = 64  # Total sampled frames for thumbnails
max_tiles_video = 48             # Max spatial tiles per video (one tile is 392x392)

# ----- AutoGaze args (tiles) -----
gazing_ratio_tile = [0.2] + [0.06] * 15  # Per-frame max gazing ratios (single float or list)
task_loss_requirement_tile = 0.6

# ----- AutoGaze args (thumbnails) -----
gazing_ratio_thumbnail = 1  # Set to None to skip gazing on thumbnails
task_loss_requirement_thumbnail = None

# ----- Batching -----
max_batch_size_autogaze = 16
max_batch_size_siglip = 32

# Load processor and model
processor = AutoProcessor.from_pretrained(
    model_path,
    num_video_frames=num_video_frames,
    num_video_frames_thumbnail=num_video_frames_thumbnail,
    max_tiles_video=max_tiles_video,
    gazing_ratio_tile=gazing_ratio_tile,
    gazing_ratio_thumbnail=gazing_ratio_thumbnail,
    task_loss_requirement_tile=task_loss_requirement_tile,
    task_loss_requirement_thumbnail=task_loss_requirement_thumbnail,
    max_batch_size_autogaze=max_batch_size_autogaze,
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    max_batch_size_siglip=max_batch_size_siglip,
)
model.eval()

# Run inference
video_token = processor.tokenizer.video_token
inputs = processor(text=f"{video_token}\n\n{prompt}", videos=video_path, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
outputs = model.generate(**inputs)
# Decode only the newly generated tokens (everything after the prompt)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()
print(response)
```
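The per-frame gazing-ratio list above can also be generated programmatically. The helper below is a hypothetical convenience, not part of the AutoGaze API; it assumes, as the example values suggest, that the first entry applies to the first frame group and the remaining entries share a lower ratio:

```python
def make_gazing_schedule(first_ratio=0.2, rest_ratio=0.06, num_entries=16):
    """Build a per-frame gazing-ratio list: keep more patches early, prune
    later frames more aggressively. Hypothetical helper, not an official
    AutoGaze utility."""
    return [first_ratio] + [rest_ratio] * (num_entries - 1)

gazing_ratio_tile = make_gazing_schedule()  # matches the quick-start value
```

Raising `rest_ratio` trades speed for accuracy; a single float may be passed instead of a list, per the comment in the quick start.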
For more details, see the VILA GitHub repo.
License/Terms of Use:
Governing Terms: CC-BY-NC-SA-4.0. Additional Information: Apache License 2.0 for Qwen2-VL-7B-Instruct.
Deployment Geography:
Global
Use Case:
The model is used for understanding high-resolution long-form videos.
Reference(s):
AutoGaze GitHub: https://github.com/NVlabs/AutoGaze
Model Architecture:
Architecture Type: Neural Network
Network Architecture: Multi-modal Large Language Model
Number of model parameters: 8B
**This model was developed based on AutoGaze and NVILA-Lite-8B.**
Input:
Input Type(s): Video and Text
Input Format: Red, Green, Blue (RGB) and strings
Input Parameters: Three Dimensional (3D) and One Dimensional (1D)
Other Properties Related to Input: Videos with resolution up to 4K and up to 1K frames; text input up to 20K tokens
Output:
Output Type(s): Text
Output Format: Strings
Output Parameters: One Dimensional (1D)
Other Properties Related to Output: Text output up to 20K tokens
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
Not Applicable (N/A)
Supported Hardware Microarchitecture Compatibility:
NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Hopper
NVIDIA Jetson
Preferred/Supported Operating System(s):
Linux
Linux 4 Tegra
QNX
Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
v1.0 - Initial release
Training Datasets:
72 datasets. See the NVILA paper for more details.
Dataset partition: Training 100%
Training Dataset:
Link: See the NVILA paper for more details.
Data Collection Method by dataset:
[Hybrid: Automated, Human]
Labeling Method by dataset:
[Hybrid: Automated, Human]
Properties (Quantity, Dataset Descriptions, Sensor(s)):
72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning)
Inference:
Acceleration Engine: N/A
Test Hardware:
The model was tested on NVIDIA A100 GPUs.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.