---
license: cc-by-nc-4.0
tags:
- AutoGaze
- NVILA
---

# NVILA-8B-HD-Video

[Project Page](https://autogaze.github.io/) | [Paper](https://huggingface.co/papers/2603.12254) | [GitHub](https://github.com/NVlabs/AutoGaze) | [Models & Data & Benchmark](https://huggingface.co/collections/bfshi/autogaze) | [Demo](https://huggingface.co/spaces/bfshi/AutoGaze)

NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.

Specifically, NVILA-HD-Video uses [AutoGaze](https://huggingface.co/nvidia/AutoGaze) to reduce redundant patches in a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This enables NVILA-HD-Video to efficiently scale to 4K-resolution, 1K-frame videos, achieving improved performance on benchmarks such as VideoMME and state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also introduced in this work.
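As a rough back-of-envelope illustration (this is not the actual AutoGaze selection algorithm, which chooses patches based on content), keeping only a fraction of the patch tokens translates directly into the token savings quoted above:

```python
import math

def kept_tokens(total_tokens: int, gazing_ratio: float) -> int:
    """Upper bound on tokens kept when at most `gazing_ratio` of patches survive.
    Illustrative arithmetic only; the real selection is content-dependent."""
    return math.ceil(total_tokens * gazing_ratio)

# A hypothetical video that would otherwise produce 1,000,000 patch tokens:
total = 1_000_000
kept = kept_tokens(total, 0.01)  # keep at most 1% of patches
print(total // kept)             # ~100x fewer tokens
```

The gazing ratio is a cap, not a fixed amount: as noted in the Quick Start below, AutoGaze can also stop early on a frame once its estimated reconstruction loss falls below a threshold.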

This model is for research and development only.

### Quick Start:

Note: please first install [AutoGaze](https://github.com/NVlabs/AutoGaze).

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = "nvidia/NVILA-8B-HD-Video"
video_path = "https://huggingface.co/datasets/bfshi/HLVid/resolve/main/example/clip_av_video_5_001.mp4"
prompt = (
    "Question: What does the white text on the green road sign say?\n"
    "A. Hampden St\n"
    "B. Hampden Ave\n"
    "C. HampdenBlvd\n"
    "D. Hampden Rd\n"
    "Please answer directly with the letter of the correct answer."
)

# ----- Video processing args -----
num_video_frames = 128           # Total sampled frames for tiles
num_video_frames_thumbnail = 64 # Total sampled frames for thumbnails
max_tiles_video = 48             # Max spatial tiles per video (one tile is 392x392)

# ----- AutoGaze args (tiles) -----
gazing_ratio_tile = [0.2] + [0.06] * 15  # Per-frame max gazing ratios (single float or list). Videos with higher resolution/FPS usually need lower gazing ratio.
task_loss_requirement_tile = 0.6         # AutoGaze stops gazing at each frame when the estimated reconstruction loss of that frame is lower than this threshold.

# ----- AutoGaze args (thumbnails) -----
gazing_ratio_thumbnail = 1       # Set gazing ratio to 1 and task loss requirement to None to skip gazing on thumbnails
task_loss_requirement_thumbnail = None

# ----- Batching -----
max_batch_size_autogaze = 16     # Set AutoGaze and SigLIP to use smaller mini-batch size if GPU memory is limited
max_batch_size_siglip = 32

# Load processor and model
processor = AutoProcessor.from_pretrained(
    model_path,
    num_video_frames=num_video_frames,
    num_video_frames_thumbnail=num_video_frames_thumbnail,
    max_tiles_video=max_tiles_video,
    gazing_ratio_tile=gazing_ratio_tile,
    gazing_ratio_thumbnail=gazing_ratio_thumbnail,
    task_loss_requirement_tile=task_loss_requirement_tile,
    task_loss_requirement_thumbnail=task_loss_requirement_thumbnail,
    max_batch_size_autogaze=max_batch_size_autogaze,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    max_batch_size_siglip=max_batch_size_siglip,
)
model.eval()

# Run inference
video_token = processor.tokenizer.video_token
inputs = processor(text=f"{video_token}\n\n{prompt}", videos=video_path, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

outputs = model.generate(**inputs)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()
print(response)
```

For more details, see the [VILA github repo](https://github.com/NVlabs/VILA/tree/main/vila_hd/nvila_hd_video).

### License/Terms of Use: <br> 

Governing Terms:  [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). Additional Information:  [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) for [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

### Deployment Geography:

Global

### Use Case: <br>

The model is used for understanding high-resolution long-form videos.

## Reference(s):

AutoGaze GitHub: https://github.com/NVlabs/AutoGaze <br> 

## Model Architecture:
**Architecture Type:** Neural Network

**Network Architecture:** Multi-modal Large Language Model

**Number of model parameters:** 8B <br>

**This model was developed based on [AutoGaze](https://huggingface.co/nvidia/AutoGaze) and [NVILA-Lite-8B](https://huggingface.co/Efficient-Large-Model/NVILA-Lite-8B).** <br>

## Input: <br>
**Input Type(s):** Video and Text <br>
**Input Format:** Red, Green, Blue (RGB) and strings <br>
**Input Parameters:** Three Dimensional (3D) and One Dimensional (1D) <br>
**Other Properties Related to Input:** Videos with resolution up to 4K and up to 1K frames; text input up to 20K tokens <br>
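For intuition on the spatial side, here is a hedged sketch of how many tiles a single 4K frame could occupy, assuming non-overlapping 392x392 tiles (the tile size noted in the Quick Start) and the `max_tiles_video` cap of 48 used there; the processor's actual tiling and resizing may differ:

```python
import math

TILE = 392            # tile side length, per the Quick Start comment
MAX_TILES_VIDEO = 48  # max spatial tiles per video, as configured above

def tiles_for_frame(width: int, height: int) -> int:
    """Count of non-overlapping TILE x TILE tiles covering one frame
    (illustrative only)."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)

raw = tiles_for_frame(3840, 2160)   # 10 * 6 = 60 tiles for one 4K frame
capped = min(raw, MAX_TILES_VIDEO)  # the configured cap already binds
print(raw, capped)                  # 60 48
```

This is why AutoGaze's patch selection matters at 4K: even a single frame can exceed the tile budget before any temporal sampling is considered.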

## Output: <br>
**Output Type(s):** Text <br>
**Output Format:** Strings <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Text output up to 20K tokens <br>
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br> 

## Software Integration:
**Runtime Engine(s):** 
Not Applicable (N/A) <br> 

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Hopper <br>
NVIDIA Jetson  <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
Linux 4 Tegra <br>
QNX  <br>
Windows <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s): 
v1.0 - Initial release

## Training Datasets: <br>   

72 datasets. See the NVILA paper for details.

Dataset partition: Training 100% <br>

## Training Dataset:

**Link:**
See NVILA paper for more details.

**Data Collection Method by dataset:**  <br>
[Hybrid: Automated, Human]

**Labeling Method by dataset:**  <br>
[Hybrid: Automated, Human]

**Properties (Quantity, Dataset Descriptions, Sensor(s)):**  <br>
72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning) <br>




## Inference:
**Acceleration Engine:** N/A <br>
**Test Hardware:** <br>
The model was tested on NVIDIA A100 GPUs.



### Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).