---
license: cc-by-nc-4.0
tags:
- AutoGaze
- NVILA
---

# NVILA-8B-HD-Video

[Project Page](https://autogaze.github.io/) | [Paper](https://huggingface.co/papers/2603.12254) | [GitHub](https://github.com/NVlabs/AutoGaze) | [Models & Data & Benchmark](https://huggingface.co/collections/bfshi/autogaze) | [Demo](https://huggingface.co/spaces/bfshi/AutoGaze)

NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.

Specifically, NVILA-HD-Video uses [AutoGaze](https://huggingface.co/nvidia/AutoGaze) to reduce redundant patches in a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This enables NVILA-HD-Video to efficiently scale to 4K-resolution, 1K-frame videos, achieving improved performance on benchmarks such as VideoMME and state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also introduced in this work.
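As a rough back-of-envelope illustration (this is not the actual AutoGaze selection algorithm, which chooses patches based on content), keeping only a fraction of the patch tokens translates directly into the token savings quoted above:

```python
import math

def kept_tokens(total_tokens: int, gazing_ratio: float) -> int:
    """Upper bound on tokens kept when at most `gazing_ratio` of patches survive.
    Illustrative arithmetic only; the real selection is content-dependent."""
    return math.ceil(total_tokens * gazing_ratio)

# A hypothetical video that would otherwise produce 1,000,000 patch tokens:
total = 1_000_000
kept = kept_tokens(total, 0.01)  # keep at most 1% of patches
print(total // kept)             # ~100x fewer tokens
```

The gazing ratio is a cap, not a fixed amount: as noted in the Quick Start below, AutoGaze can also stop early on a frame once its estimated reconstruction loss falls below a threshold.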

This model is for research and development only.

### Quick Start:

Note: please first install [AutoGaze](https://github.com/NVlabs/AutoGaze).

```python
import torch
from transformers import AutoModel, AutoProcessor

model_path = "nvidia/NVILA-8B-HD-Video"
video_path = "https://huggingface.co/datasets/bfshi/HLVid/resolve/main/example/clip_av_video_5_001.mp4"
prompt = (
    "Question: What does the white text on the green road sign say?\n"
    "A. Hampden St\n"
    "B. Hampden Ave\n"
    "C. HampdenBlvd\n"
    "D. Hampden Rd\n"
    "Please answer directly with the letter of the correct answer."
)

# ----- Video processing args -----
num_video_frames = 128           # Total sampled frames for tiles
num_video_frames_thumbnail = 64 # Total sampled frames for thumbnails
max_tiles_video = 48             # Max spatial tiles per video (one tile is 392x392)

# ----- AutoGaze args (tiles) -----
gazing_ratio_tile = [0.2] + [0.06] * 15  # Per-frame max gazing ratios (single float or list). Videos with higher resolution/FPS usually need lower gazing ratio.
task_loss_requirement_tile = 0.6         # AutoGaze stops gazing at each frame when the estimated reconstruction loss of that frame is lower than this threshold.

# ----- AutoGaze args (thumbnails) -----
gazing_ratio_thumbnail = 1       # Set gazing ratio to 1 and task loss requirement to None to skip gazing on thumbnails
task_loss_requirement_thumbnail = None

# ----- Batching -----
max_batch_size_autogaze = 16     # Set AutoGaze and SigLIP to use smaller mini-batch size if GPU memory is limited
max_batch_size_siglip = 32

# Load processor and model
processor = AutoProcessor.from_pretrained(
    model_path,
    num_video_frames=num_video_frames,
    num_video_frames_thumbnail=num_video_frames_thumbnail,
    max_tiles_video=max_tiles_video,
    gazing_ratio_tile=gazing_ratio_tile,
    gazing_ratio_thumbnail=gazing_ratio_thumbnail,
    task_loss_requirement_tile=task_loss_requirement_tile,
    task_loss_requirement_thumbnail=task_loss_requirement_thumbnail,
    max_batch_size_autogaze=max_batch_size_autogaze,
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    max_batch_size_siglip=max_batch_size_siglip,
)
model.eval()

# Run inference
video_token = processor.tokenizer.video_token
inputs = processor(text=f"{video_token}\n\n{prompt}", videos=video_path, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

outputs = model.generate(**inputs)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()
print(response)
```

For more details, see the [VILA github repo](https://github.com/NVlabs/VILA/tree/main/vila_hd/nvila_hd_video).

### License/Terms of Use: <br> 

Governing Terms:  [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). Additional Information:  [Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/) for [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

### Deployment Geography:

Global

### Use Case: <br>

The model is used for understanding high-resolution long-form videos.

## Reference(s):

AutoGaze GitHub: https://github.com/NVlabs/AutoGaze <br> 

## Model Architecture:
**Architecture Type:** Neural Network

**Network Architecture:** Multi-modal Large Language Model

**Number of model parameters:** 8B <br>

**This model was developed based on [AutoGaze](https://huggingface.co/nvidia/AutoGaze) and [NVILA-Lite-8B](https://huggingface.co/Efficient-Large-Model/NVILA-Lite-8B).** <br>

## Input: <br>
**Input Type(s):** Video and Text <br>
**Input Format:** Red, Green, Blue (RGB) and strings <br>
**Input Parameters:** Three Dimensional (3D) and One Dimensional (1D) <br>
**Other Properties Related to Input:** Videos with resolution up to 4K and up to 1K frames; text input up to 20K tokens <br>
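For intuition on the spatial side, here is a hedged sketch of how many tiles a single 4K frame could occupy, assuming non-overlapping 392x392 tiles (the tile size noted in the Quick Start) and the `max_tiles_video` cap of 48 used there; the processor's actual tiling and resizing may differ:

```python
import math

TILE = 392            # tile side length, per the Quick Start comment
MAX_TILES_VIDEO = 48  # max spatial tiles per video, as configured above

def tiles_for_frame(width: int, height: int) -> int:
    """Count of non-overlapping TILE x TILE tiles covering one frame
    (illustrative only)."""
    return math.ceil(width / TILE) * math.ceil(height / TILE)

raw = tiles_for_frame(3840, 2160)   # 10 * 6 = 60 tiles for one 4K frame
capped = min(raw, MAX_TILES_VIDEO)  # the configured cap already binds
print(raw, capped)                  # 60 48
```

This is why AutoGaze's patch selection matters at 4K: even a single frame can exceed the tile budget before any temporal sampling is considered.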

## Output: <br>
**Output Type(s):** Text <br>
**Output Format:** Strings <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Text output up to 20K tokens <br>
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br> 

## Software Integration:
**Runtime Engine(s):** 
Not Applicable (N/A) <br> 

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Hopper <br>
NVIDIA Jetson  <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
Linux 4 Tegra <br>
QNX  <br>
Windows <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s): 
v1.0 - Initial release

## Training Datasets: <br>   

72 datasets. See the NVILA paper for details.

Dataset partition: Training 100% <br>

## Training Dataset:

**Link:**
See NVILA paper for more details.

**Data Collection Method by dataset:**  <br>
[Hybrid: Automated, Human]

**Labeling Method by dataset:**  <br>
[Hybrid: Automated, Human]

**Properties (Quantity, Dataset Descriptions, Sensor(s)):**  <br>
72 datasets split into 5 stages (Projector Alignment, Vision Encoder Alignment, Pre-Training, Image Instruction-Tuning, and Patch Selection Tuning) <br>




## Inference:
**Acceleration Engine:** N/A <br>
**Test Hardware:** <br>
The model was tested on NVIDIA A100 GPUs.



### Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).