Update README.md
#1
by bfshi - opened

README.md CHANGED

@@ -11,7 +11,7 @@ tags:

  NVILA-HD-Video is a Multi-modal Large Language Model with 8B parameters that understands and answers questions about videos with up to 4K resolution and 1K frames.

- Specifically, NVILA-HD-Video uses [AutoGaze](nvidia/AutoGaze) to reduce redundant patches in a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos, improving performance on benchmarks such as VideoMME and achieving state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work.
+ Specifically, NVILA-HD-Video uses [AutoGaze](https://huggingface.co/nvidia/AutoGaze) to reduce redundant patches in a video before running the ViT or LLM. Empirically, AutoGaze can reduce the number of tokens in a video by up to 100x, cutting ViT latency by up to 19x and LLM latency by up to 10x. This enables NVILA-HD-Video to scale efficiently to 4K-resolution, 1K-frame videos, improving performance on benchmarks such as VideoMME and achieving state-of-the-art performance on HLVid, a high-resolution long-form video benchmark also proposed in this work.

  This model is for research and development only.
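The README text above describes pruning redundant video patches before they reach the ViT or LLM. AutoGaze's actual selection criterion is not described here; the sketch below is only a generic illustration of the token-reduction idea, using a made-up saliency score (the L2 norm of each patch embedding) and a hypothetical `prune_video_tokens` helper.

```python
import numpy as np

def prune_video_tokens(patch_features: np.ndarray, keep_ratio: float = 0.01) -> np.ndarray:
    """Keep only the most 'salient' patch tokens.

    patch_features: (num_patches, dim) array of patch embeddings.
    keep_ratio:     fraction of tokens to keep (0.01 mirrors the
                    "up to 100x" reduction claimed in the README).

    The saliency score used here (L2 norm) is a placeholder, not
    AutoGaze's real criterion.
    """
    scores = np.linalg.norm(patch_features, axis=1)   # one score per patch
    k = max(1, int(len(scores) * keep_ratio))         # number of tokens to keep
    keep_idx = np.argsort(scores)[-k:]                # indices of the top-k patches
    return patch_features[np.sort(keep_idx)]          # preserve original ordering

# A 4K-resolution, 1K-frame video yields an enormous patch grid;
# pruning to 1% of tokens is what makes the downstream ViT/LLM tractable.
features = np.random.rand(100_000, 64)
pruned = prune_video_tokens(features, keep_ratio=0.01)
print(pruned.shape)  # (1000, 64)
```

Because pruning happens before the ViT and LLM run, both stages see far fewer tokens, which is the source of the latency reductions quoted in the diff.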