Danny Yin committed on
Commit · db1bb53
Parent(s): 2d11f21
release
Browse files:
- LICENSE.md +35 -0
- README.md +132 -0
- config.json +289 -0
- model.safetensors +3 -0
- preprocessor_config.json +27 -0
LICENSE.md ADDED
@@ -0,0 +1,35 @@
NVIDIA License

1. Definitions

“Licensor” means any person or entity that distributes its Work.

“Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.

The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.

Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

2. License Grant

2.1 Copyright Grant. Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

3. Limitations

3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for non-commercial research activities or non-commercial research publications only.

3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

3.5 Trademarks. This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

3.6 Termination. If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

4. Disclaimer of Warranty.

THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
README.md ADDED
@@ -0,0 +1,132 @@
---
license: other
license_name: nvidia
license_link: LICENSE
---


## Model Overview

### Description:

AutoGaze is an ultra-lightweight model that automatically removes redundant patches in a video before passing it to any Vision Transformer (ViT) or Multi-modal Large Language Model (MLLM).

Specifically, AutoGaze perceives each frame and autoregressively selects ("gazes at") a minimal set of patches that can reconstruct the original video (i.e., the non-redundant patches) up to a reconstruction-loss threshold provided by the user. AutoGaze decides on its own when to stop gazing for each frame, based on the user's maximum acceptable reconstruction loss.<br>

Empirically, AutoGaze can reduce the number of tokens in ViTs/MLLMs by up to 100x, reducing their latency by up to 19x/10x respectively. This enables efficiently scaling MLLMs to 4K-resolution, 1K-frame videos, improving performance on benchmarks such as VideoMME. In particular, it improves performance by 14% on HLVid, a high-resolution long-form video benchmark also proposed in this work.

This model is for research and development only. <br>

### Quick Start:

See [our GitHub repo](https://github.com/NVlabs/AutoGaze/QUICK_START.md) for instructions on how to use AutoGaze.

### License/Terms of Use:

NVIDIA License (see LICENSE.md). The NVIDIA License refers to the attached custom NSCLv1 license, under which users may use the model only for conducting non-commercial research activities and publishing non-commercial research.

### Deployment Geography:
Global <br>

### Use Case: <br>
The model is used to remove redundancy in videos and accelerate video encoders and MLLMs. <br>


## Reference(s):
GitHub: https://github.com/NVlabs/AutoGaze <br>

## Model Architecture:
**Architecture Type:** CNN and Transformer. <br>

**Network Architecture:** Custom architecture. <br>

**Number of model parameters:** 3M <br>


## Input(s): <br>
**Input Type(s):** Video <br>

**Input Format(s):** Video: .mp4/.webm/.mov/etc. <br>

**Input Parameters:** Video: Three-Dimensional (3D)

**Other Properties Related to Input:** Video of any resolution or duration.


## Output(s)

**Output Type(s):** Integers (representing patch indices) <br>

**Output Format(s):** Integers <br>

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** N/A

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:
**Runtime Engine(s):** N/A

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Jetson <br>
NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):**
Linux <br>
Linux 4 Tegra <br>
QNX <br>
Windows <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):
v1.0 - Initial release <br>

## Training, Testing, and Evaluation Datasets:

### Dataset Overview
**Total Size:** 800K videos <br>
**Total Number of Datasets:** 1 <br>

**Dataset partition:** Training [97%], Testing [3%] <br>
**Time period for training data collection:** 2025/5-2025/8 <br>
**Time period for testing data collection:** 2025/5-2025/8 <br>

The dataset is constructed by collecting raw videos from existing video datasets and labeling gazing sequences for a subset of them using a greedy-search algorithm. <br>
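The greedy labeling idea can be illustrated with a toy sketch. This is not the released algorithm: here a "frame" is a 1-D list of scalar patches, an unselected patch is reconstructed as the value of the nearest selected patch, and patches are added greedily until the frame's mean squared reconstruction error falls below a user threshold. All names and the reconstruction rule are illustrative assumptions.

```python
# Toy sketch of greedy "gazing"-sequence labeling (illustrative only, not the
# released algorithm). A frame is a 1-D list of scalar patches; an unselected
# patch is reconstructed as the value of the nearest selected patch.

def reconstruct(patches, selected):
    """Copy selected patches; fill the rest from the nearest selected index."""
    out = []
    for i, value in enumerate(patches):
        if i in selected:
            out.append(value)
        else:
            nearest = min(selected, key=lambda s: (abs(s - i), s))
            out.append(patches[nearest])
    return out

def reconstruction_loss(patches, selected):
    """Mean squared error between the frame and its reconstruction."""
    if not selected:
        return float("inf")
    recon = reconstruct(patches, selected)
    return sum((a - b) ** 2 for a, b in zip(patches, recon)) / len(patches)

def greedy_gaze(patches, max_loss):
    """Greedily add the patch that lowers the loss most, until max_loss is met."""
    selected = set()
    while reconstruction_loss(patches, selected) > max_loss:
        candidates = [i for i in range(len(patches)) if i not in selected]
        best = min(candidates,
                   key=lambda i: reconstruction_loss(patches, selected | {i}))
        selected.add(best)
    return sorted(selected)

# Two flat regions: two gazed patches suffice to reconstruct all eight.
print(greedy_gaze([1, 1, 1, 1, 9, 9, 9, 9], max_loss=1.0))  # [0, 6]
```

The point of the toy is the stopping rule: the number of selected patches is not fixed in advance but is driven by the acceptable reconstruction loss, mirroring how AutoGaze self-decides when to stop gazing.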

## Public Datasets
The raw videos are collected from public datasets including Ego4D, 100DoH, InternVid, SA-1B, and IDL. <br>

## Training Dataset [The dataset the model was trained on]:

**Data Modality:** Video <br>

**Video Training Data Size:** 800K videos <br>

**Data Collection Method by dataset:** Automated <br>

**Labeling Method by dataset:** Automated <br>

**Properties:** 800K videos at 224x224 resolution with 16 frames per video. <br>

## Testing Dataset:

**Data Collection Method by dataset:** Automated <br>

**Labeling Method by dataset:** Automated <br>

**Properties:** 25K videos at 224x224 resolution with 16 frames per video. <br>


## Inference:
**Acceleration Engine:** N/A <br>
**Test Hardware:** A100 <br>


## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>

Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated output will not blur or maintain the proportions of the subjects included. <br>
config.json ADDED
@@ -0,0 +1,289 @@
{
  "attn_mode": "sdpa",
  "gaze_model_config": {
    "connector_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_dim": 192,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "num_tokens": 196,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false
    },
    "gaze_decoder_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "attention_bias": false,
      "attention_dropout": 0.0,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": 12800000,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": 265,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "head_dim": 32,
      "hidden_act": "silu",
      "hidden_size": 192,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "initializer_range": 0.02,
      "intermediate_size": 384,
      "is_decoder": false,
      "is_encoder_decoder": false,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "max_position_embeddings": 131072,
      "min_length": 0,
      "mlp_bias": false,
      "model_type": "llama",
      "no_repeat_ngram_size": 0,
      "num_attention_heads": 6,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_hidden_layers": 4,
      "num_key_value_heads": 6,
      "num_multi_token_pred": 10,
      "num_return_sequences": 1,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "pretraining_tp": 1,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "rms_norm_eps": 1e-05,
      "rope_scaling": {
        "factor": 8.0,
        "high_freq_factor": 4.0,
        "low_freq_factor": 1.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "llama3"
      },
      "rope_theta": 500000.0,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": false,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "typical_p": 1.0,
      "use_bfloat16": false,
      "use_cache": true,
      "vocab_size": 266
    },
    "input_img_size": 224,
    "model_type": "",
    "num_vision_tokens_each_frame": 265,
    "vision_model_config": {
      "_attn_implementation_autoset": false,
      "_name_or_path": "",
      "add_cross_attention": false,
      "architectures": null,
      "bad_words_ids": null,
      "begin_suppress_tokens": null,
      "bos_token_id": null,
      "chunk_size_feed_forward": 0,
      "cross_attention_hidden_size": null,
      "decoder_start_token_id": null,
      "depth": 1,
      "diversity_penalty": 0.0,
      "do_sample": false,
      "early_stopping": false,
      "encoder_no_repeat_ngram_size": 0,
      "eos_token_id": null,
      "exponential_decay_length_penalty": null,
      "finetuning_task": null,
      "forced_bos_token_id": null,
      "forced_eos_token_id": null,
      "hidden_dim": 192,
      "id2label": {
        "0": "LABEL_0",
        "1": "LABEL_1"
      },
      "is_decoder": false,
      "is_encoder_decoder": false,
      "kernel_size": 16,
      "label2id": {
        "LABEL_0": 0,
        "LABEL_1": 1
      },
      "length_penalty": 1.0,
      "max_length": 20,
      "min_length": 0,
      "model_type": "",
      "no_repeat_ngram_size": 0,
      "num_beam_groups": 1,
      "num_beams": 1,
      "num_return_sequences": 1,
      "out_dim": 192,
      "output_attentions": false,
      "output_hidden_states": false,
      "output_scores": false,
      "pad_token_id": null,
      "prefix": null,
      "problem_type": null,
      "pruned_heads": {},
      "remove_invalid_values": false,
      "repetition_penalty": 1.0,
      "return_dict": true,
      "return_dict_in_generate": false,
      "sep_token_id": null,
      "suppress_tokens": null,
      "task_specific_params": null,
      "temperature": 1.0,
      "temporal_patch_size": 1,
      "tf_legacy_loss": false,
      "tie_encoder_decoder": false,
      "tie_word_embeddings": true,
      "tokenizer_class": null,
      "top_k": 50,
      "top_p": 1.0,
      "torch_dtype": null,
      "torchscript": false,
      "trunk_spatial_kernel_size": 3,
      "trunk_temporal_kernel_size": 3,
      "typical_p": 1.0,
      "use_bfloat16": false
    }
  },
  "gazing_ratio_config": {
    "exponential": {
      "gazing_ratio_max": 1,
      "gazing_ratio_min": 0,
      "lambda": 10
    },
    "fixed": {
      "gazing_ratio": 0.75
    },
    "sample_strategy_during_inference": "fixed",
    "sample_strategy_during_training": "fixed",
    "uniform": {
      "gazing_ratio_max": 1,
      "gazing_ratio_min": 0
    }
  },
  "gazing_ratio_each_frame_config": {
    "dirichlet": {
      "alpha": 0.5
    },
    "sample_strategy_during_inference": "uniform",
    "sample_strategy_during_training": "self",
    "self": null,
    "uniform": null
  },
  "has_task_loss_requirement_during_inference": true,
  "has_task_loss_requirement_during_training": false,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "max_batch_size": null,
  "max_num_frames": 16,
  "model_type": "autogaze",
  "num_vision_tokens_each_frame": 265,
  "scales": "32+64+112+224",
  "task_loss_requirement_config": {
    "fixed": {
      "task_loss_requirement": 0.7
    },
    "sample_strategy_during_inference": "fixed",
    "sample_strategy_during_training": "uniform",
    "uniform": {
      "task_loss_requirement_max": 1.0,
      "task_loss_requirement_min": 0.5
    }
  },
  "transformers_version": "4.51.3",
  "use_flash_attn": false
}
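The `gazing_ratio_config` block above names three sampling strategies (fixed, uniform, exponential) plus a per-phase strategy selector. How these fields are consumed is not documented in this commit; the following is a speculative sketch of one plausible dispatch, where the exponential truncation rule in particular is purely an assumption.

```python
import random

# Hypothetical reading of the gazing_ratio_config block from config.json:
# each strategy yields a ratio in [0, 1] for how many patches to gaze at.
GAZING_RATIO_CONFIG = {
    "fixed": {"gazing_ratio": 0.75},
    "uniform": {"gazing_ratio_min": 0, "gazing_ratio_max": 1},
    "exponential": {"gazing_ratio_min": 0, "gazing_ratio_max": 1, "lambda": 10},
    "sample_strategy_during_inference": "fixed",
}

def sample_gazing_ratio(cfg, rng=random):
    strategy = cfg["sample_strategy_during_inference"]
    params = cfg[strategy]
    if strategy == "fixed":
        return params["gazing_ratio"]
    if strategy == "uniform":
        return rng.uniform(params["gazing_ratio_min"], params["gazing_ratio_max"])
    if strategy == "exponential":
        # Assumed: exponential draw with rate "lambda", clamped to [min, max].
        lo, hi = params["gazing_ratio_min"], params["gazing_ratio_max"]
        return min(max(lo + rng.expovariate(params["lambda"]), lo), hi)
    raise ValueError(f"unknown strategy: {strategy}")

print(sample_gazing_ratio(GAZING_RATIO_CONFIG))  # 0.75 with the config above
```

With `sample_strategy_during_inference` set to `"fixed"`, as in the released config, the other two strategy blocks are effectively dormant defaults.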
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a48e6a83a198368e3798420ff5d5df42af7c0003c9230f85581c39a2ea64e9eb
size 13039216
preprocessor_config.json ADDED
@@ -0,0 +1,27 @@
{
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": false,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_processor_type": "AutoGazeImageProcessor",
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "offset": true,
  "resample": 2,
  "rescale_factor": 0.00784313725490196,
  "size": {
    "shortest_edge": 224
  }
}
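The preprocessor config above implies a simple per-pixel pipeline: rescale by `rescale_factor` (0.007843... = 2/255), apply the offset, then normalize with the listed mean/std. Reading `"offset": true` as "subtract 1 after rescaling", which maps [0, 255] into [-1, 1] and is the convention in comparable Hugging Face video processors, is an assumption; a minimal sketch:

```python
# Sketch of the per-pixel preprocessing implied by preprocessor_config.json.
# Assumption: "offset": true subtracts 1 after rescaling, so raw [0, 255]
# intensities land in [-1, 1] before the mean/std normalization.

RESCALE_FACTOR = 0.00784313725490196  # = 2 / 255
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]

def preprocess_pixel(value, channel, offset=True):
    """value: raw intensity in [0, 255]; channel: 0=R, 1=G, 2=B."""
    x = value * RESCALE_FACTOR
    if offset:
        x -= 1.0  # center into [-1, 1] (assumed semantics of "offset")
    return (x - IMAGE_MEAN[channel]) / IMAGE_STD[channel]

# Mid-gray in the red channel rescales to 0.0, then gets mean/std-normalized.
print(round(preprocess_pixel(127.5, 0), 4))
```

Note that the resulting values are not standard ImageNet normalization of [0, 1] inputs: the offset shifts the input range before the ImageNet mean/std are applied, which is simply what this config specifies.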