---
license: other
license_name: nvidia
license_link: LICENSE
---

## Model Overview

### Description:

VideoMAE model used for training AutoGaze. This model is for research and development only. <br>

### License/Terms of Use:

NVIDIA license (see LICENSE.md). The reference to the NVIDIA License means the attached custom NSCLv1 license, under which users may use the model for conducting non-commercial research activities and for non-commercial research publications.

### Deployment Geography:

Global <br>

### Use Case: <br>

The model is used to reconstruct videos from partially observed patches. <br>
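To illustrate the masked-reconstruction objective, the sketch below shows VideoMAE-style random token masking in plain NumPy: a clip is divided into spatiotemporal tubelet tokens and most of them are hidden from the encoder. The patch size, tubelet span, and 90% mask ratio are illustrative assumptions (typical VideoMAE defaults), not necessarily this model's actual configuration.

```python
import numpy as np

# Assumed, illustrative settings: 16-frame 224x224 clip, 16x16 spatial
# patches, tubelets spanning 2 frames, and a 90% masking ratio.
frames, height, width = 16, 224, 224
patch, tubelet = 16, 2
mask_ratio = 0.9

# Number of spatiotemporal tokens: 8 temporal slots x 14 x 14 spatial patches.
num_tokens = (frames // tubelet) * (height // patch) * (width // patch)
num_masked = int(num_tokens * mask_ratio)

# Randomly choose which tokens to hide; True = hidden from the encoder.
rng = np.random.default_rng(0)
mask = np.zeros(num_tokens, dtype=bool)
mask[rng.permutation(num_tokens)[:num_masked]] = True
```

The model sees only the unmasked tokens and is trained to reconstruct the pixels of the masked ones.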

## Quick Start:

See [our GitHub repo](https://github.com/NVlabs/AutoGaze) for instructions on how to use this model with AutoGaze.

## Reference(s):

GitHub: https://github.com/NVlabs/AutoGaze <br>

## Model Architecture:

**Architecture Type:** Transformer <br>

**Network Architecture:** Custom architecture <br>

**Number of model parameters:** 300M <br>

## Input(s):

**Input Type(s):** Video <br>

**Input Format(s):** Video: .mp4, .webm, .mov, etc. <br>

**Input Parameters:** Video: Three-Dimensional (3D) <br>

**Other Properties Related to Input:** Video with 16 frames at 224x224 resolution. <br>
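For reference, a clip matching these properties can be represented as a dense array of 16 RGB frames at 224x224. The `(batch, frames, channels, height, width)` layout below is an assumption borrowed from common VideoMAE implementations; this repository may order dimensions differently.

```python
import numpy as np

# Dummy clip matching the stated input properties: 1 batch element,
# 16 RGB frames, 224x224 pixels. Layout is an assumed convention.
clip = np.random.default_rng(0).random((1, 16, 3, 224, 224), dtype=np.float32)
```

Such a dummy clip is handy for smoke-testing preprocessing and shape handling before running real videos through the model.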

## Output(s)

**Output Type(s):** Video <br>

**Output Format(s):** Video: .mp4, .webm, .mov, etc. <br>

**Output Parameters:** Video: Three-Dimensional (3D) <br>

**Other Properties Related to Output:** Video with 16 frames at 224x224 resolution. <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

## Software Integration:

**Runtime Engine(s):** N/A

**Supported Hardware Microarchitecture Compatibility:** <br>
NVIDIA Ampere <br>
NVIDIA Blackwell <br>
NVIDIA Jetson <br>
NVIDIA Hopper <br>

**Preferred/Supported Operating System(s):** <br>
Linux <br>
Linux 4 Tegra <br>
QNX <br>
Windows <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

## Model Version(s):

v1.0 - Initial release <br>
## Training, Testing, and Evaluation Datasets: |
|
|
|
|
|
### Dataset Overview |
|
|
** Total Size: 800K <br> |
|
|
** Total Number of Datasets: 1 <br> |
|
|
|
|
|
** Dataset partition: Training [97%], Testing [3%] <br> |
|
|
** Time period for training data collection 2025/5-2025/8 <br> |
|
|
** Time period for testing data collection 2025/5-2025/8 <br> |
|
|
|
|
|
The data is constructed by collecting raw videos from existing video datasets and labeling gazing sequences for a subset of it using a greedy-search algorithm. <br> |

## Public Datasets

The raw videos are collected from public datasets, including Ego4D, 100DoH, InternVid, SA-1B, and IDL. <br>

## Training Dataset:

**Data Modality:** Video <br>

**Video Training Data Size:** 800K videos <br>

**Data Collection Method by dataset:** Automated <br>

**Labeling Method by dataset:** Automated <br>

**Properties:** 800K videos at 224x224 resolution with 16 frames per video. <br>

## Testing Dataset:

**Data Collection Method by dataset:** Automated <br>

**Labeling Method by dataset:** Automated <br>

**Properties:** 25K videos at 224x224 resolution with 16 frames per video. <br>

## Inference:

**Acceleration Engine:** N/A <br>

**Test Hardware:** A100 <br>

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>

Please make sure you have proper rights and permissions for all input image and video content. If an image or video includes people, personal health information, or intellectual property, the generated output will not blur, or maintain the proportions of, the subjects it contains. <br>