How to use from the Transformers library
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("nvidia/4D-RGPT-8B", dtype="auto")

Model Overview

Description:

4D-RGPT is a specialized multimodal large language model that improves region-level 4D (i.e., 3D + time) video understanding by distilling latent and explicit 4D perceptual signals (for example, depth and optical flow) from a frozen expert model into an NVILA-based student model. 4D-RGPT was developed by NVIDIA as part of the NVILA visual-language model family and introduces Perceptual 4D Distillation (P4D), Timestamp Positional Encoding (TPE), and the companion R4D-Bench benchmark for region-level 4D VQA.

This model is for research and development only.

License/Terms of Use:

Use of this model is governed by the CC-BY-NC-4.0 License.

Deployment Geography:

Global

Use Case:

Expected users are multimodal AI researchers, applied research teams, and developers studying video understanding, region grounding, 3D/4D reasoning, and physical AI. Representative use cases include region-level video question answering, model benchmarking, research on depth-and-time-aware MLLMs, and prototyping for domains such as robotics, autonomous driving, and industrial inspection.

Release Date:

Hugging Face, 06/01/2026, via https://huggingface.co/nvidia/4D-RGPT-8B.

Reference(s):

Model Architecture:

Architecture Type: Transformer

Network Architecture: NVILA-Lite-based MLLM using a SigLIP vision encoder, multimodal projector, and language model.

This model was developed based on: NVILA-Lite-8B

Number of model parameters: 8.0 × 10^9 (4D-RGPT-8B)

Describe design choices related to initialization techniques, hyperparameter tuning, regularization techniques, model optimization, damping, and training parameters: 4D-RGPT adds a lightweight, training-only MLP 4D perception decoder (hidden size 2,560) with GELU activations, Xavier weight initialization, and zero bias initialization. Training begins from pretrained NVILA weights. The total loss combines SFT, latent distillation, and explicit distillation terms. Timestamp Positional Encoding uses T = 10,000.
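To make the role of T concrete, here is a minimal sketch of a sinusoidal encoding applied to a frame timestamp. This card does not give the exact TPE formula, so the sketch assumes the standard Transformer-style sinusoidal form with base T = 10,000; the function name timestamp_encoding and the dim parameter are hypothetical and are not part of the released code.

```python
import math


def timestamp_encoding(t_seconds: float, dim: int, base: float = 10_000.0) -> list:
    """Encode a timestamp as interleaved sin/cos features.

    Assumption: standard sinusoidal positional encoding with base T,
    evaluated at a real-valued timestamp instead of an integer position.
    The model's actual TPE may differ in detail.
    """
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    features = []
    for i in range(dim // 2):
        # Frequencies decay geometrically from 1 down to roughly 1/base.
        freq = base ** (-2.0 * i / dim)
        features.append(math.sin(t_seconds * freq))
        features.append(math.cos(t_seconds * freq))
    return features


if __name__ == "__main__":
    # Encode the timestamp of a frame sampled 1.5 seconds into a video.
    print(timestamp_encoding(1.5, dim=8))
```

A larger base stretches the slowest-varying features over longer time spans, which is the usual reason for choosing a value such as 10,000.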

Input(s):

Input Type(s): Image, Text, Video

Input Format(s):

  • Image: RGB
  • Text: String
  • Video: .mp4

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)
  • Video: Three-Dimensional (3D)

Other Properties Related to Input: The model is designed for video-question answering with explicit temporal cues encoded through timestamps for sampled frames. The paper uses sampled frame timestamps for TPE and, for fair comparison on R4D-Bench, evaluates open-source models using 16 sampled frames. Region-level evaluation uses region prompts represented through Set-of-Marks (SoM) or region masks in benchmark workflows.
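The 16-frame setting with per-frame timestamps can be illustrated with a small sketch. It assumes uniform sampling at the center of equal-length segments, which is one common choice; the card does not state the sampling strategy, and the helper name sample_frame_timestamps is hypothetical.

```python
def sample_frame_timestamps(num_frames_total: int, fps: float, num_samples: int = 16) -> list:
    """Return (frame_index, timestamp_seconds) pairs for uniformly sampled frames.

    Assumption: the video is split into equal segments and the frame at the
    center of each segment is taken. Other sampling schemes are possible.
    """
    if num_frames_total <= 0 or fps <= 0 or num_samples <= 0:
        raise ValueError("num_frames_total, fps, and num_samples must be positive")
    n = min(num_samples, num_frames_total)
    indices = [int((k + 0.5) * num_frames_total / n) for k in range(n)]
    # The timestamp of a frame is its index divided by the frame rate.
    return [(i, i / fps) for i in indices]


if __name__ == "__main__":
    # A 320-frame clip at 30 fps, sampled at 16 frames.
    for index, seconds in sample_frame_timestamps(320, 30.0):
        print(index, round(seconds, 3))
```

The resulting timestamps are the kind of temporal cue a timestamp-based encoding would consume alongside each sampled frame.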

Output(s):

Output Type(s): Text

Output Format(s):

  • Text: String

Output Parameters:

  • Text: One-Dimensional (1D)

Other Properties Related to Output: Outputs are text answers for 3D/4D VQA tasks, commonly multiple-choice selections, short phrases, or short numeric answers. The paper focuses on accuracy benchmarks rather than production API formatting. The public training setup uses NVIDIA A100-SXM4-80GB GPUs.
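Because the answers are free-form text, multiple-choice accuracy requires mapping each answer to an option letter first. The sketch below shows one simple way to do that; it assumes options are labeled A to D and that the answer begins with the chosen letter. The helpers extract_choice and accuracy are hypothetical and do not reproduce the benchmark's official scoring.

```python
import re
from typing import Optional


def extract_choice(answer: str) -> Optional[str]:
    """Return the leading option letter (A-D) of an answer, or None.

    Assumption: answers look like "B", "b.", or "(C) the red car".
    An answer that starts with the word "A" would be misread as option A,
    so real scoring code needs a stricter protocol.
    """
    match = re.match(r"\s*\(?([A-Da-d])\)?(?:[.:)\s]|$)", answer)
    return match.group(1).upper() if match else None


def accuracy(predictions: list, gold: list) -> float:
    """Percentage of predictions whose extracted letter equals the gold letter."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(extract_choice(p) == g.upper() for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    print(accuracy(["A", "b.", "unsure"], ["A", "B", "C"]))
```

Answers that cannot be parsed are counted as wrong here, which is a conservative choice.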

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Not Applicable (N/A); inference uses NVILA

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere — A100-SXM4-80GB

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

  • 4D-RGPT-8B — main paper model based on NVILA-Lite-8B; this is the primary reported configuration in the main results tables.

Training, Testing, and Evaluation Datasets:

Dataset Overview

Total Size: Approximately 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture, based on the paper-reported counts. This corresponds to approximately 2.06e5 unique visual items (about 190k images plus about 16.2k videos).
Total Number of Datasets: 4 training datasets.

General description of data processing: The training mixture comprises VSTI-Bench training data, the NuScenes portion of Wolf, RoboFAC, and SAT. For evaluation, this release reports results on the companion R4D-Bench benchmark, VLM4D-real, and VSTI-Bench.

Public Datasets

Training datasets:

  • VSTI-Bench (training split): ~1.2k unique videos and ~130k QA pairs. Source videos are from ScanNet and ScanNet++.
  • Wolf (NuScenes portion): ~5k unique videos and ~15k QA pairs derived from dense captions.
  • RoboFAC: ~10k unique videos and ~65k conversations; simulated robotic-arm videos.
  • SAT (training split): ~190k unique simulated images and ~170k QA pairs.
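As a quick consistency check, the per-dataset counts listed above add up to the totals given in the Dataset Overview. The sketch below uses only the approximate figures from this card; the dictionary layout is illustrative.

```python
# Approximate counts as listed in this card (all values are rounded).
training_mixture = {
    "VSTI-Bench (training split)": {"videos": 1_200, "images": 0, "examples": 130_000},
    "Wolf (NuScenes portion)": {"videos": 5_000, "images": 0, "examples": 15_000},
    "RoboFAC": {"videos": 10_000, "images": 0, "examples": 65_000},
    "SAT (training split)": {"videos": 0, "images": 190_000, "examples": 170_000},
}

total_examples = sum(d["examples"] for d in training_mixture.values())
total_videos = sum(d["videos"] for d in training_mixture.values())
total_images = sum(d["images"] for d in training_mixture.values())
total_visual_items = total_videos + total_images

if __name__ == "__main__":
    print(total_examples)      # 380000, i.e. about 3.8e5
    print(total_videos)        # 16200, i.e. about 16.2k
    print(total_visual_items)  # 206200, i.e. about 2.06e5
```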

Evaluation datasets:

  • R4D-Bench
  • VLM4D-real
  • VSTI-Bench

Training Dataset:

Data Modality:

  • Image
  • Text
  • Video

Training Data Size:

Approx. 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture.

Image Training Data Size

  • Less than a Million Images

Text Training Data Size

  • Less than a Billion Tokens

Video Training Data Size

  • Less than 10,000 Hours

Data Collection Method by dataset

  • Hybrid: Automatic/Sensors, Human, Synthetic

Labeling Method by dataset

  • Hybrid: Human, Automated, Synthetic

Properties: The training data are English-language multimodal supervision examples spanning indoor scenes, autonomous driving, robotics, and simulated 3D/4D reasoning tasks. The mixture includes real-world videos, simulated videos, simulated images, and model-generated QA from captions.

Evaluation Dataset:

Data Collection Method by dataset:

  • Hybrid: Automatic/Sensors, Human, Synthetic

Labeling Method by dataset:

  • Hybrid: Human, Automated, Synthetic

Properties: Public evaluation is reported on R4D-Bench, VLM4D-real, and VSTI-Bench. These evaluation sets are used for testing only and are disjoint from the training data. R4D-Bench is region-prompted and is curated from STI-Bench and VLM4D through keyword extraction, segmentation, Set-of-Marks prompting, automated matching, and human verification.

Quantitative Evaluation Benchmarks

4D-RGPT-8B:

  • R4D-Bench: 46.2
  • VLM4D-real: 53.8
  • VSTI-Bench: 59.8

Inference:

Acceleration Engine: N/A
Test Hardware:

  • NVIDIA A100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
