NVIDIA OmniDreams: Real-Time Generative Closed-Loop Autonomous Vehicle Simulation

Model Overview

Description:

NVIDIA OmniDreams generates photorealistic camera observations for autonomous driving simulation conditioned on simulator state and driving actions.

OmniDreams is an action-conditioned generative world model developed by NVIDIA that synthesizes multi-camera sensor observations in response to policy actions within a closed-loop simulation environment. The model operates autoregressively and maintains temporal context to support long-horizon rollouts while remaining responsive to changes in driving actions. It is designed to enable interactive testing of autonomous vehicle policies in photorealistic generative environments.

OmniDreams was developed by NVIDIA as part of the Cosmos world foundation model family and integrated into a closed-loop workflow with Alpamayo policies and the AlpaSim simulation runtime.
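The closed-loop, autoregressive operation described above can be pictured with a toy loop. This is only a sketch under stated assumptions: `SimState`, `stub_policy`, `stub_world_model`, and `rollout` are hypothetical names invented for illustration, the "frames" are plain strings rather than images, and none of this reflects the actual OmniDreams, Alpamayo, or AlpaSim interfaces.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class SimState:
    """Hypothetical simulator state: step index and ego position in meters."""
    step: int
    ego_x: float


def stub_policy(frame: str) -> float:
    """Hypothetical policy: returns a forward speed (m/s) for the last frame."""
    return 10.0


def stub_world_model(state: SimState, action: float, memory: deque) -> str:
    """Stand-in for the generative model: returns a label, not pixels."""
    return f"frame@{state.step}:x={state.ego_x:.1f}:ctx={len(memory)}"


def rollout(num_steps: int, dt: float = 0.1, memory_len: int = 4) -> list:
    state = SimState(step=0, ego_x=0.0)
    memory = deque(maxlen=memory_len)  # bounded temporal context
    frame = "initial-frame"
    frames = []
    for _ in range(num_steps):
        action = stub_policy(frame)  # policy reacts to the last observation
        state = SimState(state.step + 1, state.ego_x + action * dt)  # simulator advances
        frame = stub_world_model(state, action, memory)  # model renders the new state
        memory.append(frame)  # generated frame becomes context for the next step
        frames.append(frame)
    return frames


frames = rollout(6)
print(frames[0])
print(frames[-1])
```

The point of the sketch is the data flow: each generated observation feeds the policy, the policy's action changes the simulator state, and the model conditions on that state plus a bounded memory of its own earlier outputs.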

This model is ready for commercial/non-commercial use.

[Video: a 20-second OmniDreams rollout, generated front-view camera output.]

License/Terms of Use:

Governing Download Terms: Use of the model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography:

Global

Use Case:

OmniDreams is intended for researchers and developers working on autonomous vehicle simulation and machine learning systems.

Release Date:

  • GitHub: 06/02/2026, via public repo https://github.com/nv-tlabs/omni-dreams
  • Hugging Face: 06/02/2026, via public repo https://huggingface.co/nvidia/omni-dreams-models

Model Architecture:

Architecture Type: Transformer

Network Architecture: Diffusion Transformer

  • Built on NVIDIA Cosmos. This model was developed based on NVIDIA Cosmos-Predict2.5.
  • Number of model parameters: 2B.

Inputs:

Input Types: Image, Video (world state), Text

Input Formats:

  • Image: Red, Green, Blue (RGB). First camera frame.
  • Video: Red, Green, Blue (RGB) image frames. Lane lines and bounding box representation of the world scenario.
  • Text: String. Natural language prompt describing environmental conditions.

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)
  • Video: Three-Dimensional (3D)

Other Properties Related to Input: The input resolution is 704 x 1280.
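As a rough illustration of the input specification above, the following sketch checks array shapes before they are handed to a model. It assumes a channels-last layout, `(H, W, C)` for the first frame and `(T, H, W, C)` for the world-state video, which the card does not specify; `ConditioningInputs` and `validate` are hypothetical names, not part of any released API.

```python
from dataclasses import dataclass

HEIGHT, WIDTH, CHANNELS = 704, 1280, 3  # from the card: 704 x 1280, RGB


@dataclass
class ConditioningInputs:
    """Hypothetical container; only shapes are tracked, not pixel data."""
    first_frame_shape: tuple   # assumed layout: (H, W, C)
    world_state_shape: tuple   # assumed layout: (T, H, W, C)
    prompt: str = ""           # environmental-conditions prompt; may be empty


def validate(inputs: ConditioningInputs) -> list:
    """Return a list of problems; an empty list means the shapes look right."""
    expected = (HEIGHT, WIDTH, CHANNELS)
    problems = []
    if inputs.first_frame_shape != expected:
        problems.append(f"first frame must have shape {expected}")
    ws = inputs.world_state_shape
    if len(ws) != 4 or ws[1:] != expected:
        problems.append(f"world-state video must have shape (T, {HEIGHT}, {WIDTH}, {CHANNELS})")
    if not isinstance(inputs.prompt, str):
        problems.append("prompt must be a string")
    return problems


ok = ConditioningInputs((704, 1280, 3), (8, 704, 1280, 3), "rainy night")
bad = ConditioningInputs((720, 1280, 3), (704, 1280, 3))
print(validate(ok))
print(validate(bad))
```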

Outputs:

Output Type: Video

Output Format:

  • Video: MP4

Output Parameters:

  • Video: Three-Dimensional (3D)

Other Properties Related to Output: The output resolution is 704 x 1280.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine: Not Applicable. Uses Python scripts and PyTorch.

Supported Hardware Microarchitecture Compatibility:
NVIDIA Blackwell (B300, RTX 6000 Pro), NVIDIA Hopper (H100)

Preferred/Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

  • Omniverse-Dreams-v1-2B: Single-view model generating the front camera view from the ego vehicle.

Checkpoints:

Four single-view checkpoints are released under single_view/ in the model repository. They cover the three training-pipeline stages — a 189-frame bidirectional HD-map–conditioned teacher (L1b), a causal Diffusion Forcing student initialization (L2a), and the Self-Forcing + DMD distilled chunk-2 model (L0) — plus a runtime-packaged release of L0:

| File | Role | Use case |
|---|---|---|
| single_view/2b_res720p_30fps_i2v_hdmap_distilled.pt | Released distilled · chunk-2 (~52 FPS @ 720p on 1× GB300) | Real-time inference target. Loaded via the flashdreams runtime. |
| single_view/distilled/e5cadda3-8f52-43b2-b621-aa3d4c9f0588_model.pt | L0 distilled: causal few-step (Self-Forcing + DMD) | Resumable input for further self-forcing distillation experiments in samples/post-training. |
| single_view/teacher/3b4c21d0-7b77-4694-9d9d-6ac9b6dbba51_model.pt | L1b teacher: bidirectional, HD-map conditioned, 189-frame | Input as net_real_score_ckpt (teacher) for distillation. |
| single_view/student-init/a12bf26e-8855-4ff0-a651-e7c2a0cb0697_model.pt | L2a student-init: causal many-step (Diffusion Forcing) | Input as the student initialization for distillation. |
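To put the quoted throughput figure in context, the arithmetic below converts frames per second into a per-frame time budget. It assumes that "30fps" in the checkpoint filename refers to the video frame rate, which the card does not state explicitly; the ~52 FPS figure is taken from the table as-is and is not re-measured here.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available (or spent) per frame, in milliseconds."""
    return 1000.0 / fps


generated_fps = 52.0  # "~52 FPS @ 720p" quoted for the released distilled checkpoint
playback_fps = 30.0   # assumption: taken from "30fps" in the checkpoint filename

real_time_factor = generated_fps / playback_fps

print(round(frame_budget_ms(generated_fps), 1))  # 19.2 ms spent per generated frame
print(round(frame_budget_ms(playback_fps), 1))   # 33.3 ms available per frame at 30 fps
print(round(real_time_factor, 2))                # 1.73x faster than playback
```

Under that assumption, generation would leave roughly 14 ms of headroom per frame for the rest of the simulation loop.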

[Diagram: the pipeline from the Cosmos foundation model and ~21k hours of real driving data to the released runtime artifact.]

The UUID-named files match identifiers in post-training/omnidreams/checkpoints_omnidreams.py. The friendly-named release checkpoint is the chunk-2 runtime artifact for flashdreams; the three UUID-named checkpoints are training inputs for advanced users who want to extend or reproduce the Self-Forcing + DMD distillation pipeline.
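One way to keep the four files straight is a small role-to-path mapping, sketched below. The role keys (`runtime`, `distilled`, `teacher`, `student_init`) and the `fetch` helper are hypothetical names chosen for this example. `hf_hub_download` is the standard download function from the `huggingface_hub` package, and the repository ID is taken from the release URL in this card. Downloading may require logging in and accepting the model's access conditions on Hugging Face first.

```python
REPO_ID = "nvidia/omni-dreams-models"  # from the Hugging Face release URL in this card

# Hypothetical role names mapped to the file paths listed in the checkpoint table.
CHECKPOINTS = {
    "runtime": "single_view/2b_res720p_30fps_i2v_hdmap_distilled.pt",
    "distilled": "single_view/distilled/e5cadda3-8f52-43b2-b621-aa3d4c9f0588_model.pt",
    "teacher": "single_view/teacher/3b4c21d0-7b77-4694-9d9d-6ac9b6dbba51_model.pt",
    "student_init": "single_view/student-init/a12bf26e-8855-4ff0-a651-e7c2a0cb0697_model.pt",
}


def fetch(role: str) -> str:
    """Download one checkpoint and return its local path (needs network access)."""
    # Imported lazily so the mapping above can be used without the package installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINTS[role])


for role, path in CHECKPOINTS.items():
    print(f"{role:>12}: {path}")
```

Calling `fetch("runtime")` would retrieve only the runtime artifact, which is likely all that is needed for inference; the other three matter mainly when reproducing the distillation pipeline.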

Training, Testing, and Evaluation Datasets:

Dataset Overview

Training Dataset:

Internal Driving Dataset.

Data Modality:

  • Text
  • Video

Text Training Data Size:

  • 1 Billion to 10 Trillion Tokens

Video Training Data Size:

  • 10,000 to 1 Million Hours

Data Collection Method:

  • Automatic/Sensors

Labeling Method:

  • Automatic/Sensors

Properties:

The training dataset consists of approximately 4 million multi-view driving video recordings, each captured as a 20-second clip from vehicle-mounted cameras. Each recording contains synchronized views from multiple cameras covering the surrounding driving environment.

In addition to video data, each recording includes textual descriptions and structured map annotations. The textual descriptions provide high-level scene context such as environmental conditions or scenario attributes, while the map annotations encode structured information about the driving environment, including lane geometry and object-level information derived from onboard sensors.

The dataset primarily consists of sensor-collected real-world driving data processed through automated perception and mapping pipelines. Data modalities include multi-view video frames, structured scene annotations, and associated textual metadata used to condition the generative world model during training.
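As a sanity check on the sizes quoted in this section, the clip count and clip length can be converted to hours. The sketch below counts each recording once; the number of cameras per recording is not given in this card, so the `cameras` parameter is a hypothetical knob left at 1.

```python
def video_hours(clips: int, clip_seconds: float, cameras: int = 1) -> float:
    """Total footage in hours; `cameras` scales the result for per-camera totals."""
    return clips * clip_seconds * cameras / 3600.0


recording_hours = video_hours(clips=4_000_000, clip_seconds=20)

print(round(recording_hours))                    # 22222
print(10_000 <= recording_hours <= 1_000_000)    # True: inside the stated size bucket
```

About 22,000 recording-hours is in the same range as the ~21k hours quoted for the training pipeline, and falls within the "10,000 to 1 Million Hours" bucket.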

Testing Dataset:

Internal Driving Dataset.

Data Collection Method:

  • Automatic/Sensors

Labeling Method:

  • Automatic/Sensors

Properties:

The testing split is a held-out 5% subset of the same dataset used for training.

Evaluation Dataset:

Internal Driving Dataset.

Data Collection Method:

  • Automatic/Sensors

Labeling Method:

  • Automatic/Sensors

Properties:

The evaluation split is a held-out 5% subset of the same dataset used for training.
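The card does not say how the held-out subsets were drawn, or whether the testing and evaluation subsets are disjoint. Purely as an illustration, one common way to carve out stable 5% splits is to hash each clip identifier into 100 buckets. Everything in this sketch, including the `assign_split` name, the clip ID format, and the choice to make the two held-out sets disjoint, is an assumption.

```python
import hashlib
from collections import Counter


def assign_split(clip_id: str) -> str:
    """Map a clip ID to a split with a stable hash, so reruns give the same answer."""
    bucket = int(hashlib.sha256(clip_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 5:
        return "test"   # buckets 0-4: about 5% of clips
    if bucket < 10:
        return "eval"   # buckets 5-9: another, disjoint 5%
    return "train"


# Hypothetical clip IDs, used only to show the resulting proportions.
counts = Counter(assign_split(f"clip-{i:07d}") for i in range(10_000))
print(counts)
```

Hashing the identifier, rather than sampling at random, keeps a clip in the same split even when the dataset grows or is reprocessed.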

Inference:

Acceleration Engine: flashdreams (TensorRT, CUDA Graphs)
Test Hardware:

  • NVIDIA Hopper (H100)
  • NVIDIA Blackwell (B300, RTX 6000 Pro)

Limitations

Despite various improvements in world generation for Physical AI, models may struggle to generate long, high-resolution videos without artifacts. Common issues include temporal inconsistency, camera and object motion instability, and imprecise interactions. The models may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems.
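When validating rollouts for the artifacts described above, a very crude first pass is to flag abrupt changes between consecutive frames. The sketch below is a heuristic invented for illustration, not the evaluation method used for this model: frames are represented as flat lists of pixel intensities, and `mean_abs_diff`, `flag_jumps`, and the threshold are hypothetical.

```python
def mean_abs_diff(a: list, b: list) -> float:
    """Average absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def flag_jumps(frames: list, threshold: float) -> list:
    """Indices of frames that differ from their predecessor by more than `threshold`."""
    return [
        i for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]


# Toy 4-pixel "frames": smooth change, then a sudden jump at index 3.
toy_frames = [[0] * 4, [1] * 4, [2] * 4, [200] * 4, [201] * 4]
print(flag_jumps(toy_frames, threshold=50))  # [3]
```

A check like this catches only gross discontinuities. Legitimate scene changes can trip it, and slow drift such as gradually morphing objects will not, so it complements rather than replaces visual inspection and other evaluation methods.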

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content. If an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur the subjects included or maintain their proportions.

For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Bias

Field Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing: Not Applicable.
Measures taken to mitigate against unwanted bias: Training data is collected from diverse real-world driving environments and processed through standardized perception and mapping pipelines.
Bias Metric (If Measured): Not Applicable.

Explainability

Field Response
Intended Task/Domain: Generative Closed-Loop Autonomous Vehicle Simulation
Model Type: Transformer
Intended Users: Researchers and developers working on autonomous driving simulation and generative world models.
Output: Video
Describe how the model works: OmniDreams is an action-conditioned generative world model that synthesizes photorealistic camera observations for autonomous driving simulation. The model receives a structured simulator state (e.g., lane geometry and actor bounding boxes), optional text prompts describing environmental conditions, and a temporal memory of previous frames. These inputs are encoded and processed through a transformer-based diffusion architecture to generate the next set of camera frames. Generation occurs autoregressively at each simulation step, allowing the model to respond to policy actions and maintain temporal consistency across long rollouts.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Not Applicable
Technical Limitations & Mitigation: As a generative simulation model, OmniDreams may produce inconsistencies in rare or complex scenarios, such as environments significantly different from the training distribution. Mitigation strategies include validating simulation outputs through automated evaluation and visual inspection.
Verified to have met prescribed NVIDIA quality standards: Yes
Performance Metrics: Image quality metrics, temporal consistency, latency, throughput, and visual inspection of generated sensor observations in closed-loop rollouts.
Potential Known Risks: This model is intended for simulation purposes and is not intended for deployment in a vehicle. This model may generate artifacts. Potential known risks include temporal inconsistency, camera and object motion instability, and imprecise interactions. The model may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems. The model can generate all forms of video, including content that may be considered toxic, offensive, or indecent.
Licensing: Use of the model is governed by the NVIDIA Open Model License Agreement.

Privacy

Field Response
Generatable or reverse engineerable personal data? No
Personal data used to create this model? Yes
Was consent obtained for any personal data used? Not Applicable
How often is dataset reviewed? During dataset creation, model training, evaluation and before release
Is a mechanism in place to honor data subject right of access or deletion of personal data? Yes
If personal data was collected for the development of the model, was it collected directly by NVIDIA? Yes
If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? Yes
If personal data was collected for the development of this AI model, was it minimized to only what was required? Yes
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? No
Is there provenance for all datasets used in training? Yes
Does data labeling (annotation, metadata) comply with privacy laws? Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? Yes
Applicable Privacy Policy: https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety

Field Response
Model Application Field(s): Autonomous Vehicles Simulation
Describe the life critical impact (if present): This model is intended for simulation purposes and is not intended for deployment in a vehicle. Simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems.
Use Case Restrictions: Use of the model is governed by the NVIDIA Open Model License Agreement.
Model and dataset restrictions: The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access restrictions are enforced during training, and dataset license constraints are adhered to.