NVIDIA OmniDreams: Real-Time Generative Closed-Loop Autonomous Vehicle Simulation
Model Overview
Description:
NVIDIA OmniDreams generates photorealistic camera observations for autonomous driving simulation conditioned on simulator state and driving actions.
OmniDreams is an action-conditioned generative world model developed by NVIDIA that synthesizes multi-camera sensor observations in response to policy actions within a closed-loop simulation environment. The model operates autoregressively and maintains temporal context to support long-horizon rollouts while remaining responsive to changes in driving actions. It is designed to enable interactive testing of autonomous vehicle policies in photorealistic generative environments.
OmniDreams was developed by NVIDIA as part of the Cosmos world foundation model family and integrated into a closed-loop workflow with Alpamayo policies and the AlpaSim simulation runtime.
This model is ready for commercial/non-commercial use.
A 20-second OmniDreams rollout — generated front-view camera output.
License/Terms of Use:
Governing Download Terms: Use of the model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography:
Global
Use Case:
OmniDreams is intended for researchers and developers working on autonomous vehicle simulation and machine learning systems.
Release Date:
GitHub 06/02/2026 via public repo https://github.com/nv-tlabs/omni-dreams
Hugging Face 06/02/2026 via public repo https://huggingface.co/nvidia/omni-dreams-models
Model Architecture:
Architecture Type: Transformer
Network Architecture: Diffusion Transformer
- Built on NVIDIA Cosmos. This model was developed based on NVIDIA Cosmos-Predict2.5.
- Number of model parameters: 2B.
Inputs:
Input Types: Image, Video (world state), Text
Input Formats:
- Image: Red, Green, Blue (RGB). First camera frame.
- Video: Red, Green, Blue (RGB) image frames. Lane lines and bounding box representation of the world scenario.
- Text: String. Natural language prompt describing environmental conditions.
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
- Video: Three-Dimensional (3D)
Other Properties Related to Input: The input resolution is 704 x 1280.
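The three inputs listed above can be checked before they are passed to a runtime. The following is a minimal sketch, not part of the OmniDreams code: the `ConditioningInputs` class, the `validate` function, and the height-width-channel axis order are assumptions made for illustration. Only the 704 x 1280 resolution and the RGB format come from this card.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Resolution and channel count stated in this card (704 x 1280, RGB).
HEIGHT, WIDTH, CHANNELS = 704, 1280, 3


@dataclass
class ConditioningInputs:
    """Hypothetical container for the shapes of one conditioning request."""
    first_frame_shape: Tuple[int, ...]   # assumed layout: (H, W, C)
    world_video_shape: Tuple[int, ...]   # assumed layout: (T, H, W, C)
    prompt: Optional[str] = None         # free-form text describing conditions


def validate(inputs: ConditioningInputs) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    expected_frame = (HEIGHT, WIDTH, CHANNELS)
    if tuple(inputs.first_frame_shape) != expected_frame:
        errors.append(f"first frame must be {expected_frame}")
    video = tuple(inputs.world_video_shape)
    if len(video) != 4 or video[1:] != expected_frame:
        errors.append(f"world-state video must be (T, {HEIGHT}, {WIDTH}, {CHANNELS})")
    elif video[0] < 1:
        errors.append("world-state video needs at least one frame")
    if inputs.prompt is not None and not isinstance(inputs.prompt, str):
        errors.append("prompt must be a string")
    return errors


ok = ConditioningInputs((704, 1280, 3), (60, 704, 1280, 3), "rainy night")
bad = ConditioningInputs((720, 1280, 3), (60, 704, 1280, 3))
print(validate(ok))
print(validate(bad))
```

A 720 x 1280 frame is rejected in this sketch because the card gives 704 x 1280 as the input resolution, even though the released checkpoint file name mentions 720p.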
Output
Output Type: Video
Output Format:
- Video: MP4
Output Parameters:
- Video: Three-Dimensional (3D)
Other Properties Related to Output: The output resolution is 704 x 1280.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine: Not Applicable - Uses Python Scripts and PyTorch
Supported Hardware Microarchitecture Compatibility:
NVIDIA Blackwell (B300, RTX 6000 Pro), NVIDIA Hopper (H100)
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s):
- Omniverse-Dreams-v1-2B: Single-view model generating the front camera view from the ego vehicle.
Checkpoints:
Four single-view checkpoints are released under single_view/ in the model repository. They cover the three training-pipeline stages — a 189-frame bidirectional HD-map–conditioned teacher (L1b), a causal Diffusion Forcing student initialization (L2a), and the Self-Forcing + DMD distilled chunk-2 model (L0) — plus a runtime-packaged release of L0:
| File | Role | Use case |
|---|---|---|
| single_view/2b_res720p_30fps_i2v_hdmap_distilled.pt | Released distilled · chunk-2 (~52 FPS @ 720p on 1× GB300) | Real-time inference target. Loaded via the flashdreams runtime. |
| single_view/distilled/e5cadda3-8f52-43b2-b621-aa3d4c9f0588_model.pt | L0 distilled: causal few-step (Self-Forcing + DMD) | Resumable input for further self-forcing distillation experiments in samples/post-training. |
| single_view/teacher/3b4c21d0-7b77-4694-9d9d-6ac9b6dbba51_model.pt | L1b teacher: bidirectional, HD-map conditioned, 189-frame | Input as net_real_score_ckpt (teacher) for distillation. |
| single_view/student-init/a12bf26e-8855-4ff0-a651-e7c2a0cb0697_model.pt | L2a student-init: causal many-step (Diffusion Forcing) | Input as the student initialization for distillation. |
The training pipeline runs from the Cosmos foundation model and ~21k hours of real driving data to the released runtime artifact.
The UUID-named files match identifiers in post-training/omnidreams/checkpoints_omnidreams.py. The friendly-named release checkpoint is the chunk-2 runtime artifact for flashdreams; the three UUID-named checkpoints are training inputs for advanced users who want to extend or reproduce the Self-Forcing + DMD distillation pipeline.
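One way to fetch a single checkpoint is through the `huggingface_hub` client. The sketch below assumes the repository id `nvidia/omni-dreams-models` (taken from the Hugging Face URL under Release Date) and uses the file paths from the table above. The role names and the `fetch_checkpoint` helper are hypothetical labels introduced here. Loading the file is not shown, because this card does not describe the flashdreams loading API.

```python
# Repository id inferred from the Hugging Face URL given in this card.
REPO_ID = "nvidia/omni-dreams-models"

# Role names are illustrative labels; the paths are copied from the table.
CHECKPOINTS = {
    "runtime": "single_view/2b_res720p_30fps_i2v_hdmap_distilled.pt",
    "distilled": "single_view/distilled/e5cadda3-8f52-43b2-b621-aa3d4c9f0588_model.pt",
    "teacher": "single_view/teacher/3b4c21d0-7b77-4694-9d9d-6ac9b6dbba51_model.pt",
    "student_init": "single_view/student-init/a12bf26e-8855-4ff0-a651-e7c2a0cb0697_model.pt",
}


def fetch_checkpoint(role: str) -> str:
    """Download one checkpoint and return its local cache path.

    Requires `pip install huggingface_hub` and network access; the import
    is deferred so the mapping above can be used without the dependency.
    """
    from huggingface_hub import hf_hub_download

    if role not in CHECKPOINTS:
        raise KeyError(f"unknown role {role!r}; choose from {sorted(CHECKPOINTS)}")
    return hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINTS[role])


if __name__ == "__main__":
    # Only the runtime artifact is needed for real-time inference.
    print(fetch_checkpoint("runtime"))
```

Users who only want to run inference would typically need the `runtime` entry; the other three are training inputs for the distillation pipeline.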
Training, Testing, and Evaluation Datasets:
Dataset Overview
Training Dataset:
Internal Driving Dataset.
Data Modality:
- Text
- Video
Text Training Data Size:
- 1 Billion to 10 Trillion Tokens
Video Training Data Size:
- 10,000 to 1 Million Hours
Data Collection Method:
- Automatic/Sensors
Labeling Method:
- Automatic/Sensors
Properties:
The training dataset consists of approximately 4 million multi-view driving video recordings, each captured as a 20-second clip from vehicle-mounted cameras. Each recording contains synchronized views from multiple cameras covering the surrounding driving environment.
In addition to video data, each recording includes textual descriptions and structured map annotations. The textual descriptions provide high-level scene context such as environmental conditions or scenario attributes, while the map annotations encode structured information about the driving environment, including lane geometry and object-level information derived from onboard sensors.
The dataset primarily consists of sensor-collected real-world driving data processed through automated perception and mapping pipelines. Data modalities include multi-view video frames, structured scene annotations, and associated textual metadata used to condition the generative world model during training.
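As a quick consistency check on the figures above, the clip count and clip length can be converted to hours. This is a back-of-the-envelope sketch: it counts each multi-view recording once rather than once per camera, and it treats "approximately 4 million" as exactly 4,000,000.

```python
# Figures stated in this card.
NUM_RECORDINGS = 4_000_000      # "approximately 4 million" recordings
CLIP_SECONDS = 20               # each recording is a 20-second clip
HELD_OUT_FRACTION = 0.05        # size of each held-out split

# Total duration, counting each multi-view recording once.
total_hours = NUM_RECORDINGS * CLIP_SECONDS / 3600

# Approximate number of recordings in a 5% held-out split.
held_out_recordings = round(NUM_RECORDINGS * HELD_OUT_FRACTION)

print(f"total duration: {total_hours:,.0f} hours")
print(f"held-out split: {held_out_recordings:,} recordings")
```

The result is about 22,000 hours, which is in line with the ~21k hours quoted for the training pipeline and falls inside the 10,000 to 1 Million Hours range listed for video training data.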
Testing Dataset:
Internal Driving Dataset.
Data Collection Method:
- Automatic/Sensors
Labeling Method:
- Automatic/Sensors
Properties: The testing split is a held-out 5% subset of the same dataset used for training.
Evaluation Dataset:
Internal Driving Dataset.
Data Collection Method:
- Automatic/Sensors
Labeling Method:
- Automatic/Sensors
Properties: The evaluation split is a held-out 5% subset of the same dataset used for training.
Inference:
Acceleration Engine: flashdreams (TensorRT, CUDA Graphs)
Test Hardware:
- NVIDIA Hopper (H100)
- NVIDIA Blackwell (B300, RTX 6000 Pro)
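The throughput quoted for the released distilled checkpoint (~52 FPS at 720p on one GB300) can be turned into a per-frame latency budget. The sketch below assumes a 30 fps playback rate, inferred from the `30fps` token in the checkpoint file name, and assumes the quoted figure is sustained generated frames per second. The `realtime_budget` helper is illustrative only.

```python
def realtime_budget(gen_fps: float, playback_fps: float, rollout_seconds: float) -> dict:
    """Compare generation speed against the playback rate of a rollout."""
    frames = round(playback_fps * rollout_seconds)
    return {
        "frames": frames,                                   # frames in the rollout
        "ms_per_frame": 1000.0 / gen_fps,                   # time spent generating one frame
        "budget_ms_per_frame": 1000.0 / playback_fps,       # time available per frame
        "realtime_factor": gen_fps / playback_fps,          # > 1 means faster than real time
        "wall_clock_s": frames / gen_fps,                   # time to generate the whole rollout
    }


# 20-second rollout, as in the example video described in this card.
stats = realtime_budget(gen_fps=52.0, playback_fps=30.0, rollout_seconds=20.0)
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions a 20-second rollout contains 600 frames and takes roughly 11.5 seconds to generate, leaving headroom within the 33.3 ms per-frame budget for the policy and simulator to run in the loop.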
Limitations
Despite various improvements in world generation for Physical AI, models may struggle to generate long, high-resolution videos without artifacts. Common issues include temporal inconsistency, camera and object motion instability, and imprecise interactions. The models may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | Not Applicable. |
| Measures taken to mitigate against unwanted bias: | Training data is collected from diverse real-world driving environments and processed through standardized perception and mapping pipelines. |
| Bias Metric (If Measured): | Not Applicable. |
Explainability
| Field | Response |
|---|---|
| Intended Task/Domain: | Generative Closed-Loop Autonomous Vehicle Simulation |
| Model Type: | Transformer |
| Intended Users: | Researchers and developers working on autonomous driving simulation and generative world models. |
| Output: | Video |
| Describe how the model works: | OmniDreams is an action-conditioned generative world model that synthesizes photorealistic camera observations for autonomous driving simulation. The model receives a structured simulator state (e.g., lane geometry and actor bounding boxes), optional text prompts describing environmental conditions, and a temporal memory of previous frames. These inputs are encoded and processed through a transformer-based diffusion architecture to generate the next set of camera frames. Generation occurs autoregressively at each simulation step, allowing the model to respond to policy actions and maintain temporal consistency across long rollouts. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | As a generative simulation model, OmniDreams may produce inconsistencies in rare or complex scenarios, such as environments significantly different from the training distribution. Mitigation strategies include validating simulation outputs through automated evaluation and visual inspection. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Image quality metrics, temporal consistency, latency, throughput, and visual inspection of generated sensor observations in closed-loop rollouts. |
| Potential Known Risks: | This model is intended for simulation purposes and is not intended for deployment in a vehicle. This model may generate artifacts. Potential known risks include temporal inconsistency, camera and object motion instability, and imprecise interactions. The model may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems. The model can generate all forms of video, including content that may be considered toxic, offensive, or indecent. |
| Licensing: | Use of the model is governed by the NVIDIA Open Model License Agreement. |
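The closed-loop behaviour described in the "Describe how the model works" row can be sketched as a loop in which a policy, a simulator, and the world model take turns. Everything in the block below is a toy stand-in: `ToyWorldModel`, `toy_policy`, `toy_simulator_step`, the `Frame` placeholder, and the memory length of 8 are hypothetical and do not reflect the OmniDreams, Alpamayo, or AlpaSim APIs. Only the 704 x 1280 resolution comes from this card.

```python
from collections import deque
from dataclasses import dataclass

HEIGHT, WIDTH = 704, 1280  # output resolution stated in this card


@dataclass
class Frame:
    """Placeholder for one generated RGB frame (pixel data omitted)."""
    height: int
    width: int
    brightness: int


class ToyWorldModel:
    """Stand-in for the generative model; keeps a bounded temporal memory."""

    def __init__(self, memory_frames: int = 8):
        self.memory = deque(maxlen=memory_frames)  # previous frames as context

    def step(self, sim_state: dict, prompt: str = "") -> Frame:
        # A real model would denoise the next frames conditioned on the
        # simulator state, the prompt, and self.memory. Here the frame
        # simply encodes the ego position so the loop is observable.
        frame = Frame(HEIGHT, WIDTH, int(sim_state["ego_x"] * 10) % 256)
        self.memory.append(frame)
        return frame


def toy_policy(frame: Frame) -> dict:
    """Hypothetical driving policy: constant gentle acceleration."""
    return {"accel": 1.0, "steer": 0.0}


def toy_simulator_step(state: dict, action: dict, dt: float = 1 / 30) -> dict:
    """Advance a 1D ego state by one time step using the policy action."""
    speed = state["speed"] + action["accel"] * dt
    return {"ego_x": state["ego_x"] + speed * dt, "speed": speed}


def rollout(num_steps: int, prompt: str = "clear daytime"):
    model = ToyWorldModel()
    state = {"ego_x": 0.0, "speed": 10.0}
    frame = model.step(state, prompt)            # first observation
    frames = [frame]
    for _ in range(num_steps):
        action = toy_policy(frame)               # policy reacts to the last frame
        state = toy_simulator_step(state, action)  # simulator updates world state
        frame = model.step(state, prompt)        # model renders the new state
        frames.append(frame)
    return frames, state, model


frames, final_state, model = rollout(20)
print(len(frames), len(model.memory), round(final_state["speed"], 3))
```

The bounded `deque` illustrates why long rollouts stay tractable in such a design: the context passed to the model has a fixed size regardless of how many steps have elapsed.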
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | Yes |
| Was consent obtained for any personal data used? | Not Applicable |
| How often is dataset reviewed? | During dataset creation, model training, evaluation and before release |
| Is a mechanism in place to honor data subject right of access or deletion of personal data? | Yes |
| If personal data was collected for the development of the model, was it collected directly by NVIDIA? | Yes |
| If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Yes |
| If personal data was collected for the development of this AI model, was it minimized to only what was required? | Yes |
| Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application Field(s): | Autonomous Vehicles Simulation |
| Describe the life critical impact (if present). | This model is intended for simulation purposes and is not intended for deployment in a vehicle. Simulation outputs should be validated and used alongside other evaluation methods when assessing autonomous driving systems. |
| Use Case Restrictions: | Use of the model is governed by the NVIDIA Open Model License Agreement. |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to. |