---
library_name: diffusers
license: other
license_name: openmdw1.1-license
license_link: https://openmdw.ai/license/1-1/
pipeline_tag: any-to-any
tags:
- nvidia
- cosmos
- cosmos3
- vllm
- vllm-omni
- text, image, video, audio, and action generation
- omnimodel
---

# **Cosmos 3: Omnimodal World Models for Physical AI**

**[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[Paper](https://huggingface.co/papers/2606.02800)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**

[NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.

Cosmos 3 is a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture.

## Sample Usage (Diffusers)

Cosmos 3 is fully supported within the Hugging Face `diffusers` library.
```python
import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

# Load the pipeline in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# Swap in the UniPC multistep scheduler, reusing the existing scheduler config.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

# Text-to-video generation: no conditioning image, sound disabled.
result = pipe(
    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
    negative_prompt="",
    image=None,
    num_frames=189,  # 189 frames at 24 fps is roughly 7.9 s of video
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    # Fixed seed for reproducible output.
    generator=torch.Generator(device="cuda").manual_seed(1234),
)
# Write the generated frames to an MP4 file at 24 fps.
export_to_video(result.video, "cosmos3_t2v.mp4", fps=24, macro_block_size=1)
```

# Model Overview: Cosmos3-Nano

## Description

Cosmos3 is a collection of omnimodal world models capable of generating dynamic, high-quality video, images, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.

This model is ready for commercial and non-commercial use.

**Model Developer:** NVIDIA

### Model Versions

- Cosmos3-Nano:
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
- Cosmos3-Super:
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
- Cosmos3-Nano-Policy-DROID:
  - Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
- Cosmos3-Super-Image2Video:
  - Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.
- Cosmos3-Super-Text2Image:
  - Given text input, generate high-fidelity images that are consistent with the provided description.

### License

This model is released under the [OpenMDW1.1](https://openmdw.ai/license/1-1/) license.

## Model Architecture

**Architecture Type:** Transformer

**Network Architecture:** Mixture-of-Transformers (MoT)

Cosmos3 is an omnimodal foundation model built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. During inference, text is generated through standard next-token autoregressive decoding, while non-text modalities, such as images, video, audio, and actions, are synthesized through iterative denoising.

**Number of trainable model parameters:**

- Cosmos3-Nano: 16B
- Cosmos3-Super: 64B

## Software Integration

**Runtime Engine(s):**

- [PyTorch](https://github.com/nvidia/cosmos3)
- [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
- [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)

**Supported Hardware Microarchitecture Compatibility:**

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

## Training, Testing, and Evaluation Datasets

### Dataset Overview

- **Total Size:** 1.3B data points
- **Total Number of Datasets:** 393 dataset entries

The training, testing, and evaluation datasets consist of diverse multimodal video, image, audio, action, synthetic, and sensor-conditioned data sourced from NVIDIA-owned data and publicly available, commercially permissive datasets.
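As a quick sanity check on the total size quoted above, the per-modality sample counts in the Data Modality and Training Data Size table that follows can be added up. The sketch below copies those counts by hand (in millions) and assumes the total is the plain sum of the reasoning and generation columns, which this card does not state explicitly. The variable and function names are illustrative only.

```python
# Sample counts in millions, copied by hand from this card's
# "Data Modality and Training Data Size" table.
# None stands for a "Not Applicable" cell.
counts_millions = {
    "Text":   {"reasoning": 22,   "generation": None},
    "Image":  {"reasoning": 19,   "generation": 767},
    "Video":  {"reasoning": 1,    "generation": 348},
    "Audio":  {"reasoning": None, "generation": 139},
    "Action": {"reasoning": None, "generation": 8},
}


def total_millions(table):
    """Sum every reported count, skipping 'Not Applicable' cells."""
    return sum(
        value
        for row in table.values()
        for value in row.values()
        if value is not None
    )


total = total_millions(counts_millions)
print(f"{total}M samples = {total / 1000:.1f}B")
```

Under that assumption the counts come to 1,304M samples, which rounds to the 1.3B data points listed in the overview.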
**Data Modality and Training Data Size**

| Modality | Reasoning Data Sample Count | Generation Data Sample Count |
| -------- | --------------------------- | ---------------------------- |
| Text | 22M | Not Applicable |
| Image | 19M | 767M |
| Video | 1M | 348M |
| Audio | Not Applicable | 139M |
| Action | Not Applicable | 8M |

## Benchmarks

Please see our [technical paper](https://huggingface.co/papers/2606.02800) for detailed evaluations of the model.

## Limitations

Cosmos3 may produce artifacts in long, high-resolution, or physically complex outputs. Common failure modes include temporal inconsistency, unstable camera or object motion, inaccurate sound-video alignment, imperfect action-state consistency, and physically implausible dynamics. The model has no explicit physics engine and only approximates physical laws.

## Inference

**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)

**Test Hardware:** GB200 and H100
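
For a rough sense of how the two model sizes relate to the test hardware, the weights-only memory footprint can be estimated from the parameter counts given under Model Architecture. This is a back-of-the-envelope sketch, not a measured requirement. It assumes every parameter is stored in bfloat16 (2 bytes each, matching the `torch_dtype` in the sample usage) and ignores activations, caches, and any auxiliary components. The helper name `weight_memory_gb` is hypothetical.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each value in 16 bits

# Parameter counts as listed in this card.
PARAMS = {"Cosmos3-Nano": 16e9, "Cosmos3-Super": 64e9}


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, num_params in PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(num_params):.0f} GB of weights in bfloat16")
```

Under these assumptions Cosmos3-Nano needs about 32 GB for its weights and Cosmos3-Super about 128 GB, so the larger model would not fit on a single 80 GB H100 without sharding across GPUs or using a lower-precision format.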