library_name: diffusers
license: other
license_name: openmdw1.1-license
license_link: https://openmdw.ai/license/1-1/
pipeline_tag: any-to-any
tags:
  - nvidia
  - cosmos
  - cosmos3
  - vllm
  - vllm-omni
  - text, image, video, audio, and action generation
  - omnimodel

Cosmos 3: Omnimodal World Models for Physical AI

Model Collection | Code | Paper | Website

NVIDIA Cosmos™ is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.

Cosmos 3 is a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture.

Sample Usage (Diffusers)

Cosmos 3 is fully supported within the Hugging Face diffusers library.

import torch
from diffusers import Cosmos3OmniPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

# Load the pipeline in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# Replace the default scheduler with UniPC multistep, using a flow shift of 10.0.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)

# Text-to-video: no conditioning image, 189 frames at 1280x720 and 24 fps, sound disabled.
result = pipe(
    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
    negative_prompt="",
    image=None,
    num_frames=189,
    height=720,
    width=1280,
    fps=24,
    num_inference_steps=35,
    guidance_scale=6.0,
    enable_sound=False,
    add_resolution_template=False,
    add_duration_template=False,
    generator=torch.Generator(device="cuda").manual_seed(1234),  # fixed seed for reproducibility
)

# Write the generated frames to an MP4 at the same frame rate used for generation.
export_to_video(result.video, "cosmos3_t2v.mp4", fps=24, macro_block_size=1)
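The sample above requests 189 frames at 24 fps. As a quick sanity check before a long generation run, the resulting clip length can be computed from those two values. The helper name `clip_duration_seconds` is hypothetical and is not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length of a clip with `num_frames` frames at `fps`."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# Values taken from the sample usage above.
duration = clip_duration_seconds(num_frames=189, fps=24)
print(f"{duration:.3f} s")  # 7.875 s
```

With the sample settings, the exported video is just under 8 seconds long.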

Model Overview: Cosmos3-Nano

Description

Cosmos3 is a collection of omnimodal world models capable of generating dynamic, high-quality video, images, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.

This model is ready for commercial and non-commercial use.

Model Developer: NVIDIA

Model Versions

  • Cosmos3-Nano:

    • Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
  • Cosmos3-Super:

    • Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
  • Cosmos3-Nano-Policy-DROID:

    • Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
  • Cosmos3-Super-Image2Video:

    • Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.
  • Cosmos3-Super-Text2Image:

    • Given text input, generate high-fidelity images that are consistent with the provided description.

License

This model is released under the OpenMDW1.1 license (https://openmdw.ai/license/1-1/).

Model Architecture

Architecture Type: Transformer

Network Architecture: Mixture-of-Transformers (MoT)

Cosmos3 is an omnimodal foundation model built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. During inference, text is generated through standard next-token autoregressive decoding, while non-text modalities, such as images, video, audio, and actions, are synthesized through iterative denoising.
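The two inference modes described above differ mainly in their control flow. The following is a toy sketch of that difference, not the Cosmos3 implementation: `toy_next_token` and `toy_denoise` are hypothetical stand-ins for the autoregressive and diffusion towers, and the "latent" is just a short list of floats.

```python
from typing import Callable, List


def autoregressive_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Append one discrete token at a time until EOS or the token budget is hit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def iterative_denoise(
    denoise: Callable[[List[float], int], List[float]],
    noise: List[float],
    num_steps: int,
) -> List[float]:
    """Refine the whole continuous sample for a fixed number of steps."""
    sample = list(noise)
    for step in range(num_steps):
        sample = denoise(sample, step)
    return sample


# Hypothetical stand-ins for the two towers (assumptions for illustration only).
def toy_next_token(tokens: List[int]) -> int:
    return tokens[-1] + 1


def toy_denoise(sample: List[float], step: int) -> List[float]:
    return [0.5 * value for value in sample]


text_tokens = autoregressive_decode(toy_next_token, prompt=[0], eos_id=3, max_new_tokens=10)
latent = iterative_denoise(toy_denoise, noise=[8.0, -4.0], num_steps=3)
print(text_tokens)  # [0, 1, 2, 3]
print(latent)       # [1.0, -0.5]
```

The text path grows its output one token per step and can stop early, while the denoising path updates every element of a fixed-size sample on each step and always runs the full step count (35 in the sample usage).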

Number of trainable model parameters:

  • Cosmos3-Nano: 16B
  • Cosmos3-Super: 64B
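For a rough sense of what these parameter counts mean for GPU memory, the sketch below estimates the size of the weights alone. It assumes that every listed parameter is stored at the given precision (bfloat16 in the sample usage). Activations and any other runtime buffers are not counted, so the figures are lower bounds rather than measured requirements.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for a parameter count."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as listed above.
MODEL_PARAMS = {"Cosmos3-Nano": 16e9, "Cosmos3-Super": 64e9}

for name, params in MODEL_PARAMS.items():
    print(f"{name}: {weight_memory_gb(params):.0f} GB in bfloat16")
# Cosmos3-Nano: 32 GB in bfloat16
# Cosmos3-Super: 128 GB in bfloat16
```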

Software Integration

Runtime Engine(s):

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
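A minimal sketch for checking whether a GPU belongs to one of the architectures listed above, based on its CUDA compute capability. The mapping from capability numbers to architecture names is an assumption taken from NVIDIA's public compute-capability tables, not from this model card, and the function name `listed_architecture` is hypothetical. A match does not guarantee that a specific GPU has enough memory to run the model.

```python
from typing import Optional, Tuple


def listed_architecture(capability: Tuple[int, int]) -> Optional[str]:
    """Map a (major, minor) compute capability to a listed architecture, if any."""
    major, minor = capability
    if major == 8 and minor != 9:  # 8.9 is Ada Lovelace, which is not listed
        return "NVIDIA Ampere"
    if major == 9:
        return "NVIDIA Hopper"
    if major in (10, 12):  # assumed Blackwell capability majors
        return "NVIDIA Blackwell"
    return None


print(listed_architecture((9, 0)))   # NVIDIA Hopper
print(listed_architecture((10, 0)))  # NVIDIA Blackwell
print(listed_architecture((8, 9)))   # None
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a tuple in this form.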

Training, Testing, and Evaluation Datasets

Dataset Overview

  • Total Size: 1.3B data points
  • Total Number of Datasets: 393 dataset entries

The training, testing, and evaluation datasets consist of diverse multimodal video, image, audio, action, synthetic, and sensor-conditioned data sourced from NVIDIA-owned data and publicly available, commercially permissive datasets.

Data Modality and Training Data Size

| Modality | Reasoning Data Sample Count | Generation Data Sample Count |
| --- | --- | --- |
| Text | 22M | Not Applicable |
| Image | 19M | 767M |
| Video | 1M | 348M |
| Audio | Not Applicable | 139M |
| Action | Not Applicable | 8M |
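As a consistency check, the per-modality counts can be summed and compared with the 1.3B total in the dataset overview. The sketch below copies the figures from the table, in millions, and treats "Not Applicable" entries as absent.

```python
# Sample counts in millions: (reasoning, generation); None marks "Not Applicable".
SAMPLES_M = {
    "Text": (22, None),
    "Image": (19, 767),
    "Video": (1, 348),
    "Audio": (None, 139),
    "Action": (None, 8),
}

reasoning_total = sum(r for r, _ in SAMPLES_M.values() if r is not None)
generation_total = sum(g for _, g in SAMPLES_M.values() if g is not None)
grand_total = reasoning_total + generation_total

print(reasoning_total, generation_total, grand_total)  # 42 1262 1304
```

The sum is about 1,304M samples, which agrees with the stated total of 1.3B data points. Generation data makes up roughly 97% of it.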

Benchmarks

Please see our technical paper for detailed evaluations of the model.

Limitations

Cosmos3 may produce artifacts in long, high-resolution, or physically complex outputs. Common failure modes include temporal inconsistency, unstable camera or object motion, inaccurate sound-video alignment, imperfect action-state consistency, and physically implausible dynamics. The model has no explicit physics engine and only approximates physical laws.

Inference

Acceleration Engine: PyTorch, vLLM, vLLM-Omni, Hugging Face Diffusers

Test Hardware: GB200 and H100