Update README.md
Browse files
README.md
CHANGED

@@ -8,131 +8,133 @@ tags:
  - nvidia
  - cosmos
  - diffusers
extra_gated_prompt: >-
  # NVIDIA Open Model License Agreement

- Version Release Date:
- This NVIDIA Open Model License Agreement (the
- identified, You and NVIDIA Corporation and its Affiliates
- collectively the "parties."

NVIDIA models released under this Agreement are intended to be used
- permissively and enable the further development of AI
- the terms of this Agreement, NVIDIA confirms that:
- * NVIDIA does not claim ownership to any outputs generated using the Models or
- Model Derivatives.

By using, reproducing, modifying, distributing, performing or displaying any
- portion or element of the Model or Derivative Model, or
- the terms of this Agreement, you agree to be bound by this Agreement.

## 1. Definitions

## 2. Conditions for Use, License Grant, AI Ethics and IP Ownership

## 3. Redistribution

- You may reproduce and distribute copies of the Model or Derivative Models
- the following conditions:
- 3.1. If you distribute the Model, You must give any other recipients of the Model a copy of this Agreement and include the following attribution notice within a "Notice" text file with such copies: "Licensed by NVIDIA Corporation under the NVIDIA Open Model License";
- 3.2. If you distribute or make available a NVIDIA Cosmos Model, or a product or service (including an AI model) that contains or uses a NVIDIA Cosmos Model, use a NVIDIA Cosmos Model to create a Derivative Model, or use a NVIDIA Cosmos Model or its outputs to create, train, fine tune, or otherwise improve an AI model, you will include "Built on NVIDIA Cosmos" on a related website, user interface, blogpost, about page, or product documentation; and
- 3.3. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Models as a whole, provided Your use, reproduction, and distribution of the Model otherwise complies with the conditions stated in this Agreement.
- content of the "Notice" text file.

- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Model, Derivative Models and
- outputs and assume any risks associated with Your exercise of permissions
- under this Agreement.**

- be liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a result of
- this Agreement or out of the use or inability to use the Model, Derivative
- Models or outputs (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all other
- commercial damages or losses), even if NVIDIA has been advised of the
- possibility of such damages.**

- ## 7.
- You will indemnify and hold
- third

- States and the laws of the State of Delaware, without regard to conflict of
- laws principles or the United Nations Convention on Contracts for the
- International Sale of Goods. The state and federal courts residing in Santa
- Clara County, California will have exclusive jurisdiction over any dispute or
- claim arising out of or related to this Agreement, and the parties irrevocably
- consent to personal jurisdiction and venue in those courts; except that,
- either party may apply for injunctive remedies or an equivalent type of urgent
- legal relief in any jurisdiction.

- ## 10. Trade and Compliance

- You agree to comply with all applicable export, import, trade and economic
- sanctions laws and regulations, as amended, including without limitation U.S.
- Export Administration Regulations and Office of Foreign Assets Control
- regulations. These laws include restrictions on destinations, end-users and
- end-use.
extra_gated_fields:
  By clicking Submit below, I accept the terms of the NVIDIA Open Model License Agreement and acknowledge that I am an adult of legal age of majority in the country in which the Cosmos Models will be used and have authority to accept this Agreement: checkbox
extra_gated_description: >-
@@ -144,7 +146,9 @@ extra_gated_button_content: Submit

# **Cosmos-Predict2.5: A Suite of Diffusion-based World Foundation Models**

- [**Cosmos**](https://huggingface.co/collections/nvidia/cosmos-

# Model Overview

@@ -152,7 +156,9 @@ extra_gated_button_content: Submit

**Cosmos-Predict2.5**: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware images, videos and world states for physical AI development.

- Cosmos-Predict2.5 diffusion models are a collection of diffusion based world foundation models that generate dynamic, high quality images and videos from text, image, or video inputs. It can serve as the building block for various applications or research that are related to world generation.

**Model Developer**: NVIDIA
@@ -160,16 +166,45 @@ Cosmos-Predict2.5 diffusion models are a collection of diffusion based world fou

The Cosmos-Predict2.5 diffusion-based model family includes the following models:

- - Cosmos-Predict2.5-2B
-   - Given a text, an image as the first frame, or a video predict the future frames.
    - Produces 720P video with 16FPS
    - Produces 720P video with 16FPS

### License

- This model is released under the [NVIDIA

### Deployment Geography:

@@ -181,17 +216,23 @@ Physical AI: encompassing robotics, autonomous vehicles (AV), and more.

### Release Date:

## Model Architecture

Cosmos-Predict2.5-2B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When image or video is provided as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to conditional latent frames to bridge the training and inference gap.

## Input/Output Specifications

* **Input**

- * **Input Type(s)**: Text
* **Input Format(s)**:
  * Text: String
  * Image: jpg, png, jpeg, webp
@@ -219,30 +260,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys

**Runtime Engine(s):**

- * [Cosmos-Predict2](https://github.com/nvidia-cosmos/cosmos-predict2)
- * [Diffusers](https://github.com/huggingface/diffusers)
-
- ```python
- import torch
- from diffusers import Cosmos2VideoToWorldPipeline
- from diffusers.utils import export_to_video, load_image
-
- # Available checkpoints: nvidia/Cosmos-Predict2-2B-Video2World, nvidia/Cosmos-Predict2-14B-Video2World
- model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
- pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
- pipe.to("cuda")
-
- prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
- negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
- image = load_image(
-     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yellow-scrubber.png"
- )
-
- video = pipe(
-     image=image, prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
- ).frames[0]
- export_to_video(video, "output.mp4", fps=16)
- ```

**Supported Hardware Microarchitecture Compatibility:**
|
@@ -252,15 +270,42 @@ export_to_video(video, "output.mp4", fps=16)
|
|
| 252 |
|
| 253 |
**Note**: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.
|
| 254 |
|
| 255 |
-
|
| 256 |
|
| 257 |
-
|
| 258 |
|
| 259 |
-
**
|
|
|
|
|
|
|
|
|
|
| 260 |
|
| 261 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
|
| 263 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 264 |
|
| 265 |
Video2World (720p, 16FPS): This model requires 32.54 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:
|
| 266 |
|
|
@@ -275,41 +320,24 @@ Video2World (720p, 16FPS): This model requires 32.54 GB of GPU VRAM. The followi

| L40S | 2567.1 s |
| RTX PRO 6000 Blackwell | 452.2 s |

- | GPU Hardware | Inference Runtime |
- | --------------------------------------- | ----------------- |
- | NVIDIA GB200 | 3.39 sec |
- | NVIDIA B200 | 3.24 sec |
- | NVIDIA RTX PRO 6000 Workstation Edition | 5.59 sec |
- | NVIDIA H200 SXM | 9.02 sec |
- | NVIDIA H200 NVL | 6.34 sec |
- | NVIDIA H100 PCIe | 11.12 sec |
- | NVIDIA H100 NVL | 5.05 sec |
- | NVIDIA H20 | 11.47 sec |
- | NVIDIA L40S | 8.9 sec |
- | NVIDIA RTX 6000 Ada Generation | 11.94 sec |

# Usage

- * See [Cosmos-Predict2](https://github.com/nvidia-cosmos/cosmos-predict2) for details.

- # Evaluation

- Evaluation details for this model are forthcoming. Please visit our [website](https://research.nvidia.com/labs/dir/cosmos-predict2/) for updates and detailed benchmarks once available.

- * Hybrid: Human,Automated

## Ethical Considerations

@@ -317,7 +345,7 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have establishe

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

- For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

### Plus Plus (++) Promise
@@ -329,4 +357,44 @@ We value you, the datasets, the diversity they represent, and what we have been

* Characterized for technical limitations.
* Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
* Reviewed before release.
- * Tagged for known restrictions and potential safety implications.

  - nvidia
  - cosmos
  - diffusers
+  - text2video
+  - image2video
+  - video2video
extra_gated_prompt: >-
  # NVIDIA Open Model License Agreement

+ Version Release Date: September 23, 2025

+ This NVIDIA Open Model License Agreement (the “Agreement”) is a legal
+ agreement between the Legal Entity You represent, or if no entity is
+ identified, You and NVIDIA Corporation and its Affiliates (“NVIDIA”) and
+ governs Your use of the Models that NVIDIA provides to You under this
+ Agreement. NVIDIA and You are each a “party” and collectively the “parties.”

NVIDIA models released under this Agreement are intended to be used
+ permissively and enable the further development of AI technologies. Subject
+ to the terms of this Agreement, NVIDIA confirms that:

+ - Models are commercially usable.
+ - You are free to create and distribute Derivative Models.
+ - NVIDIA does not claim ownership to any outputs generated using the Models or Model Derivatives.

By using, reproducing, modifying, distributing, performing or displaying any
+ portion or element of the Model or Derivative Model, or otherwise accepting
+ the terms of this Agreement, you agree to be bound by this Agreement.

## 1. Definitions

+ 1.1. **Derivative Model** means all (a) modifications to the Model, (b) works
+ based on the Model, and (c) any other derivative works of the Model. An
+ output is not a Derivative Model.

+ 1.2. **Legal Entity** means the union of the acting entity and all other
+ entities that control, are controlled by, or are under common control with
+ that entity. For the purposes of this definition, “control” means (a) the
+ power, direct or indirect, to cause the direction or management of such
+ entity, whether by contract or otherwise, or (b) ownership of fifty percent
+ (50%) or more of the outstanding shares, or (c) beneficial ownership of such
+ entity.

+ 1.3. **Model** means the machine learning model, software, checkpoints, learnt
+ weights, algorithms, parameters, configuration files and documentation shared
+ under this Agreement.

+ 1.4. **NVIDIA Cosmos Model** means a multimodal Model shared under this
+ Agreement.

+ 1.5. **Special-Purpose Model** means a Model that is only competent in a
+ narrow set of purpose-specific tasks and should not be used for unintended or
+ general-purpose applications.

+ 1.6. **You** or **Your** means an individual or Legal Entity exercising
+ permissions granted by this Agreement.

## 2. Conditions for Use, License Grant, AI Ethics and IP Ownership

+ ### 2.1. Conditions for Use
+ - The Model and any Derivative Model are subject to additional terms as described in Section 2 and Section 3 of this Agreement.
+ - If You institute copyright or patent litigation against any entity alleging that the Model or a Derivative Model constitutes infringement, then any licenses granted will terminate as of the date such litigation is filed.
+ - If You bypass or disable any technical limitation, safety guardrail, encryption, DRM, or authentication mechanism contained in the Model without a substantially similar Guardrail, your rights will terminate.
+ - NVIDIA may designate a Model as a Special-Purpose Model.
+ - NVIDIA may update this Agreement to comply with legal and regulatory requirements.

+ ### 2.2. License Grant
+ NVIDIA grants You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, revocable license to publicly perform, publicly display, reproduce, use, create derivative works of, make, have made, sell, offer for sale, distribute, and import the Model.

+ ### 2.3. AI Ethics
+ Use of the Models must be consistent with NVIDIA’s [Trustworthy AI terms](https://www.nvidia.com/en-us/agreements/trustworthy-ai/terms/).

+ ### 2.4. IP Ownership
+ - NVIDIA owns the Model and any Model Derivatives it creates.
+ - You own your Model Derivatives.
+ - NVIDIA claims no ownership rights in outputs.
+ - Except as expressly granted, NVIDIA reserves all rights.

## 3. Redistribution

+ You may reproduce and distribute copies of the Model or Derivative Models in
+ any medium, with or without modifications, provided that:

+ - **3.1.** You must provide recipients with a copy of this Agreement and
+ include this attribution in a “Notice” text file:
+ *“Licensed by NVIDIA Corporation under the NVIDIA Open Model License”*

+ - **3.2.** If distributing or making available a NVIDIA Cosmos Model, or
+ products/services derived from it, you must include:
+ *“Built on NVIDIA Cosmos”*

+ - **3.3.** You may add your own copyright statements and license terms for
+ your modifications, provided use still complies with this Agreement.

+ ## 4. Separate Components
+ The Models may include components licensed under separate legal notices (e.g., Open Source Software Licenses). These terms apply, except where overridden by this Agreement unless required by third-party license terms.

+ ## 5. Trademarks
+ No permission is granted to use NVIDIA’s trade names, trademarks, or product names, except for reasonable descriptive use.

+ ## 6. Disclaimer of Warranty
+ The Model is provided **“AS IS”**, without warranties of any kind, including title, non-infringement, merchantability, or fitness for purpose. You assume the risks associated with its use.

+ ## 7. Limitation of Liability
+ NVIDIA is not liable for damages (direct, indirect, incidental, or consequential) arising from use of the Model, unless required by law.

+ ## 8. Indemnity
+ You will indemnify and hold NVIDIA harmless against claims from third parties arising from your use or distribution of the Model, derivatives, or outputs.

+ ## 9. Feedback
+ NVIDIA may use any feedback you provide without restriction or compensation.

+ ## 10. Governing Law
+ This Agreement is governed by U.S. and Delaware law. Courts in Santa Clara County, California, have exclusive jurisdiction, except for urgent injunctive relief.

+ ## 11. Trade and Compliance
+ You must comply with all export, import, trade, and sanctions laws, including U.S. Export Administration Regulations and OFAC rules.
extra_gated_fields:
  By clicking Submit below, I accept the terms of the NVIDIA Open Model License Agreement and acknowledge that I am an adult of legal age of majority in the country in which the Cosmos Models will be used and have authority to accept this Agreement: checkbox
extra_gated_description: >-

# **Cosmos-Predict2.5: A Suite of Diffusion-based World Foundation Models**

+ [**Cosmos**](https://huggingface.co/collections/nvidia/cosmos-predict25-68bb63255f2fc206c5e5b346) | [**Code**](https://github.com/nvidia-cosmos/cosmos-predict2.5) | [**White Paper**](https://arxiv.org/abs/2511.00062) | [**Website**](https://research.nvidia.com/labs/dir/cosmos-predict2.5)
+
+ [NVIDIA Cosmos™](https://github.com/nvidia-cosmos) is a platform of state-of-the-art generative world foundation models, advanced tokenizers, guardrails, and an accelerated data processing and curation pipeline, purpose-built to accelerate the development of physical AI systems, such as autonomous vehicles (AVs) and robots.

# Model Overview

**Cosmos-Predict2.5**: A family of highly performant pre-trained world foundation models purpose-built for generating physics-aware images, videos and world states for physical AI development.

+ Cosmos-Predict2.5 diffusion models are a collection of diffusion-based world foundation models that generate dynamic, high-quality images and videos from text, image, or video inputs. They can serve as building blocks for applications and research related to world generation.
+
+ This model is ready for commercial/non-commercial use.

**Model Developer**: NVIDIA
The Cosmos-Predict2.5 diffusion-based model family includes the following models:

+ - Cosmos-Predict2.5-2B/Pre-trained
+   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
+   - Produces 720P video with 16FPS
+ - Cosmos-Predict2.5-2B/Post-trained
+   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
+   - Produces 720P video with 16FPS
+ - Cosmos-Predict2.5-2B/Auto/Multiview
+   - Given a text description, an image as the first frame, and/or a video, predict the world scenario in 7-camera views.
+   - Produces 720P video with 16FPS
+ - Cosmos-Predict2.5-2B/Robot/Multiview
+   - Given a text description, a static video, and two target camera trajectories, predict two re-rendered videos.
    - Produces 720P video with 16FPS
+ - Cosmos-Predict2.5-2B/Robot/Multiview-Agibot
+   - Given a text description, a head-view video, and two target hand-view camera trajectories, predict two head-view videos.
    - Produces 720P video with 16FPS
+ - Cosmos-Predict2.5-2B/Robot/Action-Cond
+   - Given an image as the first frame and a robot action sequence as the condition, predict the future frames.
+   - Produces 256p video with 4FPS

### License

+ This model is released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Additional Information: [Apache License 2.0](https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE).
+
+ For a custom license, please contact [cosmos-license@nvidia.com](mailto:cosmos-license@nvidia.com).
+
+ Under the NVIDIA Open Model License, NVIDIA confirms:
+
+ * Models are commercially usable.
+ * You are free to create and distribute Derivative Models.
+ * NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
+
+ **Important Note**: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, **safety guardrail** or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license) will automatically terminate.

### Deployment Geography:

### Release Date:

+ GitHub [10/06/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
+
+ Hugging Face [10/06/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25-68bb63255f2fc206c5e5b346

## Model Architecture

Cosmos-Predict2.5-2B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When image or video is provided as input, their latent frames are concatenated with the generated frames along the temporal dimension. Augment noise is added to conditional latent frames to bridge the training and inference gap.
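The adaptive-layer-norm conditioning and the temporal concatenation of conditional latents described above can be sketched in a few lines of NumPy. This is an illustrative toy, not NVIDIA's implementation; all shapes, weights, and helper names here are invented for the example:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, t_emb, w_scale, w_shift):
    # Adaptive layer norm: the embedded denoising timestep t_emb is projected
    # to a per-channel scale and shift that modulate the normalized tokens.
    scale = t_emb @ w_scale  # (batch, channels)
    shift = t_emb @ w_shift  # (batch, channels)
    return layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 8, 16))    # (batch, tokens, channels)
t_emb = rng.standard_normal((2, 4))         # embedded diffusion timestep
w_scale = 0.01 * rng.standard_normal((4, 16))
w_shift = 0.01 * rng.standard_normal((4, 16))
modulated = adaln(tokens, t_emb, w_scale, w_shift)

# Conditioning on an input clip: latent frames of the conditioning video are
# concatenated with the frames to be generated along the temporal axis, and a
# small amount of augmentation noise is added to the conditional frames.
cond = rng.standard_normal((2, 3, 8, 16))   # (batch, cond_frames, tokens, ch)
gen = rng.standard_normal((2, 5, 8, 16))    # frames to be denoised
noisy_cond = cond + 0.1 * rng.standard_normal(cond.shape)
latents = np.concatenate([noisy_cond, gen], axis=1)

print(modulated.shape)  # (2, 8, 16)
print(latents.shape)    # (2, 8, 8, 16)
```

In the real model the modulated tokens would then pass through self-attention, cross-attention on the text embedding, and a feedforward layer, with this pattern repeated per block.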

+ **This model was developed based on:** [Cosmos-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
+
+ **Number of model parameters:** 2,059,174,912

## Input/Output Specifications

* **Input**
+  * **Input Type(s)**: Text+Image, Text+Video
  * **Input Format(s)**:
    * Text: String
    * Image: jpg, png, jpeg, webp

**Runtime Engine(s):**

+ * [Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5)

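When scripting against the accepted image formats listed above, a path can be screened before it is handed to the pipeline. The helper below is illustrative only and not part of any NVIDIA tooling:

```python
from pathlib import Path

# Formats accepted for the image input, per the spec above.
ACCEPTED_IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}

def is_supported_image(path: str) -> bool:
    # Case-insensitive check of the file extension only; it does not
    # verify that the file contents actually match the extension.
    return Path(path).suffix.lower() in ACCEPTED_IMAGE_SUFFIXES

print(is_supported_image("first_frame.PNG"))  # True
print(is_supported_image("first_frame.gif"))  # False
```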
**Supported Hardware Microarchitecture Compatibility:**

**Note**: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.

+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

+ ## Training Dataset:
+
+ **Data Modality** <br>
+ * [Image] <br>
+ * [Text] <br>
+ * [Video] <br>
+
+ **Data Collection Method by dataset** <br>
+ * [Automated] <br>
+
+ **Labeling Method by dataset** <br>
+ * [Hybrid: Human, Automated] <br>
+
+ ### Testing Dataset:
+
+ **Data Collection Method by dataset** <br>
+ * [Automated] <br>
+
+ **Labeling Method by dataset** <br>
+ * [Hybrid: Human, Automated] <br>
+
+ # Evaluation

+ Please see our [technical paper](https://research.nvidia.com/publication/2025-09_world-simulation-video-foundation-models-physical-ai) for detailed evaluations of the base model.
+
+ **Data Collection Method**:
+ * Automated
+
+ **Labeling Method**:
+ * Hybrid: Human, Automated
+
+ **System Requirements and Performance**

Video2World (720p, 16FPS): This model requires 32.54 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:

| L40S | 2567.1 s |
| RTX PRO 6000 Blackwell | 452.2 s |

+ **Operating System(s):**
+ * Linux (We have not tested on other operating systems.)
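A quick sanity check on the 32.54 GB VRAM figure: in BF16 the 2,059,174,912 transformer parameters account for only about 4 GB of it, so most of the footprint comes from the other pipeline components and activations. Only the two published numbers come from this card; the split is simple arithmetic:

```python
# Back-of-the-envelope split of the reported 32.54 GB VRAM requirement.
N_PARAMS = 2_059_174_912      # parameter count from this model card
BYTES_PER_PARAM_BF16 = 2      # bfloat16 stores 2 bytes per parameter
REPORTED_TOTAL_GB = 32.54     # reported VRAM for 720p/16FPS generation

weights_gb = N_PARAMS * BYTES_PER_PARAM_BF16 / 1e9
other_gb = REPORTED_TOTAL_GB - weights_gb  # text encoder, tokenizer, activations, ...

print(f"transformer weights: {weights_gb:.2f} GB")  # transformer weights: 4.12 GB
print(f"everything else:     {other_gb:.2f} GB")    # everything else:     28.42 GB
```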
# Usage

+ * See [Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5) for details.

+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

+ ## Limitations

Despite various improvements in world generation for Physical AI, Cosmos-Predict2 video2world models still face technical and application limitations for world prediction. In particular, they struggle to generate long, high-resolution videos without artifacts. Common issues include temporal inconsistency, camera and object motion instability, and imprecise interactions. The models may inaccurately represent 3D space, 4D space-time, or physical laws in the generated videos, leading to artifacts such as disappearing or morphing objects, unrealistic interactions, and implausible motions. As a result, applying these models for applications that require simulating physical law-grounded environments or complex multi-agent dynamics remains challenging.

## Inference

**Acceleration Engine**: [PyTorch](https://pytorch.org/), [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)
**Test Hardware:** H100, A100, GB200
## Ethical Considerations
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
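As one deliberately minimal example of such a guardrail, a deployment might screen prompts before inference. The blocklist and `passes_guardrail` helper below are hypothetical placeholders, not a production safety mechanism; a dedicated safety classifier should be preferred in real deployments:

```python
# Hypothetical example terms -- a real deployment would use a vetted policy
# and a purpose-built safety classifier, not a keyword list.
BLOCKED_TERMS = {"violence", "gore"}

def passes_guardrail(prompt: str) -> bool:
    """Return False if the prompt contains any blocked term (case-insensitive)."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)
```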
For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
### Plus Plus (++) Promise

* Characterized for technical limitations.
* Reviewed to ensure proper disclosure is accessible to, maintained for, and in compliance with NVIDIA data subjects and their requests.
* Reviewed before release.
* Tagged for known restrictions and potential safety implications.

### Bias

| Field | Response |
| :---- | :------- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | None |

### Explainability

Field | Response
:---- | :-------
Intended Application & Domain: | World Generation
Model Type: | Transformer
Intended Users: | Physical AI developers
Output: | Videos
Describe how the model works: | Generates videos based on video and text inputs
Technical Limitations: | The model may not follow the video or text input accurately in challenging cases, where the input video shows complex scene composition and temporal dynamics. Examples of challenging scenes include: fast camera movements, overlapping human-object interactions, low lighting with high motion blur, and multiple people performing different actions simultaneously.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Quantitative and qualitative evaluation. We evaluate on PAI-Bench's predict task and report two main scores: the Domain Score, which measures performance on domain-specific Physical AI tasks, and the Quality Score, which reflects the quality of generated videos. The Quality Score is derived from eight text-to-video and image-to-video metrics adapted from VBench, while the Domain Score is obtained through VQA-based evaluation across seven domains: av, common, human, industry, misc, physics, and robotics. The final PAI-Bench Overall Score is computed as the average of the Quality and Domain Scores.
Potential Known Risks: | The model's output can generate all forms of videos, including what may be considered toxic, offensive, or indecent.
Licensing: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Additional Information: [Apache License 2.0](https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE).

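The PAI-Bench aggregation described in the Performance Metrics row reduces to a simple mean of the two component scores. A sketch with made-up numbers (`pai_bench_overall` is a hypothetical helper, not part of PAI-Bench itself):

```python
def pai_bench_overall(quality_score: float, domain_score: float) -> float:
    """Overall PAI-Bench score: the average of the Quality and Domain Scores."""
    return (quality_score + domain_score) / 2.0

# e.g. pai_bench_overall(80.0, 70.0) -> 75.0
```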
### Privacy

Field | Response
:---- | :-------
Generatable or reverse engineerable personal data? | No
Personal data used to create this model? | None Known
Was consent obtained for any personal data used? | None Known
How often is dataset reviewed? | Before Release
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

### Safety

Field | Response
:---- | :-------
Model Application(s): | World Generation
Describe the life-critical impact (if present). | None Known
Use Case Restrictions: | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Additional Information: [Apache License 2.0](https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE).
Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. Model checkpoints are made available on Hugging Face and may become available on cloud providers' model catalogs.