Cosmos · Diffusers · nvidia · text2video · image2video · video2video

harrim-nv committed · Commit 9379e46 · verified · 1 Parent(s): af4223b

Update README.md

Files changed (1): README.md (+17 −33)

README.md CHANGED
@@ -166,30 +166,14 @@ This model is ready for commercial/non-commercial use.
 
  The Cosmos-Predict2.5 diffusion-based model family includes the following models:
 
- - Cosmos-Predict2.5-2B/ Pre-trained
+ - Cosmos-Predict2.5-14B/ Pre-trained
   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
   - Produces 720P video with 16FPS
 
- - Cosmos-Predict2.5-2B/ Post-trained
+ - Cosmos-Predict2.5-14B/ Post-trained
   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
   - Produces 720P video with 16FPS
 
- - Cosmos-Predict2.5-2B/ Auto/ Multiview
-   - Given a text description, an image as the first frame, and/or a video, predict the world scenario in 7 camera views.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Multiview
-   - Given a text description, a static video, and two target camera trajectories, predict two re-rendered videos.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Multiview-Agibot
-   - Given a text description, a head-view video, and two target hand-view camera trajectories, predict two head-view videos.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Action-Cond
-   - Given an image as the first frame and a robot action sequence as condition, predict the future frames.
-   - Produces 256p video with 4FPS
-
  ### License
 
  This model is released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Additional Information: [Apache License 2.0](https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE).

@@ -216,17 +200,17 @@ Physical AI: encompassing robotics, autonomous vehicles (AV), and more.
 
  ### Release Date:
 
- Github [10/06/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
+ Github [12/04/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
 
- Hugging Face [10/06/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25-68bb63255f2fc206c5e5b346
+ Hugging Face [12/04/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25
 
  ## Model Architecture
 
- Cosmos-Predict2.5-2B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmentation noise is added to the conditional latent frames to bridge the training and inference gap.
+ Cosmos-Predict2.5-14B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmentation noise is added to the conditional latent frames to bridge the training and inference gap.
 
- **This model was developed based on:** [Cosmos-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
+ **This model was developed based on:** [Cosmos-Predict2-14B](https://huggingface.co/nvidia/Cosmos-Predict2-14B-Video2World)
 
- **Number of model parameters:** 2,059,174,912
+ **Number of model parameters:** 14,000,000,000
 
  ## Input/Output Specifications
 

@@ -307,18 +291,18 @@ Please see our [technical paper](https://research.nvidia.com/publication/2025-09
 
  **System Requirements and Performance**
 
- Video2World (720p, 16FPS): This model requires 32.54 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:
+ Video2World (720p, 16FPS): This model requires 56.38 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:
 
  | GPU Hardware           | Inference Runtime |
  | ---------------------- | ----------------- |
- | H100 SXM               | 228.8 s           |
- | H200 SXM               | 221.7 s           |
- | B200                   | 123.9 s           |
- | H100 NVL               | 355.7 s           |
- | H100 PCIe              | 378.5 s           |
- | H200 NVL               | 267.2 s           |
- | L40S                   | 2567.1 s          |
- | RTX PRO 6000 Blackwell | 452.2 s           |
+ | H100 SXM               | 856.9 s           |
+ | H200 SXM               | 836.9 s           |
+ | B200                   | 439.4 s           |
+ | H100 NVL               | 1348.6 s          |
+ | H100 PCIe              | 1425.4 s          |
+ | H200 NVL               | 1006.7 s          |
+ | L40S                   | OOM               |
+ | RTX PRO 6000 Blackwell | 1700.3 s          |
 
  **Operating System(s):**
  * Linux (We have not tested on other operating systems.)

@@ -337,7 +321,7 @@ Despite various improvements in world generation for Physical AI, Cosmos-Predict
  ## Inference:
  **Acceleration Engine**: [PyTorch](https://pytorch.org/), [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)
 
- **Test Hardware:** H100, A100, GB200
+ **Test Hardware:** H100, A100, B200
 
  ## Ethical Considerations
 
327