Cosmos · Diffusers · nvidia · text2video · image2video · video2video

harrim-nv committed · Commit 9379e46 · verified · 1 Parent(s): af4223b

Update README.md

Files changed (1): README.md (+17 −33)

README.md CHANGED
@@ -166,30 +166,14 @@ This model is ready for commercial/non-commercial use.
 
  The Cosmos-Predict2.5 diffusion-based model family includes the following models:
 
- - Cosmos-Predict2.5-2B/ Pre-trained
+ - Cosmos-Predict2.5-14B/ Pre-trained
   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
   - Produces 720P video with 16FPS
 
- - Cosmos-Predict2.5-2B/ Post-trained
+ - Cosmos-Predict2.5-14B/ Post-trained
   - Given a text description, an image as the first frame, and/or a video, predict the future frames.
   - Produces 720P video with 16FPS
 
- - Cosmos-Predict2.5-2B/ Auto/ Multiview
-   - Given a text description, an image as the first frame, and/or a video, predict the world scenario in 7 camera views.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Multiview
-   - Given a text description, a static video, and two target camera trajectories, predict two re-rendered videos.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Multiview-Agibot
-   - Given a text description, a head-view video, and two target hand-view camera trajectories, predict two head-view videos.
-   - Produces 720P video with 16FPS
-
- - Cosmos-Predict2.5-2B/ Robot/ Action-Cond
-   - Given an image as the first frame and a robot action sequence as condition, predict the future frames.
-   - Produces 256p video with 4FPS
-
  ### License
 
  This model is released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license). Additional Information: [Apache License 2.0](https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B/blob/main/LICENSE).

@@ -216,17 +200,17 @@ Physical AI: encompassing robotics, autonomous vehicles (AV), and more.
 
  ### Release Date:
 
- Github [10/06/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
+ Github [12/04/2025] via https://github.com/nvidia-cosmos/cosmos-predict2.5
 
- Hugging Face [10/06/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25-68bb63255f2fc206c5e5b346
+ Hugging Face [12/04/2025] via https://huggingface.co/collections/nvidia/cosmos-predict25
 
  ## Model Architecture
 
- Cosmos-Predict2.5-2B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmentation noise is added to the conditional latent frames to bridge the training and inference gap.
+ Cosmos-Predict2.5-14B is a diffusion transformer model designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the time information for denoising. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmentation noise is added to the conditional latent frames to bridge the training and inference gap.
 
- **This model was developed based on:** [Cosmos-Predict2-2B](https://huggingface.co/nvidia/Cosmos-Predict2-2B-Video2World)
+ **This model was developed based on:** [Cosmos-Predict2-14B](https://huggingface.co/nvidia/Cosmos-Predict2-14B-Video2World)
 
- **Number of model parameters:** 2,059,174,912
+ **Number of model parameters:** 14,000,000,000
 
  ## Input/Output Specifications
 

@@ -307,18 +291,18 @@ Please see our [technical paper](https://research.nvidia.com/publication/2025-09
 
  **System Requirements and Performance**
 
- Video2World (720p, 16FPS): This model requires 32.54 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:
+ Video2World (720p, 16FPS): This model requires 56.38 GB of GPU VRAM. The following table shows inference time for a single generation across different NVIDIA GPU hardware:
 
  | GPU Hardware           | Inference Runtime |
  | ---------------------- | ----------------- |
- | H100 SXM               | 228.8 s           |
- | H200 SXM               | 221.7 s           |
- | B200                   | 123.9 s           |
- | H100 NVL               | 355.7 s           |
- | H100 PCIe              | 378.5 s           |
- | H200 NVL               | 267.2 s           |
- | L40S                   | 2567.1 s          |
- | RTX PRO 6000 Blackwell | 452.2 s           |
+ | H100 SXM               | 856.9 s           |
+ | H200 SXM               | 836.9 s           |
+ | B200                   | 439.4 s           |
+ | H100 NVL               | 1348.6 s          |
+ | H100 PCIe              | 1425.4 s          |
+ | H200 NVL               | 1006.7 s          |
+ | L40S                   | OOM               |
+ | RTX PRO 6000 Blackwell | 1700.3 s          |
 
  **Operating System(s):**
  * Linux (We have not tested on other operating systems.)

@@ -337,7 +321,7 @@ Despite various improvements in world generation for Physical AI, Cosmos-Predict
  ## Inference:
  **Acceleration Engine**: [PyTorch](https://pytorch.org/), [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)
 
- **Test Hardware:** H100, A100, GB200
+ **Test Hardware:** H100, A100, B200
 
  ## Ethical Considerations
 
327