# Cosmos-Predict2.5-2B (Diffusers Format)

This is the NVIDIA Cosmos Predict 2.5 2B model in a Diffusers-compatible format for use with FastVideo.

## Model Components

This model consists of the following components:

### 1. Transformer (DiT)

- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, 128 head dim
- **Parameters**: ~2B
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2) across the temporal and spatial dimensions
- **Features**:
  - AdaLN-LoRA conditioning (dim=256)
  - RoPE positional embeddings with 3D scaling
  - Cross-attention projection for text conditioning
  - RMS normalization for Q/K
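As a sanity check on the ~2B figure, the hidden size and a rough parameter count follow from the numbers above. This is a back-of-envelope sketch only; the MLP ratio and the cross-attention projection shapes are assumptions, not values read from the config:

```python
# Back-of-envelope parameter estimate from the architecture figures above.
# Assumptions (not from the config): MLP ratio of 4, cross-attention
# projections the same shape as self-attention, biases/norms/embeddings ignored.
layers = 28
heads = 16
head_dim = 128
d = heads * head_dim  # hidden size: 16 * 128 = 2048

self_attn = 4 * d * d   # Q, K, V, and output projections
cross_attn = 4 * d * d  # assumed same shape as self-attention
mlp = 2 * d * (4 * d)   # up- and down-projection with assumed ratio 4
per_layer = self_attn + cross_attn + mlp

total = layers * per_layer
print(d, total)  # 2048, ~1.9e9 -- consistent with the ~2B figure
```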
### 2. VAE (Wan2.1)

- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
  - Feature caching for efficiency
  - Configurable tiling support
  - Output clipped to [-1, 1]
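The compression factors above determine the latent shape for a given video. A minimal sketch, assuming the usual causal-video convention in which the first frame is encoded alone, so latent frames = (frames - 1) // 4 + 1:

```python
# Latent shape implied by the VAE's 16 channels and 8x spatial / 4x temporal
# compression. Assumption: first frame encoded alone (frames = 4k + 1).
def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int, int]:
    return (16, (frames - 1) // 4 + 1, height // 8, width // 8)

print(latent_shape(121, 480, 832))  # (16, 31, 60, 104)
```

With the transformer's (1, 2, 2) patch size, this latent would be split into 31 x 30 x 52 = 48,360 tokens.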
### 3. Scheduler

- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: multi-step flow-matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
  - Training timesteps: 1000
  - Shift: 1
  - No dynamic shifting
  - Solver type: bh2 (recommended for >10 steps)
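For reference, the configuration above would correspond to a `scheduler_config.json` roughly like the following. The field names here follow common diffusers scheduler conventions and are an assumption; check the actual file in `scheduler/` for the authoritative keys:

```json
{
  "_class_name": "FlowUniPCMultistepScheduler",
  "num_train_timesteps": 1000,
  "shift": 1,
  "use_dynamic_shifting": false,
  "solver_order": 2,
  "solver_type": "bh2"
}
```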
### 4. Text Encoder & Tokenizer

- **Note**: the text encoder and tokenizer are not included in this directory
- **Official implementation**: uses Reason1 or the official TextEncoder from `cosmos_predict2`
- **Expected format**: text embeddings with shape (batch, 512, 100352)
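Since the encoder ships separately, a lightweight check at the pipeline boundary can catch mismatched embeddings early. A hypothetical helper (not part of FastVideo) enforcing the shape contract stated above:

```python
# Hypothetical guard for the embedding contract: (batch, 512, 100352).
EXPECTED_SEQ_LEN = 512
EXPECTED_EMBED_DIM = 100352

def check_text_embeddings(shape: tuple[int, ...]) -> None:
    """Raise if an embedding tensor shape does not match the expected contract."""
    if len(shape) != 3 or shape[1:] != (EXPECTED_SEQ_LEN, EXPECTED_EMBED_DIM):
        raise ValueError(
            f"expected (batch, {EXPECTED_SEQ_LEN}, {EXPECTED_EMBED_DIM}), got {shape}"
        )

check_text_embeddings((2, 512, 100352))  # passes silently
```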
## Directory Structure

```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json          # Pipeline component registry
├── README.md                 # This file
├── transformer/
│   ├── config.json           # Transformer configuration
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
├── vae/
│   ├── config.json           # VAE configuration
│   └── tokenizer.pth         # VAE weights
└── scheduler/
    └── scheduler_config.json # Scheduler configuration
```
## Usage with FastVideo

### Option 1: Using FastVideo Pipeline (Recommended)

```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```
### Option 2: Manual Component Loading

```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="transformer",
)
vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="vae",
)
scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="scheduler",
)
```
## Key Differences from Official

1. **Scheduler**: uses `FlowUniPCMultistepScheduler` (multi-step), matching the official Cosmos 2.5 implementation, NOT the `FlowMatchEulerDiscreteScheduler` (single-step) used in some FastVideo examples.
2. **Weight format**: FastVideo-compatible weight format with proper key mapping.
3. **Configuration**: all hyperparameters match the official Cosmos 2.5 2B model.
## Inference Parameters

Recommended settings for best quality:

- **Resolution**: 480x832 (height and width should be multiples of 16)
- **Frames**: 121 (or any compatible length)
- **Steps**: 35 (with the UniPC scheduler)
- **Guidance scale**: 7.0
- **Scheduler shift**: 5.0 (dynamic, applied during inference)
- **FPS**: 24.0
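The resolution and frame-count constraints above are easy to enforce programmatically. A hypothetical helper (not part of FastVideo), assuming the frames = 4k + 1 convention implied by the VAE's 4x temporal compression:

```python
# Hypothetical helpers for the recommendations above: height/width rounded
# down to multiples of 16, frame count of the form 4k + 1.
def snap_resolution(height: int, width: int) -> tuple[int, int]:
    """Round each dimension down to the nearest multiple of 16."""
    return (height // 16) * 16, (width // 16) * 16

def is_valid_frame_count(frames: int) -> bool:
    """Check compatibility with 4x temporal compression (frames = 4k + 1)."""
    return frames >= 1 and (frames - 1) % 4 == 0

print(snap_resolution(480, 832))  # (480, 832) -- already aligned
print(is_valid_frame_count(121))  # True: 121 = 4 * 30 + 1
```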
## Model Information

- **Model size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: trained for video prediction/generation
- **License**: see NVIDIA's official license for Cosmos models
## Citation

If you use this model, please cite:

```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```
## Notes

1. This is a Diffusers-compatible format, but it uses FastVideo classes, not standard Diffusers classes.
2. The text encoder component must be loaded separately from the official `cosmos_predict2` package.
3. For best results, use the same scheduler (`FlowUniPCMultistepScheduler`) that the official model uses.
4. The model expects text embeddings of shape (batch, 512, 100352); make sure your text encoder produces this format.