# Cosmos-Predict2.5-2B (Diffusers Format)

This is the NVIDIA Cosmos Predict 2.5 2B model in Diffusers-compatible format for use with FastVideo.

## Model Components

This model consists of the following components:

### 1. Transformer (DiT)

- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, 128 head dim
- **Parameters**: ~2B
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2) for temporal and spatial dimensions
- **Features**:
  - AdaLN-LoRA conditioning (dim=256)
  - RoPE positional embeddings with 3D scaling
  - Cross-attention projection for text conditioning
  - RMS normalization for Q/K

### 2. VAE (Wan2.1)

- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
  - Feature caching for efficiency
  - Configurable tiling support
  - Output clipped to [-1, 1]

### 3. Scheduler

- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: Multi-step flow-matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
  - Training timesteps: 1000
  - Shift: 1
  - No dynamic shifting
  - Solver type: bh2 (recommended for >10 steps)
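The compression ratios and patch size above determine the latent shape and the number of tokens the transformer processes. A minimal sketch of that arithmetic, assuming the Wan-style causal-VAE convention that latent frames = (num_frames - 1) / 4 + 1 (the helper names are illustrative, not part of FastVideo):

```python
def latent_shape(num_frames, height, width,
                 spatial_ratio=8, temporal_ratio=4, latent_channels=16):
    # Assumed causal-VAE convention: the first frame is encoded alone,
    # so latent frames = (num_frames - 1) // temporal_ratio + 1.
    f = (num_frames - 1) // temporal_ratio + 1
    h, w = height // spatial_ratio, width // spatial_ratio
    return (latent_channels, f, h, w)

def token_count(latent, patch=(1, 2, 2)):
    # Patch size (1, 2, 2): no temporal patching, 2x2 spatial patches.
    _, f, h, w = latent
    pf, ph, pw = patch
    return (f // pf) * (h // ph) * (w // pw)

lat = latent_shape(121, 480, 832)
print(lat, token_count(lat))  # -> (16, 31, 60, 104) 48360
```

At the recommended 121 frames x 480x832, this gives a 31x60x104 latent and 48,360 tokens per sample, which is why resolutions should stay multiples of 16 (8x VAE downsampling times the 2x2 spatial patch).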
### 4. Text Encoder & Tokenizer

- **Note**: Text encoder and tokenizer are not included in this directory
- **Official Implementation**: Uses Reason1 or the official TextEncoder from `cosmos_predict2`
- **Expected format**: Text embeddings with shape (batch, 512, 100352)

## Directory Structure

```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json                 # Pipeline component registry
├── README.md                        # This file
├── transformer/
│   ├── config.json                  # Transformer configuration
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
├── vae/
│   ├── config.json                  # VAE configuration
│   └── tokenizer.pth                # VAE weights
└── scheduler/
    └── scheduler_config.json        # Scheduler configuration
```

## Usage with FastVideo

### Option 1: Using FastVideo Pipeline (Recommended)

```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```

### Option 2: Manual Component Loading

```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers", subfolder="transformer"
)
vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers", subfolder="vae"
)
scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers", subfolder="scheduler"
)
```
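The scheduler is stored with shift 1, while inference applies a shift of 5.0 (see Inference Parameters). To see what that does to the noise schedule, here is a sketch of the timestep-shift transform commonly used by Diffusers-style flow-matching schedulers — an assumption about the general technique, not a verified trace of `FlowUniPCMultistepScheduler` internals:

```python
def shift_sigma(sigma: float, shift: float = 5.0) -> float:
    # Common flow-matching shift: sigma' = s*sigma / (1 + (s-1)*sigma).
    # shift > 1 warps the schedule toward high noise levels, where
    # large-scale video structure is decided.
    return shift * sigma / (1 + (shift - 1) * sigma)

# 35 evenly spaced sigmas from 1.0 down toward 0, as with num_inference_steps=35
sigmas = [i / 35 for i in range(35, 0, -1)]
shifted = [shift_sigma(s) for s in sigmas]
print(f"{shifted[0]:.3f} -> {shifted[-1]:.3f}")  # 1.000 -> 0.128
```

With shift > 1 every intermediate sigma is pushed upward (the endpoints 0 and 1 are fixed points), so more of the 35 steps land in the high-noise regime.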
## Key Differences from Official

1. **Scheduler**: This model uses `FlowUniPCMultistepScheduler` (multi-step), which matches the official Cosmos 2.5 implementation, NOT `FlowMatchEulerDiscreteScheduler` (single-step) used in some FastVideo examples.
2. **Weight Format**: Uses a FastVideo-compatible weight format with proper key mapping.
3. **Configuration**: All hyperparameters match the official Cosmos 2.5 2B model.

## Inference Parameters

Recommended settings for best quality:

- **Resolution**: 480x832 (or multiples of 16)
- **Frames**: 121 (or any compatible length)
- **Steps**: 35 (with the UniPC scheduler)
- **Guidance Scale**: 7.0
- **Scheduler Shift**: 5.0 (dynamic, applied during inference)
- **FPS**: 24.0

## Model Information

- **Model Size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: Trained for video prediction/generation
- **License**: Check NVIDIA's official license for Cosmos models

## Citation

If you use this model, please cite:

```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```

## Notes

1. This is a Diffusers-compatible format but uses FastVideo classes, not standard Diffusers classes.
2. The text encoder component must be loaded separately from the official `cosmos_predict2` package.
3. For best results, use the same scheduler (`FlowUniPCMultistepScheduler`) that the official model uses.
4. The model expects text embeddings of shape (batch, 512, 100352); make sure your text encoder produces this format.
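Since the text encoder ships separately, the shape requirement in note 4 is easy to get wrong; it can be checked up front before launching an expensive generation run. A minimal sketch (the helper is illustrative, not a FastVideo API):

```python
EXPECTED_SEQ_LEN, EXPECTED_DIM = 512, 100352  # from "Expected format" above

def check_text_embeddings(shape: tuple) -> None:
    # Fail fast if the text encoder's output will not fit the
    # transformer's cross-attention conditioning.
    if len(shape) != 3 or shape[1:] != (EXPECTED_SEQ_LEN, EXPECTED_DIM):
        raise ValueError(
            f"expected (batch, {EXPECTED_SEQ_LEN}, {EXPECTED_DIM}), got {shape}"
        )

check_text_embeddings((1, 512, 100352))  # passes silently
```

Call this on `embeddings.shape` right after encoding the prompt, before handing the tensor to the pipeline.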