# Cosmos-Predict2.5-2B (Diffusers Format)
This is the NVIDIA Cosmos Predict 2.5 2B model in a Diffusers-compatible format for use with FastVideo.
## Model Components
This model consists of the following components:
### 1. Transformer (DiT)
- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, 128 head dim
- **Parameters**: ~2B parameters
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2) for temporal and spatial dimensions
- **Features**:
- AdaLN-LoRA conditioning (dim=256)
- RoPE positional embeddings with 3D scaling
- Cross-attention projection for text conditioning
- RMS normalization for Q/K
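
To make the patch size concrete, the DiT sequence length can be estimated from the latent dimensions and the (1, 2, 2) patch size. This is a sketch that assumes the latent dims divide evenly by the patch size; the real model may pad:

```python
def num_tokens(latent_t, latent_h, latent_w, patch=(1, 2, 2)):
    """Estimate the DiT sequence length from latent dims and patch size.

    Assumes the latent dims divide evenly by the patch size; the actual
    model may pad, so treat this as an approximation.
    """
    pt, ph, pw = patch
    return (latent_t // pt) * (latent_h // ph) * (latent_w // pw)

# Example: a 31x60x104 latent volume (see the VAE compression factors)
print(num_tokens(31, 60, 104))  # 31 * 30 * 52 = 48360
```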
### 2. VAE (Wan2.1)
- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
- Feature caching for efficiency
- Configurable tiling support
- Clip output to [-1, 1]
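
Given the 8x spatial and 4x temporal compression, the latent shape for an input video can be sketched as follows. The `- 1` assumes the causal-VAE convention (the first frame maps to one latent frame on its own, as Wan2.1 does); verify against the actual VAE if exactness matters:

```python
def latent_shape(num_frames, height, width,
                 temporal_factor=4, spatial_factor=8, latent_channels=16):
    """Approximate latent tensor shape (C, T, H, W) for the Wan2.1 VAE.

    Assumes the causal-VAE convention: the first frame maps to one latent
    frame, and each subsequent group of `temporal_factor` frames adds one more.
    """
    t = (num_frames - 1) // temporal_factor + 1
    return (latent_channels, t, height // spatial_factor, width // spatial_factor)

# Recommended 121-frame, 480x832 input
print(latent_shape(121, 480, 832))  # (16, 31, 60, 104)
```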
### 3. Scheduler
- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: Multi-step flow matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
- Training timesteps: 1000
- Shift: 1
- No dynamic shifting
- Solver type: bh2 (recommended for >10 steps)
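
The settings above would correspond to a `scheduler/scheduler_config.json` along these lines (the field names follow common Diffusers conventions and are an assumption; the shipped config file is authoritative):

```json
{
  "_class_name": "FlowUniPCMultistepScheduler",
  "num_train_timesteps": 1000,
  "shift": 1,
  "use_dynamic_shifting": false,
  "solver_order": 2,
  "solver_type": "bh2"
}
```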
### 4. Text Encoder & Tokenizer
- **Note**: The text encoder and tokenizer are not included in this directory
- **Official Implementation**: Uses Reason1 or the official TextEncoder from `cosmos_predict2`
- **Expected format**: Text embeddings with shape (batch, 512, 100352)
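
Since the pipeline expects embeddings of shape (batch, 512, 100352), a quick sanity check after running your text encoder can catch wiring mistakes early. This is a hypothetical helper, not part of FastVideo:

```python
def check_text_embedding_shape(shape, seq_len=512, embed_dim=100352):
    """Validate a (batch, seq_len, embed_dim) text-embedding shape tuple.

    Hypothetical helper: checks the shape the Cosmos 2.5 pipeline expects.
    """
    if len(shape) != 3 or shape[1] != seq_len or shape[2] != embed_dim:
        raise ValueError(
            f"Expected (batch, {seq_len}, {embed_dim}), got {tuple(shape)}"
        )

check_text_embedding_shape((1, 512, 100352))  # passes silently
```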
## Directory Structure
```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
β”œβ”€β”€ model_index.json           # Pipeline component registry
β”œβ”€β”€ README.md                  # This file
β”œβ”€β”€ transformer/
β”‚   β”œβ”€β”€ config.json            # Transformer configuration
β”‚   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
β”œβ”€β”€ vae/
β”‚   β”œβ”€β”€ config.json            # VAE configuration
β”‚   └── tokenizer.pth          # VAE weights
└── scheduler/
    └── scheduler_config.json  # Scheduler configuration
```
## Usage with FastVideo
### Option 1: Using FastVideo Pipeline (Recommended)
```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```
### Option 2: Manual Component Loading
```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="transformer",
)
vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="vae",
)
scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="scheduler",
)
```
## Key Differences from Official
1. **Scheduler**: This model uses `FlowUniPCMultistepScheduler` (multi-step) which matches the official Cosmos 2.5 implementation, NOT `FlowMatchEulerDiscreteScheduler` (single-step) used in some FastVideo examples.
2. **Weight Format**: Uses FastVideo-compatible weight format with proper key mapping.
3. **Configuration**: All hyperparameters match the official Cosmos 2.5 2B model.
## Inference Parameters
Recommended settings for best quality:
- **Resolution**: 480x832 (or multiples of 16)
- **Frames**: 121 (or any compatible length)
- **Steps**: 35 (with UniPC scheduler)
- **Guidance Scale**: 7.0
- **Scheduler Shift**: 5.0 (set at inference time, overriding the config default of 1)
- **FPS**: 24.0
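
At 24 fps, the recommended 121-frame clip comes out to roughly five seconds. The helper below computes clip duration; whether the first frame counts toward duration is a convention, so the `- 1` is an assumption:

```python
def clip_duration_seconds(num_frames, fps=24.0):
    """Duration of a clip where frame 0 sits at t=0 (num_frames - 1 intervals)."""
    return (num_frames - 1) / fps

print(clip_duration_seconds(121))  # 5.0
```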
## Model Information
- **Model Size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: Trained for video prediction/generation
- **License**: Check NVIDIA's official license for Cosmos models
## Citation
If you use this model, please cite:
```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```
## Notes
1. This repository follows the Diffusers directory layout but uses FastVideo classes, not standard Diffusers classes.
2. The text encoder component needs to be loaded separately from the official cosmos_predict2 package.
3. For best results, use the same scheduler (FlowUniPCMultistepScheduler) that the official model uses.
4. The model expects text embeddings in the shape (batch, 512, 100352) - make sure your text encoder produces this format.