# Cosmos-Predict2.5-2B (Diffusers Format)

This is the NVIDIA Cosmos Predict 2.5 2B model in Diffusers-compatible format for use with FastVideo.

## Model Components

This model consists of the following components:

### 1. Transformer (DiT)
- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, head dimension 128 (hidden size 16 × 128 = 2048)
- **Parameters**: ~2B
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2), i.e. 1 along the temporal axis and 2x2 spatially
- **Features**: 
  - AdaLN-LoRA conditioning (dim=256)
  - RoPE positional embeddings with 3D scaling
  - Cross-attention projection for text conditioning
  - RMS normalization for Q/K
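
To illustrate the last feature, here is a minimal pure-Python sketch of RMS normalization applied to a single query or key vector of one attention head. The function name, the `eps` value, and the optional per-channel `weight` are assumptions for illustration; the actual layer inside `Cosmos25Transformer3DModel` may differ in detail.

```python
import math


def rms_norm(vec, weight=None, eps=1e-6):
    """Scale `vec` so its root-mean-square is ~1, then apply an optional gain.

    Hypothetical helper: `eps` and `weight` are illustrative, not taken
    from the model config.
    """
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(vec)
    return [x * inv_rms * w for x, w in zip(vec, weight)]


head_dim = 128  # head dimension listed above
query = [0.01 * (i - 64) for i in range(head_dim)]  # toy query vector
normed = rms_norm(query)

# After normalization the RMS of the vector is approximately 1.
rms_after = math.sqrt(sum(x * x for x in normed) / head_dim)
print(round(rms_after, 3))  # 1.0
```

Normalizing Q and K this way keeps the attention logits in a bounded range regardless of the scale of the incoming activations.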

### 2. VAE (Wan2.1)
- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
  - Feature caching for efficiency
  - Configurable tiling support
  - Clip output to [-1, 1]
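
Combining the VAE compression factors with the transformer patch size gives the latent and token shapes the transformer sees. The sketch below works this out in plain Python. The function names are hypothetical, and the `(T - 1) // 4 + 1` frame rule assumes the causal temporal layout commonly used by Wan-style VAEs (the first frame is encoded on its own), which this README does not state explicitly.

```python
def latent_shape(height, width, num_frames, latent_channels=16,
                 spatial_factor=8, temporal_factor=4):
    """Return (channels, frames, height, width) of the VAE latent.

    Assumption: causal temporal compression, so the first frame maps to
    one latent frame and every further `temporal_factor` frames add one more.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by the spatial factor")
    latent_frames = (num_frames - 1) // temporal_factor + 1
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


def token_count(latent, patch_size=(1, 2, 2)):
    """Number of transformer tokens after patchifying a latent."""
    _, t, h, w = latent
    pt, ph, pw = patch_size
    return (t // pt) * (h // ph) * (w // pw)


latent = latent_shape(480, 832, 121)
print(latent)               # (16, 31, 60, 104)
print(token_count(latent))  # 48360
```

Under these assumptions, a 121-frame 480x832 clip becomes 31 latent frames of 60x104, which the (1, 2, 2) patching turns into 31 × 30 × 52 tokens.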

### 3. Scheduler
- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: Multi-step flow matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
  - Training timesteps: 1000
  - Shift: 1
  - No dynamic shifting
  - Solver type: bh2 (recommended for >10 steps)
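
The effect of the `shift` setting can be illustrated with the time-shift formula commonly used by flow-matching schedulers, `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`. This sketch assumes `FlowUniPCMultistepScheduler` applies the same formula, which this README does not confirm; `shift_sigma` is a hypothetical helper, not part of FastVideo.

```python
def shift_sigma(sigma, shift):
    """Apply the flow-matching time shift to a noise level in [0, 1].

    Assumption: the scheduler uses this standard formula.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


sigmas = [1.0, 0.75, 0.5, 0.25]

# shift = 1 leaves the schedule unchanged.
print([round(shift_sigma(s, 1.0), 4) for s in sigmas])  # [1.0, 0.75, 0.5, 0.25]

# A larger shift pushes intermediate noise levels toward 1.
print([round(shift_sigma(s, 5.0), 4) for s in sigmas])  # [1.0, 0.9375, 0.8333, 0.625]
```

With a shift above 1, more of the sampling steps are spent at high noise levels, while the endpoints 0 and 1 stay fixed.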

### 4. Text Encoder & Tokenizer
- **Note**: Text encoder and tokenizer are not included in this directory
- **Official Implementation**: Uses Reason1 or official TextEncoder from `cosmos_predict2`
- **Expected format**: Text embeddings with shape (batch, 512, 100352)
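
Because the text encoder is loaded separately, it can help to check the embedding shape before passing it to the transformer. The following is a small sketch in plain Python; both helpers are hypothetical, and the 2 bytes per element figure assumes the embeddings are stored in BFloat16, which this README states only for the model weights.

```python
EXPECTED_SEQ_LEN = 512
EXPECTED_EMBED_DIM = 100352


def check_text_embedding_shape(shape):
    """Return the batch size, or raise if `shape` is not (batch, 512, 100352)."""
    if (len(shape) != 3
            or shape[1] != EXPECTED_SEQ_LEN
            or shape[2] != EXPECTED_EMBED_DIM):
        raise ValueError(f"expected (batch, 512, 100352), got {tuple(shape)}")
    return shape[0]


def embedding_bytes(batch, bytes_per_element=2):
    """Memory footprint of the embeddings (assumes 2-byte elements, e.g. bf16)."""
    return batch * EXPECTED_SEQ_LEN * EXPECTED_EMBED_DIM * bytes_per_element


batch = check_text_embedding_shape((1, 512, 100352))
print(embedding_bytes(batch) / 2**20)  # 98.0 (MiB per sample)
```

At roughly 98 MiB per prompt under this assumption, the embeddings are large enough that batch size has a noticeable effect on memory use.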

## Directory Structure

```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json              # Pipeline component registry
├── README.md                     # This file
├── transformer/
│   ├── config.json              # Transformer configuration
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
├── vae/
│   ├── config.json              # VAE configuration
│   └── tokenizer.pth            # VAE weights
└── scheduler/
    └── scheduler_config.json    # Scheduler configuration
```
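
A quick way to confirm that a local copy matches this layout is to check for each expected file. The sketch below uses only the standard library; `missing_files` is a hypothetical helper, and the file list is copied from the tree above.

```python
from pathlib import Path

# Relative paths taken from the directory tree in this README.
EXPECTED_FILES = [
    "model_index.json",
    "transformer/config.json",
    "transformer/81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt",
    "vae/config.json",
    "vae/tokenizer.pth",
    "scheduler/scheduler_config.json",
]


def missing_files(model_dir):
    """Return the expected files that are absent under `model_dir`."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files("path/to/models--nvidia--Cosmos-Predict2.5-2B-Diffusers")` returns an empty list when every component is in place.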

## Usage with FastVideo

### Option 1: Using FastVideo Pipeline (Recommended)

```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```

### Option 2: Manual Component Loading

```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="transformer"
)

vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="vae"
)

scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="scheduler"
)
```

## Key Differences from the Official Implementation

1. **Scheduler**: This model uses the multi-step `FlowUniPCMultistepScheduler`, which matches the official Cosmos 2.5 implementation. It does not use the single-step `FlowMatchEulerDiscreteScheduler` that appears in some FastVideo examples.

2. **Weight Format**: Weights are stored in a FastVideo-compatible format, with keys mapped to the names FastVideo expects.

3. **Configuration**: All hyperparameters match the official Cosmos 2.5 2B model.

## Inference Parameters

Recommended settings for best quality:

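- **Resolution**: 480x832 (height x width); for other sizes, use multiples of 16 in both dimensions
- **Frames**: 121 (other lengths must be compatible with the VAE's 4x temporal compression)
- **Steps**: 35 (with UniPC scheduler)
- **Guidance Scale**: 7.0
- **Scheduler Shift**: 5.0 (applied during inference; the stored scheduler config uses shift 1 with dynamic shifting disabled)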
- **FPS**: 24.0
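
The multiple-of-16 rule is consistent with the 8x spatial VAE compression combined with the 2x2 spatial patch size (8 × 2 = 16). A small pre-flight check along these lines can catch invalid requests before any GPU work starts; `check_request` is a hypothetical helper, not part of FastVideo.

```python
def check_request(height, width, num_frames, fps=24.0, multiple=16):
    """Validate a generation request and return the clip duration in seconds."""
    if height % multiple or width % multiple:
        raise ValueError(
            f"height and width must be multiples of {multiple}, got {height}x{width}"
        )
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be positive")
    return num_frames / fps


# Recommended settings from the list above.
print(round(check_request(480, 832, 121), 2))  # 5.04
```

At the recommended settings, a generated clip is about five seconds long.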

## Model Information

- **Model Size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: Trained for video prediction/generation
- **License**: Check NVIDIA's official license for Cosmos models

## Citation

If you use this model, please cite:

```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```

## Notes

1. This is a Diffusers-compatible layout, but it is loaded with FastVideo classes rather than standard Diffusers classes.
2. The text encoder must be loaded separately from the official `cosmos_predict2` package.
3. For best results, use the same scheduler as the official model (`FlowUniPCMultistepScheduler`).
4. The model expects text embeddings of shape (batch, 512, 100352). Make sure your text encoder produces this format.