# Cosmos-Predict2.5-2B (Diffusers Format)

This repository contains the NVIDIA Cosmos Predict 2.5 2B model in a Diffusers-compatible layout for use with FastVideo.

## Model Components

This model consists of the following components:

### 1. Transformer (DiT)
- **Class**: `Cosmos25Transformer3DModel`
- **Architecture**: 28 layers, 16 attention heads, 128 head dim
- **Parameters**: ~2B parameters
- **Input channels**: 16 (latent space)
- **Patch size**: (1, 2, 2) for temporal and spatial dimensions
- **Features**: 
  - AdaLN-LoRA conditioning (dim=256)
  - RoPE positional embeddings with 3D scaling
  - Cross-attention projection for text conditioning
  - RMS normalization for Q/K
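
As a rough sanity check on the figures above, the sketch below derives the model width and the number of tokens the transformer sees for a given latent size. It assumes the common convention that the hidden size equals heads times head dim, and that patchification simply divides each latent axis by the patch size. The function names are hypothetical and are not part of FastVideo.

```python
from math import prod

NUM_HEADS = 16
HEAD_DIM = 128
PATCH_SIZE = (1, 2, 2)  # (temporal, height, width), as listed above


def hidden_size(num_heads: int = NUM_HEADS, head_dim: int = HEAD_DIM) -> int:
    """Model width, assuming width = heads * head_dim."""
    return num_heads * head_dim


def num_tokens(latent_thw, patch_size=PATCH_SIZE) -> int:
    """Token count after patchifying a latent of shape (frames, height, width)."""
    for dim, patch in zip(latent_thw, patch_size):
        if dim % patch != 0:
            raise ValueError(f"latent dim {dim} is not divisible by patch {patch}")
    return prod(dim // patch for dim, patch in zip(latent_thw, patch_size))
```

For example, `num_tokens((31, 60, 104))` gives the sequence length for a latent with 31 frames of 60x104.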

### 2. VAE (Wan2.1)
- **Class**: `AutoencoderKLWan`
- **Latent channels**: 16
- **Compression**: 8x spatial, 4x temporal
- **Architecture**: 4-stage encoder/decoder with residual blocks
- **Features**:
  - Feature caching for efficiency
  - Configurable tiling support
  - Clip output to [-1, 1]
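
To make the compression factors concrete, here is a minimal sketch that maps a pixel-space video size to a latent shape. It assumes the VAE encodes the first frame on its own and then compresses every further group of 4 frames into one latent frame, so valid lengths have the form 4k+1; that temporal convention is an assumption here, not something this README states. `latent_shape` is a hypothetical helper.

```python
def latent_shape(num_frames, height, width,
                 latent_channels=16, spatial=8, temporal=4):
    """Return (channels, frames, height, width) of the latent tensor.

    Assumes 8x spatial compression and a causal 4x temporal compression
    where the first frame maps to its own latent frame.
    """
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of the spatial factor")
    if num_frames < 1 or (num_frames - 1) % temporal:
        raise ValueError("num_frames must have the form temporal * k + 1")
    return (
        latent_channels,
        1 + (num_frames - 1) // temporal,
        height // spatial,
        width // spatial,
    )
```

Under these assumptions, a 121-frame 480x832 video maps to a latent of shape `(16, 31, 60, 104)`.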

### 3. Scheduler
- **Class**: `FlowUniPCMultistepScheduler`
- **Type**: Multi-step flow matching solver (UniPC)
- **Order**: 2 (predictor-corrector)
- **Configuration**:
  - Training timesteps: 1000
  - Shift: 1
  - No dynamic shifting
  - Solver type: bh2 (recommended for >10 steps)
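
The effect of the shift value can be illustrated with the time-shift formula commonly used by flow-matching schedulers. The sketch below assumes this scheduler applies that same formula to a linearly spaced sigma schedule; it is an illustration, not the scheduler's actual code, and both function names are hypothetical.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Warp a noise level in [0, 1]; shift=1 leaves it unchanged."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float):
    """Linearly spaced sigmas from 1 towards 0, warped by the shift."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in sigmas]
```

With `shift=1` the schedule is untouched, while larger values keep sigmas closer to 1 for longer, spending more steps at high noise levels.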

### 4. Text Encoder & Tokenizer
- **Note**: Text encoder and tokenizer are not included in this directory
- **Official Implementation**: Uses Reason1 or official TextEncoder from `cosmos_predict2`
- **Expected format**: Text embeddings with shape (batch, 512, 100352)
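
Because the text encoder is loaded separately, it can help to validate its output before passing it to the transformer. The following is a minimal sketch of such a check, along with an estimate of the embedding's memory footprint in BFloat16. The helper names are hypothetical, and the check works on a plain shape tuple so that it does not depend on any tensor library.

```python
EXPECTED_SEQ_LEN = 512
EXPECTED_EMBED_DIM = 100352


def check_text_embedding_shape(shape) -> int:
    """Validate a (batch, 512, 100352) shape and return the batch size."""
    if len(shape) != 3:
        raise ValueError(f"expected 3 dimensions, got {len(shape)}")
    batch, seq_len, dim = shape
    if batch < 1:
        raise ValueError("batch size must be at least 1")
    if (seq_len, dim) != (EXPECTED_SEQ_LEN, EXPECTED_EMBED_DIM):
        raise ValueError(
            f"expected (batch, {EXPECTED_SEQ_LEN}, {EXPECTED_EMBED_DIM}), "
            f"got {tuple(shape)}"
        )
    return batch


def embedding_bytes(batch: int, bytes_per_element: int = 2) -> int:
    """Memory needed for the embeddings (2 bytes per element for BFloat16)."""
    return batch * EXPECTED_SEQ_LEN * EXPECTED_EMBED_DIM * bytes_per_element
```

With a tensor library, the call would look like `check_text_embedding_shape(tuple(embeddings.shape))`. At this size a single prompt's embeddings occupy about 98 MiB in BFloat16, which is worth keeping in mind when batching.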

## Directory Structure

```
models--nvidia--Cosmos-Predict2.5-2B-Diffusers/
├── model_index.json              # Pipeline component registry
├── README.md                     # This file
├── transformer/
│   ├── config.json              # Transformer configuration
│   └── 81edfebe-bd6a-4039-8c1d-737df1a790bf_ema_bf16.pt  # Model weights
├── vae/
│   ├── config.json              # VAE configuration
│   └── tokenizer.pth            # VAE weights
└── scheduler/
    └── scheduler_config.json    # Scheduler configuration
```
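
A quick way to confirm that a local copy matches this layout is to check for the expected files before loading anything. The sketch below uses only the standard library; the helper names are hypothetical, and the transformer weights are matched by extension because their file name contains a checkpoint-specific identifier.

```python
from pathlib import Path

# Relative paths taken from the directory listing above.
EXPECTED_FILES = (
    "model_index.json",
    "transformer/config.json",
    "vae/config.json",
    "vae/tokenizer.pth",
    "scheduler/scheduler_config.json",
)


def missing_files(model_dir):
    """Return the expected relative paths that are absent under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


def has_transformer_weights(model_dir) -> bool:
    """True if at least one .pt checkpoint sits in the transformer folder."""
    return any((Path(model_dir) / "transformer").glob("*.pt"))
```

An empty list from `missing_files(path)` together with `has_transformer_weights(path)` returning `True` suggests the directory is complete.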

## Usage with FastVideo

### Option 1: Using FastVideo Pipeline (Recommended)

```python
from fastvideo import FastVideoArgs
from fastvideo.pipelines.basic.cosmos.cosmos2_5_pipeline import Cosmos2_5Pipeline

# Initialize pipeline
args = FastVideoArgs.from_cli_args(model="nvidia/Cosmos-Predict2.5-2B-Diffusers")
pipeline = Cosmos2_5Pipeline(args)

# Generate video
output = pipeline(
    prompt="A robot welding in an industrial setting",
    height=480,
    width=832,
    num_frames=121,
    num_inference_steps=35,
    guidance_scale=7.0,
)
```

### Option 2: Manual Component Loading

```python
from fastvideo.models.dits.cosmos2_5 import Cosmos25Transformer3DModel
from fastvideo.models.vaes.wanvae import AutoencoderKLWan
from fastvideo.models.schedulers.scheduling_flow_unipc_multistep import FlowUniPCMultistepScheduler

# Load components
transformer = Cosmos25Transformer3DModel.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="transformer"
)

vae = AutoencoderKLWan.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="vae"
)

scheduler = FlowUniPCMultistepScheduler.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B-Diffusers",
    subfolder="scheduler"
)
```

## Key Differences from the Official Release

1. **Scheduler**: This model uses the multi-step `FlowUniPCMultistepScheduler`, which matches the official Cosmos 2.5 implementation, rather than the single-step `FlowMatchEulerDiscreteScheduler` used in some FastVideo examples.

2. **Weight Format**: Uses FastVideo-compatible weight format with proper key mapping.

3. **Configuration**: All hyperparameters match the official Cosmos 2.5 2B model.

## Inference Parameters

Recommended settings for best quality:

- **Resolution**: 480x832 (or another size whose height and width are both multiples of 16)
- **Frames**: 121 (or any compatible length)
- **Steps**: 35 (with UniPC scheduler)
- **Guidance Scale**: 7.0
- **Scheduler Shift**: 5.0 (set at inference time; the stored scheduler configuration keeps the shift at 1)
- **FPS**: 24.0
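
The multiple-of-16 constraint is consistent with the 8x spatial VAE compression combined with the 2x2 spatial patch size. A minimal pre-flight check for these settings might look like the sketch below, which also reports the clip duration implied by the frame count and FPS. `check_generation_settings` is a hypothetical helper, not a FastVideo function.

```python
def check_generation_settings(height, width, num_frames,
                              fps=24.0, multiple=16) -> float:
    """Validate the resolution and return the clip duration in seconds."""
    if height % multiple or width % multiple:
        raise ValueError(f"height and width must be multiples of {multiple}")
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be positive")
    return num_frames / fps
```

With the recommended settings (480x832, 121 frames, 24 FPS) the result is a clip of roughly five seconds.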

## Model Information

- **Model Size**: ~2B parameters (transformer only)
- **Precision**: BFloat16
- **Context**: Trained for video prediction/generation
- **License**: Check NVIDIA's official license for Cosmos models

## Citation

If you use this model, please cite:

```bibtex
@misc{cosmos2024,
  title={Cosmos: Foundation Models for Video Generation},
  author={NVIDIA},
  year={2024}
}
```

## Notes

1. The repository follows the Diffusers directory layout but is loaded with FastVideo classes, not standard Diffusers classes.
2. The text encoder must be loaded separately from the official `cosmos_predict2` package.
3. For best results, use the same scheduler (`FlowUniPCMultistepScheduler`) that the official model uses.
4. The model expects text embeddings of shape (batch, 512, 100352); make sure your text encoder produces this format.