---
license: other
library_name: diffusers
pipeline_tag: text-to-video
tags:
  - wan
  - text-to-video
  - image-generation
---

<!-- README Version: v1.3 -->

# WAN2.1 VAE - 3D Causal Video Variational Autoencoder

WAN2.1 VAE is a novel 3D causal Variational Autoencoder specifically designed for high-quality video generation and compression. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.

## Model Description

The WAN2.1 VAE is built for efficient video compression and reconstruction, featuring:

- **3D Causal Architecture**: Maintains temporal causality across video sequences
- **Unlimited Length Support**: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
- **High Compression Efficiency**: Advanced spatio-temporal compression with minimal quality loss
- **Memory Optimized**: Reduced memory footprint compared to traditional video VAEs
- **Temporal Information Preservation**: Ensures consistent temporal dynamics across long sequences

### Key Innovations

1. **Improved Spatio-Temporal Compression**: Enhanced compression ratios while maintaining visual fidelity
2. **Causal Temporal Processing**: Ensures frame-to-frame causality for coherent video generation (see the padding sketch below)
3. **Efficient Memory Usage**: Optimized for consumer-grade GPU deployment
4. **High-Resolution Support**: Native support for 1080P video encoding/decoding
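
The causal processing in item 2 boils down to 3D convolutions that pad only toward the past on the time axis, so a frame's encoding never depends on later frames. The sketch below illustrates that idea in isolation; it is **not** the actual WAN layer, just a minimal example of causal temporal padding (the `CausalConv3d` name is ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only past frames on the time axis."""

    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 3, 3)):
        super().__init__()
        kt, kh, kw = kernel
        self.time_pad = kt - 1  # temporal padding is applied manually, past side only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, kh // 2, kw // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, frames, height, width]
        # Pad frames on the "past" side only, so output frame t sees input frames <= t.
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))
        return self.conv(x)

video = torch.randn(1, 3, 5, 32, 32)
out = CausalConv3d(3, 8)(video)
print(out.shape)  # torch.Size([1, 8, 5, 32, 32]) -- frame count is preserved causally
```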

## Repository Contents

```
E:\huggingface\wan21-vae\
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)
```

### Model Files

| File | Size | Format | Description |
|------|------|--------|-------------|
| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |

**Total Repository Size**: 243 MB
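
To sanity-check the download, the weights file can be inspected directly with the `safetensors` library. This is a minimal sketch; the path simply follows the layout shown above:

```python
from safetensors.torch import load_file

path = "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"

# Load the checkpoint (243 MB, so it fits comfortably in RAM) and summarize it
state_dict = load_file(path)
num_params = sum(t.numel() for t in state_dict.values())

print(f"{len(state_dict)} tensors, {num_params / 1e6:.1f}M parameters")
```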

## Hardware Requirements

### Minimum Requirements
- **VRAM**: 4 GB (inference only)
- **RAM**: 8 GB system memory
- **Disk Space**: 500 MB (including dependencies)
- **GPU**: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)

### Recommended Requirements
- **VRAM**: 8+ GB for optimal performance
- **RAM**: 16 GB system memory
- **Disk Space**: 1 GB
- **GPU**: NVIDIA RTX 3060 or better

### Resolution-Specific Requirements
- **480P Video**: 4-6 GB VRAM
- **720P Video**: 6-8 GB VRAM
- **1080P Video**: 8-12 GB VRAM
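
As a rough sketch of how these tiers could be applied programmatically, the helper below maps available GPU memory to the resolutions listed above. `suggest_max_resolution` is an illustrative name, not part of this repository:

```python
import torch

def suggest_max_resolution() -> str:
    """Map available GPU memory to the resolution tiers listed above."""
    if not torch.cuda.is_available():
        return "CPU only: expect very slow encode/decode"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 8:
        return "1080P (8-12 GB VRAM)"
    if vram_gb >= 6:
        return "720P (6-8 GB VRAM)"
    return "480P (4-6 GB VRAM)"

print(suggest_max_resolution())
```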

## Usage Examples

### Basic VAE Loading

```python
import torch
from diffusers import AutoencoderKLWan

# Load the WAN2.1 VAE.
# AutoencoderKLWan is the diffusers class for the Wan 3D causal VAE
# (a plain AutoencoderKL does not match this architecture); requires a recent diffusers release.
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

print(f"VAE loaded: {vae.config}")
```

### Video Encoding Example

```python
import torch
from diffusers import AutoencoderKLWan
import numpy as np

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Prepare video frames (dummy data here; real frames should be normalized to [-1, 1])
# Shape: [batch, channels, frames, height, width]
# The causal VAE compresses time 4x, so use a frame count of the form 4k + 1 (e.g. 17).
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()

print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")
```

### Video Decoding Example

```python
import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Decode latents back to video frames
# (`latents` comes from the encoding example above)
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample

print(f"Reconstructed video shape: {reconstructed_video.shape}")
```
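
The decoder returns a float tensor rather than displayable frames. Assuming the conventional diffusers VAE output range of roughly [-1, 1], the frames can be converted to 8-bit images as sketched below; `to_uint8_frames` is an illustrative helper, not part of this repository:

```python
import torch
from PIL import Image

def to_uint8_frames(video: torch.Tensor) -> list:
    """Convert a [batch, channels, frames, H, W] float tensor in ~[-1, 1] to PIL images."""
    clip = video[0].clamp(-1, 1)                            # first clip: [C, F, H, W]
    clip = ((clip + 1.0) / 2.0 * 255).round()               # rescale to [0, 255]
    clip = clip.to(torch.uint8).permute(1, 2, 3, 0).cpu()   # [F, H, W, C]
    return [Image.fromarray(frame.numpy()) for frame in clip]

# Example with a dummy tensor in the decoder's output layout
frames = to_uint8_frames(torch.randn(1, 3, 17, 480, 720))
frames[0].save("frame_000.png")
```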

### Integration with WAN Models

```python
import torch
from diffusers import WanPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video

# Load the custom VAE (AutoencoderKLWan matches the Wan 3D causal architecture)
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
)

# Load a WAN text-to-video model with the custom VAE
# (use the Diffusers-format checkpoint of the 1.3B model)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")

# Generate video (frame count of the form 4k + 1)
prompt = "A serene beach at sunset with waves crashing"
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]
export_to_video(video, "beach.mp4", fps=16)

print(f"Generated video: {len(video)} frames")
```

## Model Specifications

### Architecture Details
- **Type**: 3D Causal Variational Autoencoder
- **Architecture**: Causal spatio-temporal convolutions
- **Compression**: Variable compression ratios (4x, 8x, 16x depending on configuration)
- **Causality**: Temporal causal processing for frame consistency
- **Latent Dimensions**: Optimized for video generation tasks

### Technical Specifications
- **Precision**: FP16 (Half precision) recommended
- **Format**: SafeTensors (secure, efficient loading)
- **Framework**: PyTorch >= 2.4.0
- **Library**: Diffusers (Hugging Face)
- **Temporal Support**: Unlimited frame sequences
- **Resolution Support**: Up to 1080P native

### Supported Operations
- Video encoding (frames β†’ latents)
- Video decoding (latents β†’ frames)
- Temporal compression
- Spatial compression
- Causal frame generation
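
To make the compression figures above concrete, here is a shape sketch assuming the commonly cited Wan2.1 configuration of 4x temporal and 8x spatial downsampling with 16 latent channels (treat these factors as assumptions, not guarantees from this repository):

```python
def latent_shape(frames, height, width, t_down=4, s_down=8, z_channels=16):
    """Estimate the latent shape under assumed Wan2.1-style compression factors."""
    assert (frames - 1) % t_down == 0, "frame count should be of the form t_down * k + 1"
    t_latent = (frames - 1) // t_down + 1  # first frame kept, the rest compressed t_down x
    return (z_channels, t_latent, height // s_down, width // s_down)

# An 81-frame 1080P clip (illustrative values)
print(latent_shape(81, 1080, 1920))  # (16, 21, 135, 240)
```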

## Performance Tips and Optimization

### Memory Optimization
```python
# Recent diffusers releases expose slicing/tiling on the video VAEs;
# both reduce peak VRAM during encode/decode (check your installed version)
vae.enable_slicing()   # process one sample of a batch at a time
vae.enable_tiling()    # tile large frames (helpful for 720P/1080P inputs)

# Gradient checkpointing saves memory only when training/fine-tuning the VAE
vae.enable_gradient_checkpointing()

# Sequential CPU offloading and attention slicing are pipeline-level options,
# applied to the full pipeline (see the integration example) rather than the VAE:
# pipe.enable_sequential_cpu_offload()
# pipe.enable_attention_slicing(1)
```

### Speed Optimization
```python
# Use half precision for faster inference
vae = vae.half()

# Use xFormers memory-efficient attention (requires the xformers package)
vae.enable_xformers_memory_efficient_attention()

# Compile the encode/decode paths for faster inference (PyTorch 2.0+);
# compiling the bound methods ensures the compiled graphs are actually used
vae.encode = torch.compile(vae.encode, mode="reduce-overhead")
vae.decode = torch.compile(vae.decode, mode="reduce-overhead")
```
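
To check whether these settings actually help on a given GPU, a minimal timing sketch might look like the following (self-contained; it reloads the VAE from the path used in the examples above):

```python
import time
import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan", torch_dtype=torch.float16
).to("cuda")

def time_encode(vae, clip, warmup=1, iters=3):
    """Return the average encode time in seconds for one clip."""
    with torch.no_grad():
        for _ in range(warmup):            # warm up kernels / compiled graphs
            vae.encode(clip)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            vae.encode(clip)
        torch.cuda.synchronize()           # wait for queued GPU work to finish
    return (time.perf_counter() - start) / iters

clip = torch.randn(1, 3, 17, 480, 720).half().to("cuda")
print(f"Average encode time: {time_encode(vae, clip):.3f} s")
```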

### Batch Processing
```python
# Process multiple video clips efficiently (frame count of the form 4k + 1)
batch_size = 4
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")

with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
```

### Resolution Guidelines
- **480P (854Γ—480)**: Best for real-time applications, lowest VRAM
- **720P (1280Γ—720)**: Balanced quality and performance
- **1080P (1920Γ—1080)**: Maximum quality, requires high-end GPU

## License

This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.

**License Type**: Other (Custom WAN License)

### Usage Restrictions
- Check official WAN-AI repository for commercial usage terms
- Attribution required for research and non-commercial use
- Refer to [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates

## Citation

If you use this VAE in your research or applications, please cite the WAN project:

```bibtex
@misc{wan2025,
  title={WAN: Open and Advanced Large-Scale Video Generative Models},
  author={WAN-AI Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={https://huggingface.co/Wan-AI}
}
```

## Related Resources

### Official Links
- **WAN Organization**: https://huggingface.co/Wan-AI
- **WAN2.1 T2V 1.3B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- **WAN2.1 T2V 14B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
- **WAN2.2 Models**: https://huggingface.co/Wan-AI (Latest versions)
- **GitHub Repository**: https://github.com/Wan-Video

### Related Models
- **WAN2.2 VAE**: Latest VAE with 64x compression (4Γ—16Γ—16)
- **WAN2.1 T2V**: Text-to-video generation models
- **WAN2.1 I2V**: Image-to-video generation models
- **WAN2.2 Animate**: Character animation models

### Community & Support
- Hugging Face WAN-AI discussions
- GitHub issues and community forums
- Research papers and technical documentation

## Model Card Contact

For questions, issues, or collaboration inquiries:
- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
- Check the [official GitHub repository](https://github.com/Wan-Video)
- Review model-specific documentation on individual model cards

---

**Version**: v1.3
**Last Updated**: 2025-10-14
**Model Size**: 243 MB
**Format**: SafeTensors