---
license: other
library_name: diffusers
pipeline_tag: text-to-video
tags:
- wan
- text-to-video
- image-generation
---
<!-- README Version: v1.3 -->
# WAN2.1 VAE - 3D Causal Video Variational Autoencoder
WAN2.1 VAE is a novel 3D causal Variational Autoencoder specifically designed for high-quality video generation and compression. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.
## Model Description
The WAN2.1 VAE compresses video into a compact latent space and reconstructs it, with the following features:
- **3D Causal Architecture**: Maintains temporal causality across video sequences
- **Unlimited Length Support**: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
- **High Compression Efficiency**: Advanced spatio-temporal compression with minimal quality loss
- **Memory Optimized**: Reduced memory footprint compared to traditional video VAEs
- **Temporal Information Preservation**: Ensures consistent temporal dynamics across long sequences
### Key Innovations
1. **Improved Spatio-Temporal Compression**: Enhanced compression ratios while maintaining visual fidelity
2. **Causal Temporal Processing**: Ensures frame-to-frame causality for coherent video generation
3. **Efficient Memory Usage**: Optimized for consumer-grade GPU deployment
4. **High-Resolution Support**: Native support for 1080P video encoding/decoding
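To illustrate what temporal causality means in practice, the sketch below convolves a sequence along the time axis with padding on the past side only, so each output step depends on the current and earlier steps and never on later ones. It is a simplified pure-Python illustration, not the WAN implementation; the function name `causal_conv1d` and the kernel values are hypothetical.

```python
def causal_conv1d(frames, kernel):
    """Convolve a sequence along time using only current and past values.

    frames: list of floats (one scalar per time step, standing in for a frame)
    kernel: list of floats; kernel[-1] weights the current step,
            kernel[0] weights the oldest step in the window
    """
    k = len(kernel)
    # Pad only on the left (the past) so no future step leaks into the output
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(signal, kernel=[0.5, 0.5])
print(out)

# Changing a future step leaves all earlier outputs untouched
changed = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel=[0.5, 0.5])
print(out[:3] == changed[:3])
```

Because earlier outputs never depend on later inputs, a causal model can process a long video chunk by chunk while carrying forward only a small cache of past features.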
## Repository Contents
```
E:\huggingface\wan21-vae\
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)
```
### Model Files
| File | Size | Format | Description |
|------|------|--------|-------------|
| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |
**Total Repository Size**: 243 MB
## Hardware Requirements
### Minimum Requirements
- **VRAM**: 4 GB (inference only)
- **RAM**: 8 GB system memory
- **Disk Space**: 500 MB (including dependencies)
- **GPU**: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)
### Recommended Requirements
- **VRAM**: 8+ GB for optimal performance
- **RAM**: 16 GB system memory
- **Disk Space**: 1 GB
- **GPU**: NVIDIA RTX 3060 or better
### Resolution-Specific Requirements
- **480P Video**: 4-6 GB VRAM
- **720P Video**: 6-8 GB VRAM
- **1080P Video**: 8-12 GB VRAM
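As a rough sanity check on these figures, the size of the raw input tensor can be computed directly from its shape. The sketch below does only that; it is a lower bound, since intermediate activations inside the VAE typically dominate peak VRAM, and the VRAM ranges listed above are not derived from it. The helper names are hypothetical.

```python
def video_tensor_bytes(batch, channels, frames, height, width, bytes_per_value=2):
    """Size in bytes of a dense video tensor (2 bytes per value for FP16)."""
    return batch * channels * frames * height * width * bytes_per_value


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


resolutions = {"480P": (480, 854), "720P": (720, 1280), "1080P": (1080, 1920)}
for name, (h, w) in resolutions.items():
    size = video_tensor_bytes(1, 3, 16, h, w)
    print(f"{name}: {to_mib(size):.1f} MiB for 16 RGB frames in FP16")
```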
## Usage Examples
### Basic VAE Loading
```python
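import torch
from diffusers import AutoencoderKLWan
# Load the WAN2.1 VAE from the single-file checkpoint.
# AutoencoderKLWan is the 3D video VAE class in recent diffusers releases;
# the 2D AutoencoderKL class cannot load these weights.
# Assumes the checkpoint layout is one that from_single_file can convert.
vae = AutoencoderKLWan.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
    torch_dtype=torch.float16
).to("cuda")
print(f"VAE loaded: {vae.config}")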
```
### Video Encoding Example
```python
import torch
from diffusers import AutoencoderKLWan
import numpy as np
# Load VAE (AutoencoderKLWan is the 3D video VAE class in recent diffusers releases)
vae = AutoencoderKLWan.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
    torch_dtype=torch.float16
).to("cuda")
# Prepare video frames (example with dummy data)
# Shape: [batch, channels, frames, height, width]
# 17 frames: the causal encoder handles the first frame alone, then groups of 4
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")
# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()
print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")
```
### Video Decoding Example
```python
import torch
from diffusers import AutoencoderKLWan
# Load VAE (AutoencoderKLWan is the 3D video VAE class in recent diffusers releases)
vae = AutoencoderKLWan.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
    torch_dtype=torch.float16
).to("cuda")
# Decode latents back to video frames
# `latents` is assumed to come from the encoding step
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample
print(f"Reconstructed video shape: {reconstructed_video.shape}")
```
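Decoded output usually needs to be converted to 8-bit pixel values before it can be saved or displayed. The sketch below assumes the decoder returns values in roughly the range [-1, 1], which is a common convention for VAEs but is not confirmed here for this checkpoint. It uses plain Python lists in place of tensors, and the helper names are hypothetical.

```python
def to_uint8(value):
    """Map a value from [-1, 1] to an integer in [0, 255], clamping outliers."""
    clamped = max(-1.0, min(1.0, value))
    return round((clamped + 1.0) * 127.5)


def frame_to_uint8(frame):
    """Convert a nested list [row][column] of floats to 8-bit integers."""
    return [[to_uint8(v) for v in row] for row in frame]


decoded = [[-1.0, 0.0], [0.5, 1.2]]  # dummy 2x2 single-channel frame
print(frame_to_uint8(decoded))
```

With real tensors the same mapping is normally applied in one vectorized step before moving the result to the CPU.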
### Integration with WAN Models
```python
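import torch
from diffusers import DiffusionPipeline, AutoencoderKLWan
# Load custom VAE (AutoencoderKLWan is the 3D video VAE class in recent diffusers releases)
vae = AutoencoderKLWan.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
    torch_dtype=torch.float16
)
# Load WAN model with custom VAE (the Diffusers-format repository is required)
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")
# Generate video (frame counts of the form 4k+1 match the temporal compression)
prompt = "A serene beach at sunset with waves crashing"
# .frames holds one sequence of frames per prompt, so take the first entry
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]
print(f"Generated video: {len(video)} frames")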
```
## Model Specifications
### Architecture Details
- **Type**: 3D Causal Variational Autoencoder
- **Architecture**: Causal spatio-temporal convolutions
- **Compression**: 4x temporal and 8x spatial downsampling per axis (4×8×8)
- **Causality**: Temporal causal processing for frame consistency
- **Latent Dimensions**: 16 latent channels
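The relationship between input and latent shapes can be sketched with simple arithmetic. The example below assumes the commonly described WAN2.1 layout of 16 latent channels, 4x temporal compression with the first frame encoded on its own, and 8x downsampling on each spatial axis. These values are assumptions to verify against `vae.config`, and the function names are hypothetical.

```python
def wan21_latent_shape(frames, height, width, latent_channels=16,
                       temporal_factor=4, spatial_factor=8):
    """Estimate the latent shape (channels, frames, height, width) for one clip.

    Assumes a causal VAE that keeps the first frame on its own and then
    compresses every further group of `temporal_factor` frames into one
    latent frame, with `spatial_factor` downsampling on each spatial axis.
    """
    latent_frames = 1 + (frames - 1) // temporal_factor
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


def compression_ratio(frames, height, width, channels=3):
    """Ratio of input values to latent values for one clip."""
    c, t, h, w = wan21_latent_shape(frames, height, width)
    return (channels * frames * height * width) / (c * t * h * w)


print(wan21_latent_shape(17, 480, 720))
print(f"{compression_ratio(17, 480, 720):.1f}x fewer values")
```

Under these assumptions, frame counts of the form 4k+1 (17, 33, 81, and so on) map cleanly onto whole latent frames.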
### Technical Specifications
- **Precision**: FP16 (Half precision) recommended
- **Format**: SafeTensors (secure, efficient loading)
- **Framework**: PyTorch >= 2.4.0
- **Library**: Diffusers (Hugging Face)
- **Temporal Support**: Unlimited frame sequences
- **Resolution Support**: Up to 1080P native
### Supported Operations
- Video encoding (frames → latents)
- Video decoding (latents → frames)
- Temporal compression
- Spatial compression
- Causal frame generation
## Performance Tips and Optimization
### Memory Optimization
```python
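# Encode/decode in spatial tiles to lower peak VRAM on large frames
# (tiling and slicing on the Wan VAE are assumed to require a recent diffusers release)
vae.enable_tiling()
# Process batch elements one at a time instead of all at once
vae.enable_slicing()
# Gradient checkpointing only reduces memory during training, not inference.
# CPU offloading and attention slicing are pipeline-level helpers
# (for example pipe.enable_model_cpu_offload()), not methods on the VAE itself.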
```
### Speed Optimization
```python
# Use half precision for faster inference (convert before compiling)
vae = vae.half()
# Use xFormers for efficient attention (requires the xformers package)
vae.enable_xformers_memory_efficient_attention()
# Compile model for faster inference (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")
```
### Batch Processing
```python
# Process multiple video clips efficiently
batch_size = 4
# 17 frames per clip: the causal encoder handles the first frame alone, then groups of 4
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")
with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
```
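When there are more clips than fit in one batch, a small generator can split the work into fixed-size groups. This is a generic batching sketch, not part of the WAN or Diffusers APIs; `iter_batches` and the clip file names are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


clip_paths = [f"clip_{i:03d}.mp4" for i in range(10)]  # hypothetical file names
for batch in iter_batches(clip_paths, batch_size=4):
    # Here each batch would be loaded, stacked into one tensor, and encoded
    print(len(batch), batch[0], "...", batch[-1])
```

The last batch may be smaller than `batch_size`, so downstream code should not assume a fixed batch dimension.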
### Resolution Guidelines
- **480P (854×480)**: Best for real-time applications, lowest VRAM
- **720P (1280×720)**: Balanced quality and performance
- **1080P (1920×1080)**: Maximum quality, requires high-end GPU
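Models that downsample spatially generally expect frame dimensions divisible by the downsampling factor, and a width of 854 is not a multiple of 8. The sketch below rounds a resolution down to the nearest valid size. The default factor of 8 is an assumption about this VAE, some pipelines may require 16, and `snap_resolution` is a hypothetical helper.

```python
def snap_resolution(width, height, multiple=8):
    """Round each dimension down to the nearest multiple of `multiple`."""
    if width < multiple or height < multiple:
        raise ValueError("resolution is smaller than the required multiple")
    return (width // multiple * multiple, height // multiple * multiple)


for w, h in [(854, 480), (1280, 720), (1920, 1080)]:
    print((w, h), "->", snap_resolution(w, h))
```

Frames can then be center-cropped or resized to the snapped size before encoding.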
## License
This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.
**License Type**: Other (Custom WAN License)
### Usage Restrictions
- Check official WAN-AI repository for commercial usage terms
- Attribution required for research and non-commercial use
- Refer to [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates
## Citation
If you use this VAE in your research or applications, please cite the WAN project:
```bibtex
@misc{wan2025,
title={WAN: Open and Advanced Large-Scale Video Generative Models},
author={WAN-AI Team},
year={2025},
publisher={Hugging Face},
howpublished={https://huggingface.co/Wan-AI}
}
```
## Related Resources
### Official Links
- **WAN Organization**: https://huggingface.co/Wan-AI
- **WAN2.1 T2V 1.3B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- **WAN2.1 T2V 14B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
- **WAN2.2 Models**: https://huggingface.co/Wan-AI (Latest versions)
- **GitHub Repository**: https://github.com/Wan-Video
### Related Models
- **WAN2.2 VAE**: Latest VAE with 64x compression (4×16×16)
- **WAN2.1 T2V**: Text-to-video generation models
- **WAN2.1 I2V**: Image-to-video generation models
- **WAN2.2 Animate**: Character animation models
### Community & Support
- Hugging Face WAN-AI discussions
- GitHub issues and community forums
- Research papers and technical documentation
## Model Card Contact
For questions, issues, or collaboration inquiries:
- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
- Check the [official GitHub repository](https://github.com/Wan-Video)
- Review model-specific documentation on individual model cards
---
**Version**: v1.3
**Last Updated**: 2025-10-14
**Model Size**: 243 MB
**Format**: SafeTensors