---
license: other
library_name: diffusers
pipeline_tag: text-to-video
tags:
- wan
- text-to-video
- image-generation
---
<!-- README Version: v1.3 -->
# WAN2.1 VAE - 3D Causal Video Variational Autoencoder
WAN2.1 VAE is a 3D causal Variational Autoencoder designed for video compression and high-quality video generation. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.
## Model Description
The WAN2.1 VAE compresses video into a latent space and reconstructs it from that space. Its main features are:
- **3D Causal Architecture**: Maintains temporal causality across video sequences
- **Unlimited Length Support**: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
- **High Compression Efficiency**: Advanced spatio-temporal compression with minimal quality loss
- **Memory Optimized**: Reduced memory footprint compared to traditional video VAEs
- **Temporal Information Preservation**: Ensures consistent temporal dynamics across long sequences
### Key Innovations
1. **Improved Spatio-Temporal Compression**: Enhanced compression ratios while maintaining visual fidelity
2. **Causal Temporal Processing**: Ensures frame-to-frame causality for coherent video generation
3. **Efficient Memory Usage**: Optimized for consumer-grade GPU deployment
4. **High-Resolution Support**: Native support for 1080P video encoding/decoding
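The idea behind causal temporal processing can be illustrated with a one-dimensional convolution along the time axis. This is a minimal sketch in plain Python, not the model's actual layer: the function name `causal_temporal_filter` and the kernel values are hypothetical. Padding is applied only on the "past" side, so the output for frame `t` depends on frames up to `t` and never on later ones.

```python
def causal_temporal_filter(frames, kernel):
    """Convolve a per-frame signal along time using only current and past frames."""
    k = len(kernel)
    # Left-pad with zeros so output t sees frames t-k+1 .. t and nothing after t.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.25, 0.25, 0.5]  # the last weight applies to the current frame
out = causal_temporal_filter(signal, kernel)

# Changing a future frame leaves all earlier outputs untouched.
changed = causal_temporal_filter([1.0, 2.0, 3.0, 99.0], kernel)
assert out[:3] == changed[:3]
```

A real 3D causal convolution applies the same one-sided padding along the frame axis while padding height and width symmetrically.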
## Repository Contents
```
E:\huggingface\wan21-vae\
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)
```
### Model Files
| File | Size | Format | Description |
|------|------|--------|-------------|
| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |
**Total Repository Size**: 243 MB
## Hardware Requirements
### Minimum Requirements
- **VRAM**: 4 GB (inference only)
- **RAM**: 8 GB system memory
- **Disk Space**: 500 MB (including dependencies)
- **GPU**: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)
### Recommended Requirements
- **VRAM**: 8+ GB for optimal performance
- **RAM**: 16 GB system memory
- **Disk Space**: 1 GB
- **GPU**: NVIDIA RTX 3060 or better
### Resolution-Specific Requirements
- **480P Video**: 4-6 GB VRAM
- **720P Video**: 6-8 GB VRAM
- **1080P Video**: 8-12 GB VRAM
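For a rough sense of scale, the raw size of an input clip can be computed directly. This is only a sketch: the helper `tensor_mebibytes` is hypothetical, and it counts the input tensor alone, whereas the VRAM figures above also have to cover model weights and intermediate activations, which typically take far more space.

```python
def tensor_mebibytes(batch, channels, frames, height, width, bytes_per_value=2):
    """Size in MiB of a dense [batch, channels, frames, height, width] tensor.

    bytes_per_value=2 corresponds to FP16.
    """
    values = batch * channels * frames * height * width
    return values * bytes_per_value / 1024**2


# Frame sizes as (width, height), taken from the resolution guidelines in this README.
resolutions = {"480P": (854, 480), "720P": (1280, 720), "1080P": (1920, 1080)}
sizes = {
    name: tensor_mebibytes(1, 3, 16, height, width)
    for name, (width, height) in resolutions.items()
}
for name, size in sizes.items():
    print(f"{name}: {size:.1f} MiB for a 16-frame FP16 clip")
```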
## Usage Examples
### Basic VAE Loading
```python
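import torch
from diffusers import AutoencoderKLWan  # Wan 3D VAE class (recent diffusers releases)

# Load the WAN2.1 VAE.
# AutoencoderKL is the 2D image VAE and cannot process 5D video tensors.
# from_pretrained expects a config.json next to the weights in this directory.
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16,
).to("cuda")
print(f"VAE loaded: {vae.config}")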
```
### Video Encoding Example
```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16,
).to("cuda")

# Prepare video frames (example with dummy data)
# Shape: [batch, channels, frames, height, width]
# 17 frames: the causal VAE takes the first frame alone, then groups of 4 (4n + 1)
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()

print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")
```
### Video Decoding Example
```python
import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16,
).to("cuda")

# Decode latents back to video frames
# Assuming `latents` comes from the encoding step
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample

print(f"Reconstructed video shape: {reconstructed_video.shape}")
```
### Integration with WAN Models
```python
import torch
from diffusers import AutoencoderKLWan, DiffusionPipeline

# Load custom VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16,
)

# Load WAN model with custom VAE.
# The "-Diffusers" repository holds the checkpoint in diffusers layout.
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

# Generate video (use a frame count of the form 4n + 1)
prompt = "A serene beach at sunset with waves crashing"
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]  # first video in the batch
print(f"Generated video: {len(video)} frames")
```
## Model Specifications
### Architecture Details
- **Type**: 3D Causal Variational Autoencoder
- **Architecture**: Causal spatio-temporal convolutions
- **Compression**: 4× along time and 8× along each spatial axis (4×8×8)
- **Causality**: Temporal causal processing for frame consistency
- **Latent Dimensions**: 16 latent channels
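The relationship between input size and latent size can be sketched as a small calculation. The defaults below (4× temporal and 8× spatial factors, 16 latent channels) are assumptions based on the figures reported for WAN2.1; verify them against the loaded model's config before relying on them. The helper `latent_shape` is hypothetical and not part of any library.

```python
def latent_shape(frames, height, width,
                 latent_channels=16, temporal_factor=4, spatial_factor=8):
    """Return (channels, frames, height, width) of the latent for one clip.

    Assumes a causal layout: the first frame maps to its own latent frame,
    and every following group of `temporal_factor` frames maps to one more.
    """
    if (frames - 1) % temporal_factor != 0:
        raise ValueError("frame count should have the form temporal_factor * n + 1")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width should be multiples of spatial_factor")
    latent_frames = 1 + (frames - 1) // temporal_factor
    return (latent_channels, latent_frames,
            height // spatial_factor, width // spatial_factor)


shape = latent_shape(frames=17, height=480, width=720)
input_values = 3 * 17 * 480 * 720
latent_values = shape[0] * shape[1] * shape[2] * shape[3]
print(f"Latent shape: {shape}, element-count ratio: {input_values / latent_values:.1f}x")
```

The element-count ratio is lower than the product of the downsampling factors because the latent has more channels than the RGB input.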
### Technical Specifications
- **Precision**: FP16 (Half precision) recommended
- **Format**: SafeTensors (secure, efficient loading)
- **Framework**: PyTorch >= 2.4.0
- **Library**: Diffusers (Hugging Face)
- **Temporal Support**: Unlimited frame sequences
- **Resolution Support**: Up to 1080P native
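Support for long sequences in a causal video VAE is commonly achieved by processing the video in temporal chunks rather than all at once. The sketch below assumes the scheme described for WAN2.1, where the first frame forms its own chunk and the remaining frames are handled in groups of four; the helper `temporal_chunks` is hypothetical and only plans the frame ranges, it does not run the model.

```python
def temporal_chunks(num_frames, chunk=4):
    """Return (start, stop) frame ranges: the first frame alone, then groups of `chunk`."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    ranges = [(0, 1)]
    start = 1
    while start < num_frames:
        stop = min(start + chunk, num_frames)
        ranges.append((start, stop))
        start = stop
    return ranges


plan = temporal_chunks(17)
print(plan)
```

Because each chunk only needs the cached state of earlier chunks, peak memory depends on the chunk size rather than on the total number of frames.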
### Supported Operations
- Video encoding (frames → latents)
- Video decoding (latents → frames)
- Temporal compression
- Spatial compression
- Causal frame generation
## Performance Tips and Optimization
### Memory Optimization
```python
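# Process batch elements one at a time to lower peak memory
# (available on the Wan VAE in recent diffusers releases)
vae.enable_slicing()
# Encode and decode in spatial tiles for high-resolution video
vae.enable_tiling()
# Gradient checkpointing only helps during training, not inference.
# enable_sequential_cpu_offload() and enable_attention_slicing() are
# pipeline methods: call them on the pipeline object, not on the VAE.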
```
### Speed Optimization
```python
# Use half precision for faster inference (cast before compiling)
vae = vae.half()
# Optional: xFormers attention; requires the xformers package to be installed
vae.enable_xformers_memory_efficient_attention()
# Compile last so the compiled model reflects the settings above (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")
```
### Batch Processing
```python
# Process multiple video clips efficiently
batch_size = 4
# 17 frames per clip (4n + 1)
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")
with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
```
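When there are more clips than fit into one batch, they can be processed in fixed-size groups. The following is a minimal sketch of that loop structure; the `batched` helper and the integer clip IDs are placeholders, and in practice each group would be stacked into one tensor and passed to `vae.encode`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


clip_ids = list(range(10))  # stand-ins for ten video clips
batches = list(batched(clip_ids, batch_size=4))
for group in batches:
    print(f"Encoding clips {group}")
```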
### Resolution Guidelines
- **480P (854×480)**: Best for real-time applications, lowest VRAM
- **720P (1280×720)**: Balanced quality and performance
- **1080P (1920×1080)**: Maximum quality, requires high-end GPU
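These guidelines can be turned into a simple selection rule. The sketch below is illustrative only: it uses the upper end of each VRAM range from the Resolution-Specific Requirements section as a conservative threshold, and the name `highest_resolution` is hypothetical.

```python
# Upper ends of the VRAM ranges listed under "Resolution-Specific Requirements".
VRAM_NEEDED_GB = {"480P": 6, "720P": 8, "1080P": 12}


def highest_resolution(available_vram_gb):
    """Return the largest listed resolution that fits, or None if none does."""
    best = None
    for name in ("480P", "720P", "1080P"):  # ordered from smallest to largest
        if available_vram_gb >= VRAM_NEEDED_GB[name]:
            best = name
    return best


choice = highest_resolution(8)
print(f"Suggested resolution for 8 GB: {choice}")
```

On a CUDA system, the available memory can be read with `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device.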
## License
This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.
**License Type**: Other (Custom WAN License)
### Usage Restrictions
- Check official WAN-AI repository for commercial usage terms
- Attribution required for research and non-commercial use
- Refer to [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates
## Citation
If you use this VAE in your research or applications, please cite the WAN project:
```bibtex
@misc{wan2025,
title={WAN: Open and Advanced Large-Scale Video Generative Models},
author={WAN-AI Team},
year={2025},
publisher={Hugging Face},
howpublished={https://huggingface.co/Wan-AI}
}
```
## Related Resources
### Official Links
- **WAN Organization**: https://huggingface.co/Wan-AI
- **WAN2.1 T2V 1.3B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- **WAN2.1 T2V 14B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
- **WAN2.2 Models**: https://huggingface.co/Wan-AI (Latest versions)
- **GitHub Repository**: https://github.com/Wan-Video
### Related Models
- **WAN2.2 VAE**: Latest VAE with 64x compression (4×16×16)
- **WAN2.1 T2V**: Text-to-video generation models
- **WAN2.1 I2V**: Image-to-video generation models
- **WAN2.2 Animate**: Character animation models
### Community & Support
- Hugging Face WAN-AI discussions
- GitHub issues and community forums
- Research papers and technical documentation
## Model Card Contact
For questions, issues, or collaboration inquiries:
- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
- Check the [official GitHub repository](https://github.com/Wan-Video)
- Review model-specific documentation on individual model cards
---
**Version**: v1.3
**Last Updated**: 2025-10-14
**Model Size**: 243 MB
**Format**: SafeTensors