|
|
---
license: other
library_name: diffusers
pipeline_tag: text-to-video
tags:
- wan
- text-to-video
- video-generation
---
|
|
|
|
|
<!-- README Version: v1.3 -->

# WAN2.1 VAE - 3D Causal Video Variational Autoencoder

WAN2.1 VAE is a 3D causal variational autoencoder designed for high-quality video generation and compression. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.
|
|
|
|
|
## Model Description

The WAN2.1 VAE is built for efficient video compression and reconstruction, featuring:

- **3D Causal Architecture**: Maintains temporal causality across video sequences
- **Unlimited Length Support**: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
- **High Compression Efficiency**: Advanced spatio-temporal compression with minimal quality loss
- **Memory Optimized**: Reduced memory footprint compared to traditional video VAEs
- **Temporal Information Preservation**: Ensures consistent temporal dynamics across long sequences
|
|
|
|
|
### Key Innovations

1. **Improved Spatio-Temporal Compression**: Enhanced compression ratios while maintaining visual fidelity
2. **Causal Temporal Processing**: Ensures frame-to-frame causality for coherent video generation
3. **Efficient Memory Usage**: Optimized for consumer-grade GPU deployment
4. **High-Resolution Support**: Native support for 1080P video encoding/decoding
|
|
|
|
|
## Repository Contents

```
wan21-vae/
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)
```
|
|
|
|
|
### Model Files

| File | Size | Format | Description |
|------|------|--------|-------------|
| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |

**Total Repository Size**: 243 MB
|
|
|
|
|
## Hardware Requirements

### Minimum Requirements
- **VRAM**: 4 GB (inference only)
- **RAM**: 8 GB system memory
- **Disk Space**: 500 MB (including dependencies)
- **GPU**: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)
|
|
|
|
|
### Recommended Requirements
- **VRAM**: 8+ GB for optimal performance
- **RAM**: 16 GB system memory
- **Disk Space**: 1 GB
- **GPU**: NVIDIA RTX 3060 or better
|
|
|
|
|
### Resolution-Specific Requirements
- **480P Video**: 4-6 GB VRAM
- **720P Video**: 6-8 GB VRAM
- **1080P Video**: 8-12 GB VRAM
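
To check a machine against this guidance before loading the model, a minimal sketch using PyTorch's CUDA device query:

```python
import torch

# Report the local GPU and its total VRAM to compare against the table above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA-compatible GPU detected")
```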
|
|
|
|
|
## Usage Examples

### Basic VAE Loading
|
|
|
|
|
```python
import torch
from diffusers import AutoencoderKLWan  # diffusers' Wan 3D causal video VAE class

# Load the WAN2.1 VAE (the folder must hold a diffusers-format config alongside the weights)
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

print(f"VAE loaded: {vae.config}")
```
|
|
|
|
|
### Video Encoding Example

```python
import torch
import numpy as np
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Prepare video frames (example with dummy data)
# Shape: [batch, channels, frames, height, width]
# Frame counts of the form 4k+1 (e.g. 17) line up with the VAE's 4x temporal compression
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()

print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")
```
|
|
|
|
|
### Video Decoding Example

```python
import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Decode latents back to video frames
# (assumes `latents` produced by the encoding step above)
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample

print(f"Reconstructed video shape: {reconstructed_video.shape}")
```
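
To gauge the "minimal quality loss" claim on your own clips, a full round trip is a useful sanity check. A minimal sketch, reusing `video_frames` from the encoding example and assuming pixel values normalized to [-1, 1]:

```python
# Encode deterministically (distribution mode rather than a sample), decode,
# and report the reconstruction PSNR
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.mode()
    recon = vae.decode(latents).sample

mse = torch.mean((video_frames.float() - recon.float()) ** 2)
psnr = 10 * torch.log10(4.0 / mse)  # pixels in [-1, 1] span 2, so MAX^2 = 4
print(f"Reconstruction PSNR: {psnr.item():.2f} dB")
```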
|
|
|
|
|
### Integration with WAN Models

```python
import torch
from diffusers import DiffusionPipeline, AutoencoderKLWan

# Load custom VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
)

# Load the diffusers-format WAN checkpoint with the custom VAE
pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")

# Generate video (num_frames of the form 4k+1 matches the VAE's temporal compression)
prompt = "A serene beach at sunset with waves crashing"
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]

print(f"Generated video: {len(video)} frames")
```
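
To write the generated frames to disk, diffusers ships an `export_to_video` utility; the file name and frame rate below are illustrative:

```python
from diffusers.utils import export_to_video

# Save the frames returned by the pipeline as an MP4 clip
export_to_video(video, "wan_beach_sunset.mp4", fps=16)
```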
|
|
|
|
|
## Model Specifications

### Architecture Details
- **Type**: 3D Causal Variational Autoencoder
- **Architecture**: Causal spatio-temporal convolutions
- **Compression**: 4x temporal and 8x spatial compression (4×8×8)
- **Causality**: Temporal causal processing for frame consistency
- **Latent Dimensions**: 16 latent channels, sized for video generation tasks
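
As a concrete illustration of these ratios, the latent shape for a short clip works out as follows (a sketch assuming the 4×8×8 compression and 16 latent channels listed above, with the first frame kept uncompressed in time, as is typical for causal video VAEs):

```python
# Latent shape for a 17-frame 480x720 clip under 4x temporal / 8x spatial compression
T, H, W = 17, 480, 720
latent_channels = 16

t_latent = (T - 1) // 4 + 1  # 5 latent frames (first frame kept, then 1 per 4)
h_latent = H // 8            # 60
w_latent = W // 8            # 90

print((1, latent_channels, t_latent, h_latent, w_latent))  # (1, 16, 5, 60, 90)
```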
|
|
|
|
|
### Technical Specifications
- **Precision**: FP16 (half precision) recommended
- **Format**: SafeTensors (secure, efficient loading)
- **Framework**: PyTorch >= 2.4.0
- **Library**: Diffusers (Hugging Face)
- **Temporal Support**: Unlimited frame sequences
- **Resolution Support**: Up to 1080P native
|
|
|
|
|
### Supported Operations
- Video encoding (frames → latents)
- Video decoding (latents → frames)
- Temporal compression
- Spatial compression
- Causal frame generation
|
|
|
|
|
## Performance Tips and Optimization

### Memory Optimization
```python
# Gradient checkpointing trades compute for memory; it matters during
# training and has no effect on pure inference
vae.enable_gradient_checkpointing()

# Tiled encode/decode bounds peak VRAM at high resolutions, where the
# VAE class exposes it (as diffusers video VAEs generally do)
vae.enable_tiling()

# CPU offload and attention slicing are pipeline-level methods in diffusers;
# call them on the full pipeline rather than on the bare VAE
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing(1)
```
|
|
|
|
|
### Speed Optimization
```python
# Compile the model for faster inference (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")

# xFormers memory-efficient attention (helps only where the model uses attention layers)
vae.enable_xformers_memory_efficient_attention()

# Half precision for faster inference (redundant if already loaded with torch_dtype=torch.float16)
vae = vae.half()
```
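
To confirm these settings actually pay off on a given GPU, a minimal timing sketch (CUDA kernels run asynchronously, so synchronize around the timed region; assumes `latents` from the examples above):

```python
import time
import torch

# Time one decode pass, synchronizing so the measurement covers the GPU work
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    _ = vae.decode(latents).sample
torch.cuda.synchronize()
print(f"Decode time: {time.perf_counter() - start:.3f} s")
```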
|
|
|
|
|
### Batch Processing
```python
# Process multiple video clips in one batched call
batch_size = 4
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")

with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
```
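
Peak VRAM grows roughly linearly with the batch dimension. When a full batch does not fit, a micro-batching sketch like the following keeps memory bounded (`micro_batch` is an illustrative value to tune per GPU):

```python
# Stream the batch through the VAE in smaller micro-batches to bound peak VRAM
micro_batch = 2
chunks = []
with torch.no_grad():
    for clip in video_clips.split(micro_batch, dim=0):
        chunks.append(vae.encode(clip).latent_dist.sample())
latents = torch.cat(chunks, dim=0)
```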
|
|
|
|
|
### Resolution Guidelines
- **480P (854×480)**: Best for real-time applications, lowest VRAM
- **720P (1280×720)**: Balanced quality and performance
- **1080P (1920×1080)**: Maximum quality, requires a high-end GPU
|
|
|
|
|
## License

This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.

**License Type**: Other (Custom WAN License)

### Usage Restrictions
- Check the official WAN-AI repository for commercial usage terms
- Attribution required for research and non-commercial use
- Refer to the [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates
|
|
|
|
|
## Citation

If you use this VAE in your research or applications, please cite the WAN project:

```bibtex
@misc{wan2025,
  title={WAN: Open and Advanced Large-Scale Video Generative Models},
  author={WAN-AI Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan-AI}}
}
```
|
|
|
|
|
## Related Resources

### Official Links
- **WAN Organization**: https://huggingface.co/Wan-AI
- **WAN2.1 T2V 1.3B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- **WAN2.1 T2V 14B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
- **WAN2.2 Models**: https://huggingface.co/Wan-AI (latest versions)
- **GitHub Repository**: https://github.com/Wan-Video
|
|
|
|
|
### Related Models
- **WAN2.2 VAE**: Latest VAE with 64x compression (4×16×16)
- **WAN2.1 T2V**: Text-to-video generation models
- **WAN2.1 I2V**: Image-to-video generation models
- **WAN2.2 Animate**: Character animation models
|
|
|
|
|
### Community & Support
- Hugging Face WAN-AI discussions
- GitHub issues and community forums
- Research papers and technical documentation
|
|
|
|
|
## Model Card Contact

For questions, issues, or collaboration inquiries:
- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
- Check the [official GitHub repository](https://github.com/Wan-Video)
- Review model-specific documentation on individual model cards
|
|
|
|
|
---
|
|
|
|
|
**Version**: v1.3

**Last Updated**: 2025-10-14

**Model Size**: 243 MB

**Format**: SafeTensors
|
|
|