	---
	license: other
	library_name: diffusers
	pipeline_tag: text-to-video
	tags:
	- wan
	- text-to-video
	- image-generation
	---

	<!-- README Version: v1.3 -->

	# WAN2.1 VAE - 3D Causal Video Variational Autoencoder

	WAN2.1 VAE is a 3D causal Variational Autoencoder designed for video compression and generation. This repository contains the standalone VAE component used by the WAN (Open and Advanced Large-Scale Video Generative Models) framework.

	## Model Description

	The WAN2.1 VAE compresses video into a compact latent space and reconstructs it, featuring:

	- 3D Causal Architecture: Maintains temporal causality across video sequences
	- Unlimited Length Support: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
	- High Compression Efficiency: Advanced spatio-temporal compression with minimal quality loss
	- Memory Optimized: Reduced memory footprint compared to traditional video VAEs
	- Temporal Information Preservation: Ensures consistent temporal dynamics across long sequences

	### Key Innovations

	1. Improved Spatio-Temporal Compression: Enhanced compression ratios while maintaining visual fidelity
	2. Causal Temporal Processing: Ensures frame-to-frame causality for coherent video generation
	3. Efficient Memory Usage: Optimized for consumer-grade GPU deployment
	4. High-Resolution Support: Native support for 1080P video encoding/decoding

	## Repository Contents

	```
	E:\huggingface\wan21-vae\
	└── vae/
	    └── wan/
	        └── wan21-vae.safetensors (243 MB)
	```

	### Model Files

	| File | Size | Format | Description |
	|------|------|--------|-------------|
	| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |

	Total Repository Size: 243 MB
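
To check a download before loading it onto a GPU, the SafeTensors header can be read with the standard library alone. The sketch below relies only on the published SafeTensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `summarize` are illustrative and not part of any library.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading any weights."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The header is UTF-8 JSON mapping tensor names to dtype, shape, and byte offsets.
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Return (number of tensors, total number of parameters) described by the header."""
    total_params = 0
    for info in header.values():
        n = 1
        for dim in info["shape"]:
            n *= dim
        total_params += n
    return len(header), total_params


if __name__ == "__main__":
    # Path taken from the repository layout above; adjust it to your local copy.
    ckpt = Path("E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors")
    if ckpt.exists():
        num_tensors, num_params = summarize(read_safetensors_header(ckpt))
        print(f"{num_tensors} tensors, {num_params:,} parameters")
```

The tensor names in the header also indicate which checkpoint layout the file uses, which helps when choosing a loader.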

	## Hardware Requirements

	### Minimum Requirements
	- VRAM: 4 GB (inference only)
	- RAM: 8 GB system memory
	- Disk Space: 500 MB (including dependencies)
	- GPU: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)

	### Recommended Requirements
	- VRAM: 8+ GB for optimal performance
	- RAM: 16 GB system memory
	- Disk Space: 1 GB
	- GPU: NVIDIA RTX 3060 or better

	### Resolution-Specific Requirements
	- 480P Video: 4-6 GB VRAM
	- 720P Video: 6-8 GB VRAM
	- 1080P Video: 8-12 GB VRAM
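
For a rough sense of scale, the size of the raw input tensor alone can be computed from its shape. This is a back-of-the-envelope sketch: it covers only the `[batch, channels, frames, height, width]` input, not model weights or intermediate activations, which account for most of the VRAM figures above. The function name and the 17-frame clip length are illustrative choices.

```python
def clip_tensor_megabytes(frames, height, width, batch=1, channels=3, bytes_per_value=2):
    """Size in MiB of one [batch, channels, frames, height, width] tensor.

    bytes_per_value=2 corresponds to FP16; use 4 for FP32.
    """
    values = batch * channels * frames * height * width
    return values * bytes_per_value / (1024 ** 2)


resolutions = {"480P": (854, 480), "720P": (1280, 720), "1080P": (1920, 1080)}
for label, (w, h) in resolutions.items():
    size = clip_tensor_megabytes(17, h, w)
    print(f"{label}: {size:.1f} MiB for a 17-frame FP16 clip")
```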

	## Usage Examples

	### Basic VAE Loading

	```python
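	import torch
	from diffusers import AutoencoderKLWan

	# Load the WAN2.1 VAE.
	# AutoencoderKLWan is the diffusers class for the Wan 3D causal VAE;
	# AutoencoderKL is a 2D image VAE and does not match these weights.
	# from_single_file is used because the repository ships one .safetensors
	# file without a config.json (assumes a recent diffusers release).
	vae = AutoencoderKLWan.from_single_file(
	    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
	    torch_dtype=torch.float16,
	).to("cuda")

	print(f"VAE loaded: {vae.config}")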
	```

	### Video Encoding Example

	```python
	import torch
	from diffusers import AutoencoderKLWan

	# Load VAE (AutoencoderKLWan is the diffusers class for the Wan 3D causal VAE)
	vae = AutoencoderKLWan.from_single_file(
	    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
	    torch_dtype=torch.float16,
	).to("cuda")

	# Prepare video frames (example with dummy data)
	# Shape: [batch, channels, frames, height, width]
	# 17 frames: the causal VAE encodes the first frame alone, then groups of 4
	video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

	# Encode video to latent space
	with torch.no_grad():
	    latents = vae.encode(video_frames).latent_dist.sample()

	print(f"Latent shape: {latents.shape}")
	print(f"Compression ratio: {video_frames.numel() / latents.numel():.2f}x")
	```
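
Real clips rarely arrive with a convenient frame count. Assuming the causal layout described for Wan2.1, where the first frame is encoded on its own and each following group of 4 frames maps to one latent frame, clip lengths of the form 4k + 1 are the natural fit. The helpers below are a hypothetical sketch for trimming a clip to such a length before encoding.

```python
def largest_valid_frame_count(num_frames, temporal_stride=4):
    """Largest count of the form temporal_stride * k + 1 that does not exceed num_frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) // temporal_stride * temporal_stride + 1


def trim_clip(frames, temporal_stride=4):
    """Drop trailing frames so the clip length has the form temporal_stride * k + 1."""
    return frames[: largest_valid_frame_count(len(frames), temporal_stride)]


clip = list(range(16))  # stand-in for 16 decoded frames
print(len(trim_clip(clip)))  # 13
```

Trimming is the simplest choice; padding by repeating the last frame is an alternative when no frames may be lost.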

	### Video Decoding Example

	```python
	import torch
	from diffusers import AutoencoderKLWan

	# Load VAE (AutoencoderKLWan is the diffusers class for the Wan 3D causal VAE)
	vae = AutoencoderKLWan.from_single_file(
	    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
	    torch_dtype=torch.float16,
	).to("cuda")

	# Decode latents back to video frames
	# Assumes `latents` comes from the encoding step above
	with torch.no_grad():
	    reconstructed_video = vae.decode(latents).sample

	print(f"Reconstructed video shape: {reconstructed_video.shape}")
	```

	### Integration with WAN Models

	```python
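	import torch
	from diffusers import AutoencoderKLWan, DiffusionPipeline

	# Load the standalone VAE from the single .safetensors file
	vae = AutoencoderKLWan.from_single_file(
	    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors",
	    torch_dtype=torch.float16,
	)

	# Load WAN model with custom VAE
	# The "-Diffusers" repository holds the diffusers-format weights
	pipe = DiffusionPipeline.from_pretrained(
	    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
	    vae=vae,
	    torch_dtype=torch.float16,
	).to("cuda")

	# Generate video (frame counts of the form 4k + 1 match the VAE's temporal compression)
	prompt = "A serene beach at sunset with waves crashing"
	# .frames is batched; index 0 selects the frames of the first video
	video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]

	print(f"Generated video: {len(video)} frames")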
	```

	## Model Specifications

	### Architecture Details
	- Type: 3D Causal Variational Autoencoder
	- Architecture: Causal spatio-temporal convolutions
	- Compression: 4× temporal and 8×8 spatial downsampling (4×8×8)
	- Causality: Temporal causal processing for frame consistency
	- Latent Dimensions: 16 latent channels
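
The latent shape for a given input follows from simple arithmetic. The sketch below assumes the figures reported for Wan2.1 (4× temporal stride, 8× spatial stride, 16 latent channels, first frame encoded on its own); `latent_shape` and `element_ratio` are illustrative helpers, not library functions.

```python
def latent_shape(batch, frames, height, width, latent_channels=16,
                 temporal_stride=4, spatial_stride=8):
    """Expected latent shape [B, C, T, H, W] for an input of [B, 3, frames, height, width]."""
    if (frames - 1) % temporal_stride != 0:
        raise ValueError("frames should have the form temporal_stride * k + 1")
    if height % spatial_stride or width % spatial_stride:
        raise ValueError("height and width should be multiples of spatial_stride")
    return (
        batch,
        latent_channels,
        (frames - 1) // temporal_stride + 1,  # first frame plus one per group
        height // spatial_stride,
        width // spatial_stride,
    )


def element_ratio(batch, frames, height, width, **kwargs):
    """Ratio of input values (3 colour channels) to latent values."""
    b, c, t, h, w = latent_shape(batch, frames, height, width, **kwargs)
    return (batch * 3 * frames * height * width) / (b * c * t * h * w)


print(latent_shape(1, 17, 480, 720))  # (1, 16, 5, 60, 90)
print(f"{element_ratio(1, 17, 480, 720):.1f}x fewer values")  # 40.8x fewer values
```

Note that the element ratio is lower than 4 × 8 × 8 = 256 because the latent has 16 channels while the input has 3.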

	### Technical Specifications
	- Precision: FP16 (Half precision) recommended
	- Format: SafeTensors (secure, efficient loading)
	- Framework: PyTorch >= 2.4.0
	- Library: Diffusers (Hugging Face)
	- Temporal Support: Unlimited frame sequences
	- Resolution Support: Up to 1080P native

	### Supported Operations
	- Video encoding (frames → latents)
	- Video decoding (latents → frames)
	- Temporal compression
	- Spatial compression
	- Causal frame generation

	## Performance Tips and Optimization

	### Memory Optimization
	```python
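	# Encode/decode in spatial tiles to cap peak VRAM on high-resolution video
	# (assumes a diffusers release in which AutoencoderKLWan provides tiling)
	vae.enable_tiling()

	# Process batch items one at a time instead of all at once
	vae.enable_slicing()

	# Note: gradient checkpointing only helps during training, and
	# enable_sequential_cpu_offload() / enable_attention_slicing() are
	# pipeline-level methods in diffusers; call them on the pipeline, not the VAE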
	```

	### Speed Optimization
	```python
	# Compile model for faster inference (PyTorch 2.0+)
	vae = torch.compile(vae, mode="reduce-overhead")

	# Use xFormers for efficient attention
	vae.enable_xformers_memory_efficient_attention()

	# Use half precision for faster inference
	vae = vae.half()
	```

	### Batch Processing
	```python
	# Process multiple video clips efficiently
	# (assumes `torch` is imported and `vae` is loaded as in the examples above)
	batch_size = 4
	video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")

	with torch.no_grad():
	    latents = vae.encode(video_clips).latent_dist.sample()
	```

	### Resolution Guidelines
	- 480P (854×480): Best for real-time applications, lowest VRAM
	- 720P (1280×720): Balanced quality and performance
	- 1080P (1920×1080): Maximum quality, requires high-end GPU
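
Not every nominal resolution divides evenly by the spatial stride. Assuming 8× spatial downsampling, sizes that are multiples of 8 map cleanly onto the latent grid, so it can help to snap the target resolution first. The helper `snap_resolution` is a hypothetical sketch.

```python
def snap_resolution(width, height, multiple=8):
    """Round width and height down to the nearest multiple of `multiple`."""
    if width < multiple or height < multiple:
        raise ValueError("resolution is smaller than one latent cell")
    return width // multiple * multiple, height // multiple * multiple


resolutions = {"480P": (854, 480), "720P": (1280, 720), "1080P": (1920, 1080)}
for name, (w, h) in resolutions.items():
    print(name, (w, h), "->", snap_resolution(w, h))
```

With these inputs only the 480P width changes (854 becomes 848); the other sizes are already multiples of 8.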

	## License

	This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.

	License Type: Other (Custom WAN License)

	### Usage Restrictions
	- Check official WAN-AI repository for commercial usage terms
	- Attribution required for research and non-commercial use
	- Refer to [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates

	## Citation

	If you use this VAE in your research or applications, please cite the WAN project:

	```bibtex
	@misc{wan2025,
	title={WAN: Open and Advanced Large-Scale Video Generative Models},
	author={WAN-AI Team},
	year={2025},
	publisher={Hugging Face},
	howpublished={https://huggingface.co/Wan-AI}
	}
	```

	## Related Resources

	### Official Links
	- WAN Organization: https://huggingface.co/Wan-AI
	- WAN2.1 T2V 1.3B Model: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
	- WAN2.1 T2V 14B Model: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
	- WAN2.2 Models: https://huggingface.co/Wan-AI (Latest versions)
	- GitHub Repository: https://github.com/Wan-Video

	### Related Models
	- WAN2.2 VAE: Latest VAE with 64x compression (4×16×16)
	- WAN2.1 T2V: Text-to-video generation models
	- WAN2.1 I2V: Image-to-video generation models
	- WAN2.2 Animate: Character animation models

	### Community & Support
	- Hugging Face WAN-AI discussions
	- GitHub issues and community forums
	- Research papers and technical documentation

	## Model Card Contact

	For questions, issues, or collaboration inquiries:
	- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
	- Check the [official GitHub repository](https://github.com/Wan-Video)
	- Review model-specific documentation on individual model cards

	---

	Version: v1.3
	Last Updated: 2025-10-14
	Model Size: 243 MB
	Format: SafeTensors