wangkanai committed on
Commit 00697ac · verified · 1 Parent(s): 708f9ec

Add files using upload-large-folder tool

Files changed (1): README.md +437 -0
README.md ADDED
@@ -0,0 +1,437 @@
---
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: text-to-video
tags:
- video-generation
- vae
- wan
- autoencoder
- latent-space
- video-compression
- wan2.5
base_model: Wan-AI/Wan2.5
base_model_relation: component
---

<!-- README Version: v1.0 -->

# WAN25 VAE - Video Autoencoder v1.0

⚠️ **Repository Status**: This repository is currently a placeholder for WAN 2.5 VAE models. The directory structure is prepared, but the model files have not yet been downloaded.

High-performance Variational Autoencoder (VAE) component for the WAN 2.5 video generation system. This VAE encodes video content into a compact latent space and decodes it back with high quality, enabling video generation with reduced computational requirements.

## Model Description

The WAN25-VAE is the next-generation variational autoencoder for video content processing in the WAN 2.5 video generation pipeline. Building on the WAN 2.1 and WAN 2.2 VAE architectures, it compresses video frames into a compact latent representation and reconstructs them with high fidelity, enabling efficient text-to-video and image-to-video workflows.

### Key Capabilities (Expected)

- **Advanced Video Compression**: Efficient encoding of video frames into latent space representations with improved compression ratios
- **High-Fidelity Reconstruction**: Accurate decoding back to pixel space with minimal quality loss
- **Temporal Coherence**: Enhanced consistency across video frames during encoding/decoding
- **Memory Efficiency**: Reduced VRAM requirements during video generation inference
- **Pipeline Integration**: Integrates seamlessly with WAN 2.5 video generation models
- **Native Audio Support**: Expected integration with audio-visual generation capabilities

### Technical Highlights

- Architecture optimized for temporal video data with spatio-temporal convolutions
- 3D causal VAE design ensuring temporal coherence
- Supports various frame rates and resolutions (480P, 720P, 1080P)
- Expected compression-ratio improvements over the WAN 2.2 VAE (4×16×16)
- Low-latency encoding/decoding for real-time applications
- Precision-optimized for stable inference on consumer hardware
### WAN VAE Evolution

| Version | Compression Ratio (T×H×W) | Key Features |
|---------|---------------------------|--------------|
| **WAN 2.1 VAE** | 4×8×8 | Initial 3D causal VAE, efficient 1080P encoding |
| **WAN 2.2 VAE** | 4×16×16 | Enhanced compression (64x overall), improved quality |
| **WAN 2.5 VAE** | TBD | Expected: audio-visual integration, further optimizations |

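As a rough guide to what these ratios mean in practice, here is a small illustrative sketch. It assumes the WAN 2.1/2.2 convention that the first frame of a causal clip is kept whole (so T frames map to 1 + (T - 1) / t_ratio latent frames); the actual WAN 2.5 behavior is unconfirmed:

```python
# Illustrative latent-grid arithmetic for the compression ratios above.
# Assumption: WAN-style causal handling of the first frame.
def latent_grid(frames, height, width, ratio=(4, 16, 16)):
    t, h, w = ratio
    return (1 + (frames - 1) // t, height // h, width // w)

print(latent_grid(49, 720, 1280))              # WAN 2.2-style: (13, 45, 80)
print(latent_grid(49, 720, 1280, (4, 8, 8)))   # WAN 2.1-style: (13, 90, 160)
```
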
## Repository Contents

```
wan25-vae/
└── vae/
    └── wan/
        └── (Model files pending download)
```

**Current Status**: Directory structure prepared, awaiting model file downloads.

### Expected File Structure

| File | Expected Size | Description |
|------|---------------|-------------|
| `wan25-vae.safetensors` | ~1.5-2.0 GB | WAN25 VAE model weights in safetensors format |
| `config.json` | ~1-5 KB | Model configuration and architecture parameters |

## Hardware Requirements

### Minimum Requirements (Estimated)
- **VRAM**: 2-3 GB (VAE inference only)
- **System RAM**: 4 GB
- **Disk Space**: 2.5 GB free
- **GPU**: CUDA-compatible GPU (NVIDIA) or compatible accelerator

### Recommended Specifications
- **VRAM**: 6+ GB for comfortable operation with the video generation pipeline
- **System RAM**: 16+ GB
- **GPU**: NVIDIA RTX 3060 or better; RTX 4060+ recommended
- **Storage**: SSD for faster model loading

### Performance Notes
- VAE operations are typically memory-bound rather than compute-bound
- Larger batch sizes require proportionally more VRAM (see the sketch below)
- CPU inference is possible but significantly slower (roughly 30-50x)
- WAN 2.5 may include audio processing requiring additional compute

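A back-of-the-envelope sketch of why batch size drives VRAM. This counts raw pixel buffers only; intermediate encoder activations typically multiply the figure several-fold:

```python
# Rough VRAM estimate for a raw frame batch (assumptions: RGB input, FP16 storage).
def frame_batch_mib(batch, height, width, channels=3, bytes_per_el=2):
    return batch * channels * height * width * bytes_per_el / 2**20

# 8 frames at 720P in FP16 is ~42 MiB before any intermediate activations.
print(f"{frame_batch_mib(8, 720, 1280):.1f} MiB")
```
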
## Usage Examples

### Basic Usage with Diffusers (Placeholder)

```python
import torch
from diffusers import AutoencoderKL

# Load the WAN25 VAE (when available)
vae_path = r"E:\huggingface\wan25-vae\vae\wan"
vae = AutoencoderKL.from_pretrained(
    vae_path,
    torch_dtype=torch.float16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
vae = vae.to(device)

# Encode video frames to latent space
# video_frames: user-supplied tensor of shape [batch, channels, height, width]
# (frames stacked along the batch dimension)
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

# Decode latents back to pixel space
with torch.no_grad():
    decoded_frames = vae.decode(latents / vae.config.scaling_factor).sample
```

### Integration with WAN 2.5 Video Generation Pipeline

```python
import torch
from diffusers import DiffusionPipeline

# Load the WAN 2.5 video generation pipeline with the custom VAE
pipeline = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.5-T2V",  # Example WAN 2.5 model path
    vae=vae,              # Use the loaded WAN25-VAE
    torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

# Generate video from a text prompt
prompt = "A serene sunset over mountains with flowing clouds and ambient nature sounds"
video_frames = pipeline(
    prompt=prompt,
    num_frames=48,  # WAN 2.5 may support longer sequences
    height=720,
    width=1280,
    num_inference_steps=50
).frames
```

### Memory-Efficient Video Processing

```python
import torch

# Enable memory-efficient attention for large videos
vae.enable_xformers_memory_efficient_attention()

# Process video in smaller chunks (reuses `vae` and `device` from the basic-usage example)
def encode_video_chunks(video_tensor, chunk_size=8):
    """Encode video frames in chunks to reduce peak VRAM usage."""
    latents = []
    for i in range(0, video_tensor.shape[0], chunk_size):
        chunk = video_tensor[i:i + chunk_size].to(device)
        with torch.no_grad():
            chunk_latents = vae.encode(chunk).latent_dist.sample()
        latents.append(chunk_latents.cpu())
    return torch.cat(latents, dim=0)
```

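Diffusers VAEs also expose built-in memory savers that achieve a similar effect; whether the WAN 2.5 VAE class keeps them is an assumption based on `AutoencoderKL`:

```python
# Built-in memory savers on diffusers' AutoencoderKL (assumed to carry over
# to the WAN 2.5 VAE class).
vae.enable_slicing()  # encode/decode one batch element at a time
vae.enable_tiling()   # split large frames into overlapping tiles
```
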
### Advanced Latent Space Operations

```python
import torch
import numpy as np

# Encode the input video (input_frames: user-supplied, as in the basic-usage example)
with torch.no_grad():
    latents = vae.encode(input_frames).latent_dist.sample()

# Apply transformations in latent space (e.g., interpolation)
latents_start = latents[0]
latents_end = latents[-1]

# Create a smooth interpolation between the first and last frames
interpolated_latents = []
for alpha in np.linspace(0, 1, 24):
    interpolated = (1 - alpha) * latents_start + alpha * latents_end
    interpolated_latents.append(interpolated)

# Decode the interpolated latents
with torch.no_grad():
    smooth_video = vae.decode(torch.stack(interpolated_latents)).sample
```


## Model Specifications

### Architecture Details (Expected)
- **Model Type**: Spatio-temporal variational autoencoder (3D causal VAE)
- **Architecture**: Convolutional encoder-decoder with KL-divergence regularization
- **Input Format**: Video frames (RGB) with potential audio integration
- **Latent Dimensions**: Compressed spatial resolution with channel expansion
- **Temporal Processing**: 3D causal convolutions for temporal coherence (see the sketch after this list)
- **Activation Functions**: Mixed (SiLU, tanh for output)

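To make the "causal" part concrete, here is a minimal, hypothetical sketch of a temporally causal 3D convolution: padding is applied only on the past side of the time axis, so each output frame depends only on current and earlier frames. This illustrates the general technique, not the actual WAN 2.5 block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Conv3d whose receptive field never looks at future frames."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.pad_t = k - 1  # pad only the past side of the time axis
        self.conv = nn.Conv3d(c_in, c_out, k, padding=(0, k // 2, k // 2))

    def forward(self, x):  # x: [batch, channels, time, height, width]
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # pad order: (W, W, H, H, T, T)
        return self.conv(x)

out = CausalConv3d(3, 16)(torch.randn(1, 3, 8, 64, 64))
print(out.shape)  # torch.Size([1, 16, 8, 64, 64])
```
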
### Technical Specifications
- **Format**: SafeTensors (secure, efficient binary format)
- **Precision**: Mixed-precision compatible (FP16/FP32/BF16)
- **Framework**: PyTorch-based, compatible with the Diffusers library
- **Parameters**: Estimated ~400-500M (based on the WAN 2.2 progression)
- **Compression Ratio**: Expected improvements over WAN 2.2's 4×16×16
- **Perceptual Optimization**: Pre-trained perceptual networks for quality preservation

### Supported Input Resolutions
- **Standard**: 480P (854×480), 720P (1280×720), 1080P (1920×1080)
- **Aspect Ratios**: 16:9, 4:3, 1:1, and custom ratios
- **Frame Rates**: 24 fps, 30 fps, 60 fps support expected

## Performance Tips and Optimization

### Memory Optimization
```python
# Enable gradient checkpointing for training (if fine-tuning)
vae.enable_gradient_checkpointing()

# Use float16 for inference to reduce VRAM usage
vae = vae.half()

# Process frames in batches
batch_size = 4  # Adjust based on available VRAM

# CPU offloading is a pipeline-level feature in diffusers
pipeline.enable_model_cpu_offload()
```

### Speed Optimization
```python
# Compile the model with torch.compile (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")

# Use channels_last memory format for better performance
vae = vae.to(memory_format=torch.channels_last)

# Enable TF32 on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use xFormers for memory-efficient attention
vae.enable_xformers_memory_efficient_attention()
```

### Quality vs Speed Trade-offs
- **High Quality**: FP32 precision, larger batch sizes, tiling disabled
- **Balanced**: FP16 precision, moderate batch sizes (4-8 frames)
- **Fast Inference**: FP16 precision, smaller batches (1-2 frames), tiling enabled
- **Ultra Fast**: BF16 precision, aggressive tiling, model compilation

### Best Practices
- Always use the safetensors format for security and compatibility
- Monitor VRAM usage with `torch.cuda.memory_allocated()`
- Clear cache between large operations: `torch.cuda.empty_cache()`
- Use mixed-precision training if fine-tuning the VAE
- Validate reconstruction quality with perceptual metrics (LPIPS, SSIM, PSNR); see the sketch below
- Consider video-specific quality metrics (VMAF, VQM)

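A minimal sketch of such a validation step, assuming `torchmetrics` is installed (`pip install torchmetrics`) and frames are scaled to [0, 1]:

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

# Stand-in tensors; in practice compare input frames with their VAE reconstructions
original = torch.rand(4, 3, 256, 256)
reconstructed = (original + 0.01 * torch.randn_like(original)).clamp(0, 1)

print(f"PSNR: {psnr(reconstructed, original):.2f} dB")
print(f"SSIM: {ssim(reconstructed, original):.4f}")
```
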
## Getting Started

### Step 1: Download WAN 2.5 VAE Model

When the WAN 2.5 VAE becomes available, download it from Hugging Face:

```python
# Using huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.5-VAE",  # Check official repo name
    local_dir="E:/huggingface/wan25-vae/vae/wan",
    allow_patterns=["*.safetensors", "*.json"]
)
```

Or use git-lfs:

```bash
cd E:/huggingface/wan25-vae/vae/wan
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.5-VAE .
```

### Step 2: Install Dependencies

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate xformers safetensors
```

### Step 3: Verify Installation

```python
import os

import torch
from diffusers import AutoencoderKL

# Check whether the model files exist
vae_path = r"E:\huggingface\wan25-vae\vae\wan"
if os.path.exists(os.path.join(vae_path, "config.json")):
    print("✓ WAN25 VAE model found")
    vae = AutoencoderKL.from_pretrained(vae_path)
    print(f"✓ Model loaded successfully with {sum(p.numel() for p in vae.parameters()) / 1e6:.1f}M parameters")
else:
    print("✗ WAN25 VAE model not found. Please download first.")
```

## License

This model is released under a custom WAN license. Please review the license terms before use:

- **Commercial Use**: Subject to WAN license terms and conditions
- **Research Use**: Generally permitted with proper attribution
- **Redistribution**: Refer to the original WAN model license
- **Modifications**: Check the license for derivative-work permissions

For complete license details, refer to the official WAN model repository or license documentation at:
- https://huggingface.co/Wan-AI
- https://wan.video/

## Citation

If you use this VAE in your research or projects, please cite:

```bibtex
@misc{wan25-vae,
  title={WAN25 VAE: Advanced Video Variational Autoencoder for WAN 2.5 Video Generation},
  author={WAN Model Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan-AI/Wan2.5-VAE}}
}
```

For the broader WAN 2.5 system:

```bibtex
@article{wan2025,
  title={Wan: Open and Advanced Large-Scale Video Generative Models},
  author={WAN Research Team},
  journal={arXiv preprint},
  year={2025}
}
```

## Related Resources

### Official Links
- **WAN Official Website**: [https://wan.video/](https://wan.video/)
- **WAN 2.5 Announcement**: [https://wan25.ai/](https://wan25.ai/)
- **Hugging Face Organization**: [https://huggingface.co/Wan-AI](https://huggingface.co/Wan-AI)
- **GitHub Repository**: [https://github.com/Wan-Video](https://github.com/Wan-Video)
- **Diffusers Documentation**: [https://huggingface.co/docs/diffusers](https://huggingface.co/docs/diffusers)
- **Model Hub**: [https://huggingface.co/models](https://huggingface.co/models)

### Related WAN Models (Local Repository)
- **WAN 2.1 VAE**: `E:\huggingface\wan21-vae\` - Previous-generation VAE
- **WAN 2.2 VAE**: `E:\huggingface\wan22-vae\` - Current-generation VAE (1.4 GB)
- **WAN 2.5 FP16**: `E:\huggingface\wan25-fp16\` - Main model in FP16 precision
- **WAN 2.5 FP8**: `E:\huggingface\wan25-fp8\` - Optimized FP8 variant
- **WAN 2.5 LoRAs**: `E:\huggingface\wan25-fp16-loras\` - Enhancement modules

### Community Resources
- **WAN Community**: Discussions and examples for WAN video generation
- **Video Generation Papers**: Research on video diffusion and VAE architectures
- **Optimization Guides**: Tips for efficient video processing with VAEs
- **ArXiv Paper**: Wan: Open and Advanced Large-Scale Video Generative Models

### Compatibility
- **Required Libraries**: `torch>=2.0.0`, `diffusers>=0.21.0`, `transformers>=4.30.0` (see the version check below)
- **Compatible With**: WAN 2.5 video generation models, custom video pipelines
- **Integration Examples**: Check the Diffusers documentation for VAE integration patterns
- **Hardware**: NVIDIA GPUs with CUDA 11.8+ or 12.1+; AMD ROCm support may vary

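A quick way to check the local environment against the version floors listed above, using only the standard library:

```python
# Print installed versions of the required libraries for manual comparison.
import importlib.metadata as metadata

for pkg, minimum in {"torch": "2.0.0", "diffusers": "0.21.0", "transformers": "4.30.0"}.items():
    try:
        print(f"{pkg}: {metadata.version(pkg)} (need >= {minimum})")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed (need >= {minimum})")
```
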
## Technical Support

For technical issues, questions, or contributions:

1. **Model Issues**: Report to the WAN-AI Hugging Face repository
2. **Integration Questions**: Consult the Diffusers documentation and community
3. **Performance Optimization**: Check the PyTorch performance tuning guides
4. **Local Setup**: Verify CUDA installation and GPU compatibility
5. **Community Support**: WAN Discord/Forum (check the official website)

## Troubleshooting

### Common Issues

**Model Not Found Error:**
```python
# Ensure model files are downloaded to the correct path
# Expected location: E:\huggingface\wan25-vae\vae\wan\
```

**VRAM Out of Memory:**
```python
# Reduce batch size and use FP16 precision
vae = vae.half()
# CPU offloading is enabled on the pipeline, not the VAE
pipeline.enable_model_cpu_offload()
```

**Slow Inference Speed:**
```python
# Enable xFormers and model compilation
vae.enable_xformers_memory_efficient_attention()
vae = torch.compile(vae)
```

---

**Version**: v1.0
**Last Updated**: 2025-10-13
**Model Format**: SafeTensors (when available)
**Repository Status**: Placeholder - Awaiting model download
**Expected Model Size**: ~1.5-2.0 GB

## Changelog

### v1.0 (Initial Documentation - 2025-10-13)
- Initial placeholder documentation for the WAN25-VAE repository
- Comprehensive usage examples based on WAN 2.1/2.2 patterns
- Hardware requirements and optimization guidelines
- Integration examples with the Diffusers library
- Performance-tuning recommendations
- Directory structure prepared for model download
- Links to official WAN resources and related models

### Future Updates
- Add actual model file documentation when the WAN 2.5 VAE is released
- Update specifications with confirmed architecture details
- Add benchmark results and performance comparisons
- Include official usage examples from the WAN team
- Document any audio-visual integration features