wangkanai committed on
Commit 00697ac · verified · 1 Parent(s): 708f9ec

Add files using upload-large-folder tool

Files changed (1): README.md +437 -0
README.md ADDED
@@ -0,0 +1,437 @@
---
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: text-to-video
tags:
- video-generation
- vae
- wan
- autoencoder
- latent-space
- video-compression
- wan2.5
base_model: Wan-AI/Wan2.5
base_model_relation: component
---

<!-- README Version: v1.0 -->

# WAN25 VAE - Video Autoencoder v1.0

⚠️ **Repository Status**: This repository is currently a placeholder for WAN 2.5 VAE models. The directory structure is prepared, but the model files have not yet been downloaded.

High-performance Variational Autoencoder (VAE) component for the WAN 2.5 video generation system. This VAE encodes video content into a compact latent space and decodes it back with high quality, enabling video generation with reduced computational requirements.

## Model Description

The WAN25-VAE is the next-generation variational autoencoder for video content processing in the WAN 2.5 video generation pipeline. Building on the WAN 2.1 and WAN 2.2 VAE architectures, it compresses video frames into a compact latent representation and reconstructs them with high fidelity, enabling efficient text-to-video and image-to-video workflows.

### Key Capabilities (Expected)

- **Advanced Video Compression**: Efficient encoding of video frames into latent space representations with improved compression ratios
- **High-Fidelity Reconstruction**: Accurate decoding back to pixel space with minimal quality loss
- **Temporal Coherence**: Enhanced consistency across video frames during encoding/decoding
- **Memory Efficiency**: Reduced VRAM requirements during video generation inference
- **Pipeline Integration**: Integrates seamlessly with WAN 2.5 video generation models
- **Native Audio Support**: Expected integration with audio-visual generation capabilities

### Technical Highlights

- Architecture optimized for temporal video data with spatio-temporal convolutions
- 3D causal VAE design ensuring temporal coherence
- Supports various frame rates and resolutions (480P, 720P, 1080P)
- Expected compression-ratio improvements over the WAN 2.2 VAE (4×16×16)
- Low-latency encoding/decoding for real-time applications
- Precision-optimized for stable inference on consumer hardware
### WAN VAE Evolution

| Version | Compression Ratio (T×H×W) | Key Features |
|---------|---------------------------|--------------|
| **WAN 2.1 VAE** | 4×8×8 | Initial 3D causal VAE, efficient 1080P encoding |
| **WAN 2.2 VAE** | 4×16×16 | Enhanced compression (64x overall), improved quality |
| **WAN 2.5 VAE** | TBD | Expected: audio-visual integration, further optimizations |

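As a rough guide to what these ratios mean in practice, here is a small illustrative sketch. It assumes the WAN 2.1/2.2 convention that the first frame of a causal clip is kept whole (so T frames map to 1 + (T - 1) / t_ratio latent frames); the actual WAN 2.5 behavior is unconfirmed:

```python
# Illustrative latent-grid arithmetic for the compression ratios above.
# Assumption: WAN-style causal handling of the first frame.
def latent_grid(frames, height, width, ratio=(4, 16, 16)):
    t, h, w = ratio
    return (1 + (frames - 1) // t, height // h, width // w)

print(latent_grid(49, 720, 1280))              # WAN 2.2-style: (13, 45, 80)
print(latent_grid(49, 720, 1280, (4, 8, 8)))   # WAN 2.1-style: (13, 90, 160)
```
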
## Repository Contents

```
wan25-vae/
└── vae/
    └── wan/
        └── (Model files pending download)
```

**Current Status**: Directory structure prepared, awaiting model file downloads.

### Expected File Structure

| File | Expected Size | Description |
|------|---------------|-------------|
| `wan25-vae.safetensors` | ~1.5-2.0 GB | WAN25 VAE model weights in safetensors format |
| `config.json` | ~1-5 KB | Model configuration and architecture parameters |

## Hardware Requirements

### Minimum Requirements (Estimated)
- **VRAM**: 2-3 GB (VAE inference only)
- **System RAM**: 4 GB
- **Disk Space**: 2.5 GB free
- **GPU**: CUDA-compatible GPU (NVIDIA) or compatible accelerator

### Recommended Specifications
- **VRAM**: 6+ GB for comfortable operation with the video generation pipeline
- **System RAM**: 16+ GB
- **GPU**: NVIDIA RTX 3060 or better; RTX 4060+ recommended
- **Storage**: SSD for faster model loading

### Performance Notes
- VAE operations are typically memory-bound rather than compute-bound
- Larger batch sizes require proportionally more VRAM (see the sketch below)
- CPU inference is possible but significantly slower (roughly 30-50x)
- WAN 2.5 may include audio processing requiring additional compute

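A back-of-the-envelope sketch of why batch size drives VRAM. This counts raw pixel buffers only; intermediate encoder activations typically multiply the figure several-fold:

```python
# Rough VRAM estimate for a raw frame batch (assumptions: RGB input, FP16 storage).
def frame_batch_mib(batch, height, width, channels=3, bytes_per_el=2):
    return batch * channels * height * width * bytes_per_el / 2**20

# 8 frames at 720P in FP16 is ~42 MiB before any intermediate activations.
print(f"{frame_batch_mib(8, 720, 1280):.1f} MiB")
```
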
## Usage Examples

### Basic Usage with Diffusers (Placeholder)

```python
import torch
from diffusers import AutoencoderKL

# Load the WAN25 VAE (when available)
vae_path = r"E:\huggingface\wan25-vae\vae\wan"
vae = AutoencoderKL.from_pretrained(
    vae_path,
    torch_dtype=torch.float16
)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
vae = vae.to(device)

# Encode video frames to latent space
# video_frames: user-supplied tensor of shape [batch, channels, height, width]
# (frames stacked along the batch dimension)
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

# Decode latents back to pixel space
with torch.no_grad():
    decoded_frames = vae.decode(latents / vae.config.scaling_factor).sample
```

### Integration with WAN 2.5 Video Generation Pipeline

```python
import torch
from diffusers import DiffusionPipeline

# Load the WAN 2.5 video generation pipeline with the custom VAE
pipeline = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.5-T2V",  # Example WAN 2.5 model path
    vae=vae,              # Use the loaded WAN25-VAE
    torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

# Generate video from a text prompt
prompt = "A serene sunset over mountains with flowing clouds and ambient nature sounds"
video_frames = pipeline(
    prompt=prompt,
    num_frames=48,  # WAN 2.5 may support longer sequences
    height=720,
    width=1280,
    num_inference_steps=50
).frames
```

### Memory-Efficient Video Processing

```python
import torch

# Enable memory-efficient attention for large videos
vae.enable_xformers_memory_efficient_attention()

# Process video in smaller chunks (reuses `vae` and `device` from the basic-usage example)
def encode_video_chunks(video_tensor, chunk_size=8):
    """Encode video frames in chunks to reduce peak VRAM usage."""
    latents = []
    for i in range(0, video_tensor.shape[0], chunk_size):
        chunk = video_tensor[i:i + chunk_size].to(device)
        with torch.no_grad():
            chunk_latents = vae.encode(chunk).latent_dist.sample()
        latents.append(chunk_latents.cpu())
    return torch.cat(latents, dim=0)
```

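Diffusers VAEs also expose built-in memory savers that achieve a similar effect; whether the WAN 2.5 VAE class keeps them is an assumption based on `AutoencoderKL`:

```python
# Built-in memory savers on diffusers' AutoencoderKL (assumed to carry over
# to the WAN 2.5 VAE class).
vae.enable_slicing()  # encode/decode one batch element at a time
vae.enable_tiling()   # split large frames into overlapping tiles
```
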
### Advanced Latent Space Operations

```python
import torch
import numpy as np

# Encode the input video (input_frames: user-supplied, as in the basic-usage example)
with torch.no_grad():
    latents = vae.encode(input_frames).latent_dist.sample()

# Apply transformations in latent space (e.g., interpolation)
latents_start = latents[0]
latents_end = latents[-1]

# Create a smooth interpolation between the first and last frames
interpolated_latents = []
for alpha in np.linspace(0, 1, 24):
    interpolated = (1 - alpha) * latents_start + alpha * latents_end
    interpolated_latents.append(interpolated)

# Decode the interpolated latents
with torch.no_grad():
    smooth_video = vae.decode(torch.stack(interpolated_latents)).sample
```


## Model Specifications

### Architecture Details (Expected)
- **Model Type**: Spatio-temporal variational autoencoder (3D causal VAE)
- **Architecture**: Convolutional encoder-decoder with KL-divergence regularization
- **Input Format**: Video frames (RGB) with potential audio integration
- **Latent Dimensions**: Compressed spatial resolution with channel expansion
- **Temporal Processing**: 3D causal convolutions for temporal coherence (see the sketch after this list)
- **Activation Functions**: Mixed (SiLU, tanh for output)

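To make the "causal" part concrete, here is a minimal, hypothetical sketch of a temporally causal 3D convolution: padding is applied only on the past side of the time axis, so each output frame depends only on current and earlier frames. This illustrates the general technique, not the actual WAN 2.5 block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Conv3d whose receptive field never looks at future frames."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.pad_t = k - 1  # pad only the past side of the time axis
        self.conv = nn.Conv3d(c_in, c_out, k, padding=(0, k // 2, k // 2))

    def forward(self, x):  # x: [batch, channels, time, height, width]
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))  # pad order: (W, W, H, H, T, T)
        return self.conv(x)

out = CausalConv3d(3, 16)(torch.randn(1, 3, 8, 64, 64))
print(out.shape)  # torch.Size([1, 16, 8, 64, 64])
```
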
### Technical Specifications
- **Format**: SafeTensors (secure, efficient binary format)
- **Precision**: Mixed-precision compatible (FP16/FP32/BF16)
- **Framework**: PyTorch-based, compatible with the Diffusers library
- **Parameters**: Estimated ~400-500M (based on the WAN 2.2 progression)
- **Compression Ratio**: Expected improvements over WAN 2.2's 4×16×16
- **Perceptual Optimization**: Pre-trained perceptual networks for quality preservation

### Supported Input Resolutions
- **Standard**: 480P (854×480), 720P (1280×720), 1080P (1920×1080)
- **Aspect Ratios**: 16:9, 4:3, 1:1, and custom ratios
- **Frame Rates**: 24 fps, 30 fps, 60 fps support expected

## Performance Tips and Optimization

### Memory Optimization
```python
# Enable gradient checkpointing for training (if fine-tuning)
vae.enable_gradient_checkpointing()

# Use float16 for inference to reduce VRAM usage
vae = vae.half()

# Process frames in batches
batch_size = 4  # Adjust based on available VRAM

# CPU offloading is a pipeline-level feature in diffusers
pipeline.enable_model_cpu_offload()
```

### Speed Optimization
```python
# Compile the model with torch.compile (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")

# Use channels_last memory format for better performance
vae = vae.to(memory_format=torch.channels_last)

# Enable TF32 on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Use xFormers for memory-efficient attention
vae.enable_xformers_memory_efficient_attention()
```

### Quality vs Speed Trade-offs
- **High Quality**: FP32 precision, larger batch sizes, tiling disabled
- **Balanced**: FP16 precision, moderate batch sizes (4-8 frames)
- **Fast Inference**: FP16 precision, smaller batches (1-2 frames), tiling enabled
- **Ultra Fast**: BF16 precision, aggressive tiling, model compilation

### Best Practices
- Always use the safetensors format for security and compatibility
- Monitor VRAM usage with `torch.cuda.memory_allocated()`
- Clear cache between large operations: `torch.cuda.empty_cache()`
- Use mixed-precision training if fine-tuning the VAE
- Validate reconstruction quality with perceptual metrics (LPIPS, SSIM, PSNR); see the sketch below
- Consider video-specific quality metrics (VMAF, VQM)

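A minimal sketch of such a validation step, assuming `torchmetrics` is installed (`pip install torchmetrics`) and frames are scaled to [0, 1]:

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

# Stand-in tensors; in practice compare input frames with their VAE reconstructions
original = torch.rand(4, 3, 256, 256)
reconstructed = (original + 0.01 * torch.randn_like(original)).clamp(0, 1)

print(f"PSNR: {psnr(reconstructed, original):.2f} dB")
print(f"SSIM: {ssim(reconstructed, original):.4f}")
```
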
## Getting Started

### Step 1: Download WAN 2.5 VAE Model

When the WAN 2.5 VAE becomes available, download it from Hugging Face:

```python
# Using huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.5-VAE",  # Check official repo name
    local_dir="E:/huggingface/wan25-vae/vae/wan",
    allow_patterns=["*.safetensors", "*.json"]
)
```

Or use git-lfs:

```bash
cd E:/huggingface/wan25-vae/vae/wan
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.5-VAE .
```

### Step 2: Install Dependencies

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate xformers safetensors
```

### Step 3: Verify Installation

```python
import os

import torch
from diffusers import AutoencoderKL

# Check whether the model files exist
vae_path = r"E:\huggingface\wan25-vae\vae\wan"
if os.path.exists(os.path.join(vae_path, "config.json")):
    print("✓ WAN25 VAE model found")
    vae = AutoencoderKL.from_pretrained(vae_path)
    print(f"✓ Model loaded successfully with {sum(p.numel() for p in vae.parameters()) / 1e6:.1f}M parameters")
else:
    print("✗ WAN25 VAE model not found. Please download first.")
```

## License

This model is released under a custom WAN license. Please review the license terms before use:

- **Commercial Use**: Subject to WAN license terms and conditions
- **Research Use**: Generally permitted with proper attribution
- **Redistribution**: Refer to the original WAN model license
- **Modifications**: Check the license for derivative-work permissions

For complete license details, refer to the official WAN model repository or license documentation at:
- https://huggingface.co/Wan-AI
- https://wan.video/

## Citation

If you use this VAE in your research or projects, please cite:

```bibtex
@misc{wan25-vae,
  title={WAN25 VAE: Advanced Video Variational Autoencoder for WAN 2.5 Video Generation},
  author={WAN Model Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan-AI/Wan2.5-VAE}}
}
```

For the broader WAN 2.5 system:

```bibtex
@article{wan2025,
  title={Wan: Open and Advanced Large-Scale Video Generative Models},
  author={WAN Research Team},
  journal={arXiv preprint},
  year={2025}
}
```

## Related Resources

### Official Links
- **WAN Official Website**: [https://wan.video/](https://wan.video/)
- **WAN 2.5 Announcement**: [https://wan25.ai/](https://wan25.ai/)
- **Hugging Face Organization**: [https://huggingface.co/Wan-AI](https://huggingface.co/Wan-AI)
- **GitHub Repository**: [https://github.com/Wan-Video](https://github.com/Wan-Video)
- **Diffusers Documentation**: [https://huggingface.co/docs/diffusers](https://huggingface.co/docs/diffusers)
- **Model Hub**: [https://huggingface.co/models](https://huggingface.co/models)

### Related WAN Models (Local Repository)
- **WAN 2.1 VAE**: `E:\huggingface\wan21-vae\` - Previous-generation VAE
- **WAN 2.2 VAE**: `E:\huggingface\wan22-vae\` - Current-generation VAE (1.4 GB)
- **WAN 2.5 FP16**: `E:\huggingface\wan25-fp16\` - Main model in FP16 precision
- **WAN 2.5 FP8**: `E:\huggingface\wan25-fp8\` - Optimized FP8 variant
- **WAN 2.5 LoRAs**: `E:\huggingface\wan25-fp16-loras\` - Enhancement modules

### Community Resources
- **WAN Community**: Discussions and examples for WAN video generation
- **Video Generation Papers**: Research on video diffusion and VAE architectures
- **Optimization Guides**: Tips for efficient video processing with VAEs
- **ArXiv Paper**: Wan: Open and Advanced Large-Scale Video Generative Models

### Compatibility
- **Required Libraries**: `torch>=2.0.0`, `diffusers>=0.21.0`, `transformers>=4.30.0` (see the version check below)
- **Compatible With**: WAN 2.5 video generation models, custom video pipelines
- **Integration Examples**: Check the Diffusers documentation for VAE integration patterns
- **Hardware**: NVIDIA GPUs with CUDA 11.8+ or 12.1+; AMD ROCm support may vary

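A quick way to check the local environment against the version floors listed above, using only the standard library:

```python
# Print installed versions of the required libraries for manual comparison.
import importlib.metadata as metadata

for pkg, minimum in {"torch": "2.0.0", "diffusers": "0.21.0", "transformers": "4.30.0"}.items():
    try:
        print(f"{pkg}: {metadata.version(pkg)} (need >= {minimum})")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed (need >= {minimum})")
```
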
## Technical Support

For technical issues, questions, or contributions:

1. **Model Issues**: Report to the WAN-AI Hugging Face repository
2. **Integration Questions**: Consult the Diffusers documentation and community
3. **Performance Optimization**: Check the PyTorch performance tuning guides
4. **Local Setup**: Verify CUDA installation and GPU compatibility
5. **Community Support**: WAN Discord/Forum (check the official website)

## Troubleshooting

### Common Issues

**Model Not Found Error:**
```python
# Ensure model files are downloaded to the correct path
# Expected location: E:\huggingface\wan25-vae\vae\wan\
```

**VRAM Out of Memory:**
```python
# Reduce batch size and use FP16 precision
vae = vae.half()
# CPU offloading is enabled on the pipeline, not the VAE
pipeline.enable_model_cpu_offload()
```

**Slow Inference Speed:**
```python
# Enable xFormers and model compilation
vae.enable_xformers_memory_efficient_attention()
vae = torch.compile(vae)
```

---

**Version**: v1.0
**Last Updated**: 2025-10-13
**Model Format**: SafeTensors (when available)
**Repository Status**: Placeholder - Awaiting model download
**Expected Model Size**: ~1.5-2.0 GB

## Changelog

### v1.0 (Initial Documentation - 2025-10-13)
- Initial placeholder documentation for the WAN25-VAE repository
- Comprehensive usage examples based on WAN 2.1/2.2 patterns
- Hardware requirements and optimization guidelines
- Integration examples with the Diffusers library
- Performance-tuning recommendations
- Directory structure prepared for model download
- Links to official WAN resources and related models

### Future Updates
- Add actual model file documentation when the WAN 2.5 VAE is released
- Update specifications with confirmed architecture details
- Add benchmark results and performance comparisons
- Include official usage examples from the WAN team
- Document any audio-visual integration features