wangkanai committed on
Commit
2e108f0
·
verified ·
1 Parent(s): 3a69f35

Add files using upload-large-folder tool

Files changed (2)
  1. README.md +301 -0
  2. vae/wan/wan22-vae.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,301 @@
1
+ <!-- README Version: v1.0 -->
2
+
3
+ ---
4
+ license: other
5
+ license_name: wan-license
6
+ library_name: diffusers
7
+ pipeline_tag: text-to-video
8
+ tags:
9
+ - video-generation
10
+ - vae
11
+ - wan
12
+ - autoencoder
13
+ - latent-space
14
+ - video-compression
15
+ base_model: wan-model/wan
16
+ base_model_relation: component
17
+ ---
18
+
19
+ # WAN22 VAE - Video Autoencoder v1.0
20
+
21
+ Variational Autoencoder (VAE) component for the WAN video generation system. It encodes video into a latent space and decodes it back to pixels, so the generation model can work on a compact representation at lower computational cost.
22
+
23
+ ## Model Description
24
+
25
+ The WAN22-VAE is a specialized variational autoencoder designed for video content processing in the WAN video generation pipeline. It compresses video frames into a compact latent representation and reconstructs them with high fidelity, enabling efficient text-to-video and image-to-video generation workflows.
26
+
27
+ ### Key Capabilities
28
+
29
+ - **Video Compression**: Efficient encoding of video frames into latent space representations
30
+ - **High Fidelity Reconstruction**: Accurate decoding back to pixel space with minimal quality loss
31
+ - **Temporal Coherence**: Maintains consistency across video frames during encoding/decoding
32
+ - **Memory Efficient**: Reduces VRAM requirements during video generation inference
33
+ - **Compatible Pipeline Integration**: Seamlessly integrates with WAN video generation models
34
+
35
+ ### Technical Highlights
36
+
37
+ - Optimized architecture for temporal video data processing
38
+ - Supports various frame rates and resolutions
39
+ - Low latency encoding/decoding for real-time applications
40
+ - Precision-optimized for stable inference on consumer hardware
41
+
42
+ ## Repository Contents
43
+
44
+ ```
45
+ wan22-vae/
46
+ └── vae/
47
+     └── wan/
48
+         └── wan22-vae.safetensors  # 1.41 GB - Main VAE model weights
49
+ ```
50
+
51
+ **Total Repository Size**: ~1.4 GB
52
+
53
+ ### File Details
54
+
55
+ | File | Size | Description |
56
+ |------|------|-------------|
57
+ | `wan22-vae.safetensors` | 1.41 GB (1,409,400,960 bytes) | WAN22 VAE model weights in safetensors format |
58
+
59
+ ## Hardware Requirements
60
+
61
+ ### Minimum Requirements
62
+ - **VRAM**: 2 GB (VAE inference only)
63
+ - **System RAM**: 4 GB
64
+ - **Disk Space**: 1.5 GB free space
65
+ - **GPU**: CUDA-compatible GPU (NVIDIA) or compatible accelerator
66
+
67
+ ### Recommended Specifications
68
+ - **VRAM**: 4+ GB for comfortable operation with video generation pipeline
69
+ - **System RAM**: 16+ GB
70
+ - **GPU**: NVIDIA RTX 3060 or better
71
+ - **Storage**: SSD for faster model loading
72
+
73
+ ### Performance Notes
74
+ - VAE operations are typically memory-bound rather than compute-bound
75
+ - Larger batch sizes require proportionally more VRAM
76
+ - CPU inference is possible but significantly slower (30-50x)
77
+
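As a rough illustration of the batch-size note above, the sketch below estimates the memory taken by the input tensor alone. The helper name `tensor_bytes` is hypothetical, and the estimate deliberately ignores model weights and intermediate activations, which usually dominate real VRAM usage.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element=2):
    """Bytes needed to hold one dense [batch, channels, height, width] tensor."""
    return batch * channels * height * width * bytes_per_element


# FP16 uses 2 bytes per element; FP32 would use 4.
fp16_batch4 = tensor_bytes(4, 3, 512, 512)
fp16_batch8 = tensor_bytes(8, 3, 512, 512)

print(fp16_batch4 / 2**20, "MiB for a batch of 4 RGB frames at 512x512")
print(fp16_batch8 / fp16_batch4, "x the memory when the batch size doubles")
```

Doubling the batch doubles this figure, which is the proportionality the note refers to; actual peak usage should still be measured on the target GPU.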
78
+ ## Usage Examples
79
+
80
+ ### Basic Usage with Diffusers
81
+
82
+ ```python
83
+ import torch
84
+ from diffusers import AutoencoderKL
85
+
86
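+ # Load the WAN22 VAE (from_pretrained assumes the directory also contains a Diffusers config.json; this repo ships only the .safetensors file)
87
+ vae_path = r"E:\huggingface\wan22-vae\vae\wan"
88
+ vae = AutoencoderKL.from_pretrained(
89
+     vae_path,
90
+     torch_dtype=torch.float16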
91
+ )
92
+
93
+ # Move to GPU
94
+ device = "cuda" if torch.cuda.is_available() else "cpu"
95
+ vae = vae.to(device)
96
+
97
+ # Encode video frames to latent space
98
+ # video_frames: tensor of shape [batch, channels, height, width]
99
+ with torch.no_grad():
100
+     latents = vae.encode(video_frames).latent_dist.sample()
101
+     latents = latents * vae.config.scaling_factor
102
+
103
+ # Decode latents back to pixel space
104
+ with torch.no_grad():
105
+     decoded_frames = vae.decode(latents / vae.config.scaling_factor).sample
106
+ ```
107
+
108
+ ### Integration with WAN Video Generation Pipeline
109
+
110
+ ```python
111
+ import torch
112
+ from diffusers import DiffusionPipeline
113
+
114
+ # Load WAN video generation pipeline with custom VAE
115
+ pipeline = DiffusionPipeline.from_pretrained(
116
+     "wan-model/wan-base",  # Replace with actual WAN model path
116
+     vae=vae,  # Use the loaded WAN22-VAE
117
+     torch_dtype=torch.float16
119
+ )
120
+ pipeline = pipeline.to("cuda")
121
+
122
+ # Generate video from text prompt
123
+ prompt = "A serene sunset over mountains with flowing clouds"
124
+ video_frames = pipeline(
125
+     prompt=prompt,
125
+     num_frames=24,
126
+     height=512,
127
+     width=512,
128
+     num_inference_steps=50
130
+ ).frames
131
+ ```
132
+
133
+ ### Memory-Efficient Video Processing
134
+
135
+ ```python
136
+ import torch
137
+
138
+ # Enable memory-efficient attention for large videos
139
+ vae.enable_xformers_memory_efficient_attention()
140
+
141
+ # Process video in smaller chunks
142
+ def encode_video_chunks(video_tensor, chunk_size=8):
143
+     """Encode video frames in chunks to reduce VRAM usage"""
143
+     latents = []
144
+     for i in range(0, video_tensor.shape[0], chunk_size):
145
+         chunk = video_tensor[i:i+chunk_size].to(device)
146
+         with torch.no_grad():
147
+             chunk_latents = vae.encode(chunk).latent_dist.sample()
148
+         latents.append(chunk_latents.cpu())
149
+     return torch.cat(latents, dim=0)
151
+ ```
152
+
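The chunked loop above relies on Python slicing to handle a final chunk that is shorter than `chunk_size`. The following minimal sketch shows the index ranges such a loop visits; `chunk_ranges` is a hypothetical helper added for illustration and is not part of Diffusers or this repository.

```python
def chunk_ranges(num_frames, chunk_size=8):
    """Return (start, stop) index pairs that cover range(num_frames) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


# 20 frames in chunks of 8: the last chunk holds the 4 remaining frames.
print(chunk_ranges(20, 8))  # [(0, 8), (8, 16), (16, 20)]
```

Each pair maps to one `video_tensor[start:stop]` slice, so no frame is skipped or encoded twice.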
153
+ ### Custom Latent Space Manipulation
154
+
155
+ ```python
156
+ import torch
157
+ import numpy as np
158
+
159
+ # Encode input video
160
+ latents = vae.encode(input_frames).latent_dist.sample()
161
+
162
+ # Apply transformations in latent space (e.g., interpolation)
163
+ latents_start = latents[0]
164
+ latents_end = latents[-1]
165
+
166
+ # Create smooth interpolation between frames
167
+ interpolated_latents = []
168
+ for alpha in np.linspace(0, 1, 16):
169
+     interpolated = (1 - float(alpha)) * latents_start + float(alpha) * latents_end
169
+     interpolated_latents.append(interpolated)
171
+
172
+ # Decode interpolated latents
173
+ smooth_video = vae.decode(torch.stack(interpolated_latents)).sample
174
+ ```
175
+
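The interpolation above is a plain linear blend, `(1 - alpha) * start + alpha * end`. As a small sketch in pure Python, the weights it applies at each step can be listed as follows; `lerp_weights` is a hypothetical name used only for this illustration.

```python
def lerp_weights(steps):
    """Return `steps` (start_weight, end_weight) pairs for alpha evenly spaced in [0, 1]."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    return [(1 - i / (steps - 1), i / (steps - 1)) for i in range(steps)]


weights = lerp_weights(16)
print(weights[0])   # (1.0, 0.0): the first output equals the start latent
print(weights[-1])  # (0.0, 1.0): the last output equals the end latent
```

Because the two weights sum to one at every step, the blend stays between the two endpoint latents. Whether the decoded frames look like a smooth transition depends on the model and is not guaranteed by the arithmetic.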
176
+ ## Model Specifications
177
+
178
+ ### Architecture Details
179
+ - **Model Type**: Variational Autoencoder (VAE)
180
+ - **Architecture**: Convolutional encoder-decoder with KL divergence regularization
181
+ - **Input Format**: Video frames (RGB or grayscale)
182
+ - **Latent Dimensions**: Compressed spatial resolution with channel expansion
183
+ - **Activation Functions**: Mixed (SiLU, tanh for output)
184
+
185
+ ### Technical Specifications
186
+ - **Format**: SafeTensors (secure, efficient binary format)
187
+ - **Precision**: Mixed precision compatible (FP16/FP32)
188
+ - **Framework**: PyTorch-based, compatible with Diffusers library
189
+ - **Parameters**: Roughly 350M if the 1.41 GB file stores FP32 weights (about 700M if FP16); the stored precision is not stated here
190
+ - **Compression Ratio**: Approximately 8x spatial compression per dimension
191
+
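Taking the stated "approximately 8x spatial compression per dimension" at face value, the latent height and width follow from integer division. The sketch below assumes that factor and a hypothetical helper name `latent_size`; the true factor and channel count should be read from the model configuration rather than from this example.

```python
def latent_size(height, width, spatial_factor=8):
    """Latent (height, width) for an input frame, given a per-dimension compression factor."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be multiples of the spatial factor")
    return height // spatial_factor, width // spatial_factor


for side in (256, 512, 768, 1024):
    print(side, "->", latent_size(side, side))
```

Under this assumption a 512x512 frame maps to a 64x64 latent grid, which is 64 times fewer spatial positions than the input.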
192
+ ### Supported Input Resolutions
193
+ - **Standard**: 512x512, 768x768
194
+ - **Extended**: 256x256 to 1024x1024 (depending on VRAM)
195
+ - **Aspect Ratios**: Square and common video ratios (16:9, 4:3)
196
+
197
+ ## Performance Tips and Optimization
198
+
199
+ ### Memory Optimization
200
+ ```python
201
+ # Enable gradient checkpointing for training (if fine-tuning)
202
+ vae.enable_gradient_checkpointing()
203
+
204
+ # Use float16 for inference to reduce VRAM usage
205
+ vae = vae.half()
206
+
207
+ # Process frames in batches
208
+ batch_size = 4 # Adjust based on available VRAM
209
+ ```
210
+
211
+ ### Speed Optimization
212
+ ```python
213
+ # Compile model with torch.compile (PyTorch 2.0+)
214
+ vae = torch.compile(vae, mode="reduce-overhead")
215
+
216
+ # Use channels_last memory format for better performance
217
+ vae = vae.to(memory_format=torch.channels_last)
218
+
219
+ # Enable TF32 on Ampere+ GPUs
220
+ torch.backends.cuda.matmul.allow_tf32 = True
221
+ torch.backends.cudnn.allow_tf32 = True
222
+ ```
223
+
224
+ ### Quality vs Speed Trade-offs
225
+ - **High Quality**: FP32 precision, tiling disabled (batch size affects throughput and VRAM, not output quality)
226
+ - **Balanced**: FP16 precision, moderate batch sizes (4-8 frames)
227
+ - **Fast Inference**: FP16 precision, smaller batches (1-2 frames), enable tiling
228
+
229
+ ### Best Practices
230
+ - Always use safetensors format for security and compatibility
231
+ - Monitor VRAM usage with `torch.cuda.memory_allocated()`
232
+ - Clear cache between large operations: `torch.cuda.empty_cache()`
233
+ - Use mixed precision training if fine-tuning the VAE
234
+ - Validate reconstruction quality with perceptual metrics (LPIPS, SSIM)
235
+
236
+ ## License
237
+
238
+ This model is released under a custom WAN license. Please review the license terms before use:
239
+
240
+ - **Commercial Use**: Subject to WAN license terms
241
+ - **Research Use**: Generally permitted with attribution
242
+ - **Redistribution**: Refer to original WAN model license
243
+ - **Modifications**: Check license for derivative work permissions
244
+
245
+ For complete license details, refer to the original WAN model repository or license documentation.
246
+
247
+ ## Citation
248
+
249
+ If you use this VAE in your research or projects, please cite:
250
+
251
+ ```bibtex
252
+ @misc{wan22-vae,
253
+ title={WAN22 VAE: Video Variational Autoencoder for WAN Video Generation},
254
+ author={WAN Model Team},
255
+ year={2024},
256
+ publisher={Hugging Face},
257
+ howpublished={\url{https://huggingface.co/wan-model/wan22-vae}}
258
+ }
259
+ ```
260
+
261
+ ## Related Resources
262
+
263
+ ### Official Links
264
+ - **WAN Base Model**: [WAN Model Repository](https://huggingface.co/wan-model)
265
+ - **Diffusers Documentation**: [https://huggingface.co/docs/diffusers](https://huggingface.co/docs/diffusers)
266
+ - **Model Hub**: [https://huggingface.co/models](https://huggingface.co/models)
267
+
268
+ ### Community Resources
269
+ - **WAN Community**: Discussions and examples for WAN video generation
270
+ - **Video Generation Papers**: Research on video diffusion and VAE architectures
271
+ - **Optimization Guides**: Tips for efficient video processing with VAEs
272
+
273
+ ### Compatibility
274
+ - **Required Libraries**: `torch>=2.0.0`, `diffusers>=0.21.0`, `transformers`
275
+ - **Compatible With**: WAN video generation models, custom video pipelines
276
+ - **Integration Examples**: Check Diffusers documentation for VAE integration patterns
277
+
278
+ ## Technical Support
279
+
280
+ For technical issues, questions, or contributions:
281
+
282
+ 1. **Model Issues**: Report to original WAN model repository
283
+ 2. **Integration Questions**: Consult Diffusers documentation and community
284
+ 3. **Performance Optimization**: Check PyTorch performance tuning guides
285
+ 4. **Local Setup**: Verify CUDA installation and GPU compatibility
286
+
287
+ ---
288
+
289
+ **Version**: v1.0
290
+ **Last Updated**: 2025-10-13
291
+ **Model Format**: SafeTensors
292
+ **Total Size**: 1.4 GB
293
+
294
+ ## Changelog
295
+
296
+ ### v1.0 (Initial Release)
297
+ - Initial documentation for WAN22-VAE model
298
+ - Comprehensive usage examples for video encoding/decoding
299
+ - Hardware requirements and optimization guidelines
300
+ - Integration examples with Diffusers library
301
+ - Performance tuning recommendations
vae/wan/wan22-vae.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e40321bd36b9709991dae2530eb4ac303dd168276980d3e9bc4b6e2b75fed156
3
+ size 1409400960
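
The three lines above form a Git LFS pointer: the spec version, the SHA-256 object ID, and the size in bytes. As a minimal sketch, the pointer can be parsed to check the file size quoted in the README; `parse_lfs_pointer` is a hypothetical helper written for this illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:e40321bd36b9709991dae2530eb4ac303dd168276980d3e9bc4b6e2b75fed156
size 1409400960
"""

info = parse_lfs_pointer(POINTER)
print(f"{info['size'] / 1e9:.2f} GB, {info['size'] / 2**30:.2f} GiB")  # 1.41 GB, 1.31 GiB
```

After downloading, the same digest can be compared against `hashlib.sha256` computed over the file's bytes to confirm the weights arrived intact.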