---
language:
- en
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- text-to-video
- diffusion
- video-diffusion
- camera-control
- lora
- wan
- wan22
- fp8
- quantized
- gguf
base_model: wan22
base_model_relation: quantized
inference: true
model-index:
- name: WAN 2.2 FP8/GGUF - I2V/T2V Models
  results:
  - task:
      type: image-to-video
      name: Image-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
  - task:
      type: text-to-video
      name: Text-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
---

# WAN 2.2 FP8/GGUF - I2V/T2V Models

High-quality image-to-video (I2V) and text-to-video (T2V) generation models in FP8 and GGUF quantized formats, with advanced camera control and enhancement LoRAs for memory-efficient deployment.

## Model Description

WAN 2.2 is a 14-billion-parameter diffusion-based video generation model. This repository contains FP8 and GGUF quantized variants that deliver near-FP16 quality with significantly reduced VRAM requirements, making the model practical on consumer-grade hardware.

**Key Features**:
- 14B-parameter diffusion-based architecture
- FP8 and GGUF quantized formats for memory efficiency (~50% smaller than FP16)
- Dedicated VAE for video latent encoding/decoding
- Extensive LoRA ecosystem for camera control (v2) and visual enhancement
- Support for both high-noise (creative) and low-noise (faithful) generation modes
- Text-to-video and image-to-video capabilities

**Model Statistics**:
- **Total Repository Size**: ~89GB
- **Model Architecture**: Diffusion-based video generation
- **Supported Formats**: `.safetensors` (FP8), `.gguf` (Q4/Q8 quantized)
- **Parameters**: 14 billion
- **Precision**: FP8 E4M3FN and GGUF Q4/Q8 quantization
- **Input**: Text prompts and/or images
- **Output**: Video sequences (typically 16-24 frames)

## Quick Start

A quick-start example for image-to-video generation with FP8:

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load the conditioning image
input_image = Image.open("your_image.jpg")

# Load the base pipeline in FP8 (requires a PyTorch build with float8 support)
pipe = DiffusionPipeline.from_pretrained(
    "base-model-path",
    torch_dtype=torch.float8_e4m3fn
)

# Swap in the WAN 2.2 video VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

# Load the FP8 I2V weights; .safetensors files must be read with
# safetensors.torch.load_file, not torch.load. (Some video pipelines
# expose this component as `transformer` rather than `unet`.)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

pipe.to("cuda")

# Generate a 16-frame clip
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames[0]  # first (and only) batch element
```

For detailed usage examples, including camera control, GGUF models, and LoRA combinations, see the [Usage](#usage) section below.

## Repository Structure

```
wan22-fp8/
├── diffusion_models/wan/
├── loras/wan/
└── vae/wan/
```

## Model Files

### Text-to-Video (T2V) Models

Located in `diffusion_models/wan/`

| Model | Precision | Size | VRAM Required | Use Case |
|-------|-----------|------|---------------|----------|
| `wan22-t2v-high-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, high-noise schedule |
| `wan22-t2v-low-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, low-noise schedule |

### Image-to-Video (I2V) Models

**FP8 Precision** (Balanced Quality/Performance):

| Model | Size | VRAM Required | Description |
|-------|------|---------------|-------------|
| `wan22-i2v-high-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Creative generation, higher variance |
| `wan22-i2v-low-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Faithful reproduction, consistent results |

**GGUF Quantized** (Memory Efficient):

| Model | Size | VRAM Required | Quantization | Description |
|-------|------|---------------|--------------|-------------|
| `wan22-i2v-a14b-highnoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, high-noise |
| `wan22-i2v-a14b-lownoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, low-noise |
| `wan22-i2v-a14b-gguf-a14b-high.gguf` | 15GB | 16GB+ | Q8 | Higher-precision quantization |

For loading the GGUF files, see the sketch in the [Usage](#usage) section.

### VAE

Located in `vae/wan/`

- **File**: `wan22-vae.safetensors`
- **Size**: 1.4GB
- **Purpose**: Video latent encoder/decoder for compressing video frames

### LoRAs

Located in `loras/wan/`

#### Camera Control LoRAs (v2)

| LoRA | Size | Description | Prompt Examples |
|------|------|-------------|-----------------|
| `wan22-camera-rotation-rank16-v2.safetensors` | 293MB | Rotating camera movements | "rotating camera", "camera circles around subject" |
| `wan22-camera-arcshot-rank16-v2-high.safetensors` | 293MB | Cinematic arc shots | "arc shot", "curved camera movement" |
| `wan22-camera-drone-rank16-v2.safetensors` | 293MB | Aerial drone perspectives | "aerial view", "drone shot", "bird's eye view" |
| `wan22-camera-adr1a-v1.safetensors` | 293MB | Advanced camera control | Custom camera trajectories |
| `wan22-camera-earthzoomout.safetensors` | 293MB | Earth zoom-out effects | "zooming out from earth", "planet zoom" |

#### Enhancement LoRAs

| LoRA | Size | Purpose | Effect |
|------|------|---------|--------|
| `wan22-face-naturalizer.safetensors` | 586MB | Face enhancement | More natural-looking facial movements |
| `wan22-light-volumetric.safetensors` | 293MB | Lighting effects | Volumetric lighting, god rays, atmospheric effects |
| `wan22-light-cinematicflare-i2v-low.safetensors` | 293MB | Lens flare effects | Cinematic lens flares and light blooms for I2V |
| `wan22-upscale-realismboost-t2v-14b.safetensors` | 293MB | Quality boost | Enhanced realism for T2V generation |

#### Action LoRAs

| LoRA | Size | Action Type | Application |
|------|------|-------------|-------------|
| `wan22-action-wink-i2v-v1-low-noise.safetensors` | 147MB | Facial actions | Controlled winking animations |

## Usage

### Image-to-Video Generation

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import export_to_video
from safetensors.torch import load_file
import torch
from PIL import Image

# Load the conditioning image
input_image = Image.open("path/to/your/image.jpg")

# Load the base pipeline in FP8
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load the FP8 I2V weights; .safetensors files must be read with
# safetensors.torch.load_file, not torch.load. (Some video pipelines
# expose this component as `transformer` rather than `unet`.)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

# Swap in the WAN 2.2 video VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate a 16-frame clip
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames[0]  # first (and only) batch element

# Export the frames to an MP4 file
export_to_video(video, "output.mp4", fps=8)
```

### Text-to-Video Generation

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch

# Load the base pipeline in FP8
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load the FP8 T2V weights (use safetensors, not torch.load)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-t2v-low-noise-14b-fp8-scaled.safetensors"
))

# Swap in the WAN 2.2 video VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate a 16-frame clip from text only
video = pipe(
    prompt="a cat walking through a garden, high quality, cinematic",
    num_inference_steps=50,
    num_frames=16
).frames
```
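
### Loading a GGUF Model

The GGUF checkpoints cannot be loaded with the plain `from_pretrained` flow above. A minimal sketch, assuming a recent diffusers release that ships GGUF support (`GGUFQuantizationConfig`) and a WAN-compatible transformer class; the class name and paths below are assumptions to adapt to your setup:

```python
from diffusers import DiffusionPipeline, GGUFQuantizationConfig, WanTransformer3DModel
import torch

# Load the Q4_K_S checkpoint; weights are dequantized on the fly,
# with computation carried out in bfloat16
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-a14b-lownoise-q4-k-s.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# Build the rest of the pipeline around the quantized transformer
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
```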

### Using Camera Control LoRAs

```python
# Load a camera-control LoRA on top of the pipeline
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-rotation-rank16-v2.safetensors"
)

# Trigger the camera movement through the prompt
video = pipe(
    image=input_image,
    prompt="rotating camera around a sculpture",
    num_inference_steps=50
).frames
```
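
To switch from one camera LoRA to another rather than stack them, unload the current weights first. A short sketch using the standard diffusers LoRA API:

```python
# Unload the rotation LoRA before loading a different camera LoRA;
# otherwise both adapters remain active at once
pipe.unload_lora_weights()
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors"
)
```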

### Combining Multiple LoRAs

```python
# Load each LoRA under its own adapter name
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors",
    adapter_name="camera_drone"
)
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-volumetric.safetensors",
    adapter_name="volumetric_light"
)

# Activate both adapters with individual weights
pipe.set_adapters(["camera_drone", "volumetric_light"], adapter_weights=[0.8, 0.6])

# Generate with both effects applied
video = pipe(
    image=input_image,
    prompt="aerial drone shot with volumetric lighting at sunset",
    num_inference_steps=50
).frames
```

### Cinematic Lens Flare Effects

```python
# Load the lens-flare LoRA (I2V, low-noise variant)
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-cinematicflare-i2v-low.safetensors"
)

# Trigger the effect through the prompt
video = pipe(
    image=input_image,
    prompt="cinematic lens flare, light bloom, professional cinematography",
    num_inference_steps=50
).frames
```

## Model Selection Guide

### Precision Comparison

**FP8 Models** (Available in this repo):
- ✅ 50% smaller than FP16 (14GB vs 27GB)
- ✅ Minimal quality loss compared to FP16
- ✅ Faster inference on GPUs with FP8 tensor cores
- ✅ Balanced quality/performance
- ❌ Requires 16GB+ VRAM
- 🎯 Use for: Production deployment, most users, balanced quality

**GGUF Q4_K_S** (Available in this repo):
- ✅ Smallest size (8.2GB)
- ✅ Works on 12GB VRAM GPUs
- ✅ Fastest inference
- ❌ More quality degradation than FP8
- 🎯 Use for: Memory-constrained systems, rapid prototyping, testing

**GGUF Q8** (Available in this repo):
- ✅ Medium size (15GB)
- ✅ Better quality than Q4
- ✅ Works on 16GB VRAM GPUs
- 🎯 Use for: Balance between Q4 and FP8 quality

**FP16 Models** (Not in this repo):
- See the separate wan22-fp16 repository for full-precision variants
- 27GB per model, requires 24GB+ VRAM
- Maximum quality for research and archival use
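
The decision rule above can be summarized in code. A hypothetical helper for illustration only; the thresholds mirror the hardware table later in this card, and the file names come from this repository:

```python
def pick_i2v_checkpoint(vram_gb: float, prefer_quality: bool = True) -> str:
    """Illustrative decision rule for the I2V variants in this repo."""
    if vram_gb >= 16:
        # FP8 and GGUF Q8 both fit; FP8 offers the better quality/VRAM balance
        return ("wan22-i2v-low-noise-14b-fp8-scaled.safetensors"
                if prefer_quality
                else "wan22-i2v-a14b-gguf-a14b-high.gguf")
    if vram_gb >= 12:
        return "wan22-i2v-a14b-lownoise-q4-k-s.gguf"  # Q4_K_S, 8.2GB
    raise ValueError("At least 12GB of VRAM is recommended for these models")
```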

### High-Noise vs. Low-Noise Models

**High-Noise Models**:
- More creative interpretation
- Better for abstract or stylized content
- Higher variance in outputs

**Low-Noise Models**:
- More faithful to the input/prompt
- Better for realistic content
- More consistent results

## Hardware Requirements

| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|------------|--------------|------------------|--------------|
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V GGUF Q4 | 12GB | 16GB+ | RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB |
| I2V GGUF Q8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |

**System Requirements**:
- CUDA 11.8+ or 12.1+
- PyTorch 2.1+ (with FP8 support)
- diffusers 0.20+
- 89GB free disk space (full repository)
- 32GB+ system RAM recommended

## Performance Tips

1. **Memory Optimization**:
   - Start with GGUF Q4 models on 12GB GPUs
   - Use FP8 models for 16GB+ GPUs (best quality/VRAM balance)
   - Enable CPU offloading or `torch.cuda.amp` mixed precision if needed (one-line toggles are sketched after this list)
   - Use gradient checkpointing if fine-tuning

2. **Quality Optimization**:
   - FP8 provides the best quality in this repository
   - Combine multiple LoRAs at weights of 0.6-0.8
   - Experiment with both high- and low-noise variants
   - For maximum quality, use the FP16 models from the wan22-fp16 repository

3. **Speed Optimization**:
   - Use GGUF Q4 quantized models for rapid prototyping (fastest)
   - FP8 models perform well on RTX 40-series GPUs with FP8 tensor cores
   - Reduce `num_inference_steps` to 20-30 for testing
   - Enable xformers attention: `pipe.enable_xformers_memory_efficient_attention()`

4. **GPU-Specific Tips**:
   - **RTX 40 series**: FP8 models perform excellently with native FP8 support
   - **RTX 30 series**: FP8 is still faster than FP16; use on 16GB+ cards
   - **12GB GPUs**: Use GGUF Q4 models exclusively
   - **16GB GPUs**: Choose between FP8 and GGUF Q8 based on quality needs
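
Several of the tips above are one-line toggles in diffusers. A minimal sketch; whether each call is available depends on the pipeline class and diffusers version:

```python
# Trade speed for VRAM: move idle submodules to the CPU between steps
pipe.enable_model_cpu_offload()

# More aggressive variant for very tight VRAM budgets (slower)
# pipe.enable_sequential_cpu_offload()

# Decode latents in slices to reduce peak VRAM (if the pipeline supports it)
pipe.enable_vae_slicing()

# Memory-efficient attention (requires the xformers package)
pipe.enable_xformers_memory_efficient_attention()
```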

## Prompt Guide

### Camera Movements

**Rotation**: "rotating camera", "camera circles around", "360-degree view", "orbital camera"

**Arc Shot**: "arc shot", "curved camera movement", "sweeping motion", "cinematic arc"

**Drone**: "aerial view", "drone shot", "bird's eye view", "flying camera", "overhead shot"

**Zoom**: "zooming out", "zoom in on subject", "dolly zoom"

### Lighting and Enhancement Effects

**Volumetric Lighting**: "volumetric light rays", "god rays", "atmospheric lighting", "light shafts"

**Cinematic Flare**: "lens flare", "cinematic bloom", "light bloom", "flare effects"

**Face Naturalizer**: Use with portrait videos for more realistic facial expressions and movements

## File Formats

- **`.safetensors`**: Secure tensor format, recommended for most use cases
- **`.gguf`**: Quantized format for memory-constrained environments

## Uses

### Direct Use

WAN 2.2 is designed for:
- **Content Creation**: Generate videos from text descriptions or images for creative projects, advertising, and entertainment
- **Prototyping**: Rapid video concept visualization for storyboarding and pre-production
- **Research**: Academic research in video generation, diffusion models, and controllable video synthesis
- **Application Development**: Building video generation features into applications and services

### Downstream Use

- Fine-tuning on domain-specific video datasets
- Integration with video editing pipelines
- Custom LoRA development for specialized camera movements or visual effects
- Video dataset augmentation and synthetic data generation

### Out-of-Scope Use

The model should NOT be used for:
- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content that violates copyright or intellectual property rights
- Generating content intended to harass, abuse, or discriminate
- Creating videos for illegal purposes or activities

## Bias, Risks, and Limitations

### Known Limitations

**Technical Limitations**:
- **Temporal Consistency**: May produce flickering or inconsistent motion in long sequences
- **Fine Details**: Small objects or intricate textures may lack detail or consistency
- **Physical Realism**: Generated physics may not always follow real-world rules (gravity, momentum, etc.)
- **Text Rendering**: Cannot reliably render readable text within generated videos
- **Face Quality**: Faces may show artifacts or unnatural movements (mitigated by the face-naturalizer LoRA)
- **Memory Requirements**: High VRAM requirements limit accessibility (12-32GB depending on precision)

**Content Limitations**:
- Training data biases may affect representation of diverse demographics, cultures, and scenarios
- May struggle with uncommon objects, rare scenarios, or niche content
- Camera control may not always precisely match intended movements
- Generated content may reflect biases present in the training data

### Risks and Mitigations

**Misuse Risks**:
- **Deepfakes and Misinformation**: The model could be used to create deceptive content
  - *Mitigation*: Implement content authentication, watermarking, and usage monitoring
- **Copyright Infringement**: May generate content similar to copyrighted material
  - *Mitigation*: Avoid training on copyrighted data, implement content filtering
- **Harmful Content**: Could generate disturbing or inappropriate content
  - *Mitigation*: Implement safety filters, content moderation, and responsible-use guidelines

**Ethical Considerations**:
- Users should obtain appropriate permissions before generating videos of identifiable individuals
- Generated content should be clearly labeled as AI-generated to prevent deception
- Consider the environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights

### Recommendations

- Implement content moderation and safety filters in production deployments
- Add visible/invisible watermarks to identify AI-generated content
- Provide clear disclaimers that content is AI-generated
- Monitor for misuse and implement usage policies
- Consider accessibility trade-offs when selecting model precision
- Validate outputs for unintended biases or harmful content before distribution

## Training Details

### Training Data

Training data details are not publicly available. Video diffusion models of this kind are typically trained on:
- Large-scale video datasets with diverse content
- Text-video pairs for caption conditioning
- Image-video pairs for image-to-video tasks

**Note**: Specific training dataset information should be obtained from the original model authors.

### Training Procedure

**Training Hyperparameters** (typical for models of this scale):
- Architecture: Diffusion transformer with 14B parameters
- Precision formats: FP16, FP8, GGUF quantization
- Video VAE: Separate encoder/decoder for latent compression
- LoRA adapters: Rank 16 to rank 64 for camera control

**Noise Schedules**:
- **High-noise models**: Greater noise variance for creative generation
- **Low-noise models**: Lower noise variance for faithful reproduction

### Compute Infrastructure

**Inference Requirements (This Repository)**:
- **FP8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090, RTX 4070 Ti Super)
- **GGUF Q4**: 12-16GB VRAM (NVIDIA RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB)
- **GGUF Q8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090)

## Environmental Impact

Video generation models require significant computational resources. This FP8/GGUF repository provides more efficient alternatives:

- **Model Size**: 89GB total (FP8 + GGUF variants + LoRAs)
- **Inference Power**: 100-350W depending on GPU and model precision
- **Carbon Footprint**: Varies by energy source and usage patterns
- **Efficiency**: ~40% VRAM reduction vs FP16, enabling use on consumer GPUs

**Recommendations for Reducing Impact**:
- Use GGUF Q4 quantized models for maximum efficiency (8.2GB vs 27GB FP16)
- FP8 models provide an excellent quality/efficiency balance
- Batch-process multiple requests to amortize overhead
- Use energy-efficient hardware (RTX 40 series with tensor cores)
- Use renewable energy sources when possible
- Consider carbon offsets for production deployments

## License

Please check the original WAN 2.2 model repository for the specific license terms and usage restrictions. This repository uses the "other" license tag pending clarification of the original license.

## Citation

If you use WAN 2.2 in your research or applications, please cite the original model repository.

**BibTeX** (template):
```bibtex
@misc{wan22,
  title={WAN 2.2: Image-to-Video and Text-to-Video Generation},
  author={[Original Authors]},
  year={2024},
  howpublished={\url{https://huggingface.co/[original-repo]}},
}
```

## Model Card Authors

This model card was created by the repository maintainer based on available model information and standard Hugging Face model card guidelines.

## Model Card Contact

For questions about this model card or repository, please open an issue in the repository or contact the original model authors.

## Troubleshooting

**Out-of-Memory Errors**:
- Switch from FP8 to GGUF Q4 quantized models (12GB VRAM)
- Switch from GGUF Q8 to Q4 if still out of memory
- Reduce `num_frames` (try 8 or 12 instead of 16)
- Reduce the batch size to 1
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential CPU offload: `pipe.enable_sequential_cpu_offload()`

**Quality Issues**:
- Try both the high-noise and low-noise variants
- If using GGUF Q4, try FP8 for better quality (requires 16GB+ VRAM)
- If using FP8 and you need maximum quality, see the wan22-fp16 repository
- Adjust LoRA weights (0.5-1.0 range)
- Increase `num_inference_steps` (50-100)

**Slow Generation**:
- GGUF Q4 models are the fastest for rapid iteration
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 20-30 for testing
- FP8 performs best on RTX 40-series GPUs with native support

**GGUF Model Loading Issues**:
- Ensure you are using a GGUF-compatible loader (see the sketch in the [Usage](#usage) section)
- GGUF models may require specific diffusers versions
- Check llama.cpp or GGUF-specific loading documentation

## Support

For issues, questions, or contributions, please refer to the main Hugging Face model repository.

## Related Repositories

- **wan22-fp16**: Full-precision FP16 variants (27GB per model, maximum quality)
- **wan21-fp8**: WAN 2.1 FP8 models (camera control v1, I2V only)
- **wan21-fp16**: WAN 2.1 FP16 models (camera control v1, I2V only)

## Repository Summary

This repository contains WAN 2.2 models optimized for deployment on consumer-grade hardware through FP8 and GGUF quantization:

- **89GB total** (vs 142GB for the full-precision variants)
- **FP8 models**: 14GB each, excellent quality/VRAM balance
- **GGUF Q4 models**: 8.2GB each, maximum memory efficiency
- **Camera Control v2**: Enhanced camera LoRAs vs v1 in WAN 2.1
- **10 LoRAs**: Camera control (5), lighting (2), face enhancement (1), quality boost (1), actions (1)
- **Both I2V and T2V**: Image-to-video and text-to-video capabilities

**Recommended for**: Production deployment, consumer GPUs (12GB+), balanced quality/performance needs

---

- **Last Updated**: October 2024
- **Model Version**: WAN 2.2 FP8/GGUF
- **Repository Type**: Quantized Model Weights Storage
- **Repository Size**: ~89GB