---
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- wan
- image-to-video
- video-generation
- wan21
- fp16
- 480p
- diffusion
- 14b-parameters
---

# WAN 2.1 FP16 480p - Image-to-Video Diffusion Model

High-fidelity 480p image-to-video generation model in full FP16 precision (14 billion parameters). Part of the WAN 2.1 model family for transforming static images into dynamic videos.

## Model Description

WAN 2.1 I2V 480p is a 14-billion-parameter transformer-based diffusion model that generates videos from static images. This FP16 variant provides maximum numerical precision and generation quality for research and high-quality video synthesis applications. The 480p resolution offers a balance between quality and computational cost.

**Key Capabilities**:

- Image-to-video generation with temporal coherence
- 480p video output (balanced quality/performance)
- Full FP16 precision (16-bit floating point)
- Compatible with camera control LoRAs for cinematic effects
- Suited to research and professional production workflows

## Repository Contents

```
wan21-fp16-480p/
└── diffusion_models/
    └── wan/
        └── wan21-i2v-480p-14b-fp16.safetensors (31.0 GB)
```

**Total Repository Size**: 31.0 GB

### Model Files

| File | Size | Description |
|------|------|-------------|
| `wan21-i2v-480p-14b-fp16.safetensors` | 31.0 GB | WAN 2.1 I2V 480p diffusion model (14B parameters, FP16 precision) |

## Hardware Requirements

### Minimum Requirements

- **VRAM**: 32 GB (for basic inference)
- **System RAM**: 32 GB
- **Disk Space**: 31 GB for the model file
- **GPU**: NVIDIA GPU with FP16 support; 32 GB+ cards (A6000-class) for straightforward inference, 24 GB cards (RTX 3090/4090) only with the memory optimizations described below

### Recommended Requirements

- **VRAM**: 40 GB+ (for optimal performance and batch processing)
- **System RAM**: 64 GB
- **GPU**: High-end NVIDIA GPU (A6000, A100, or better)
- **Storage**: SSD for faster model loading

### Performance Notes

- FP16 precision requires roughly twice the VRAM of the quantized (FP8) variants
- Enable memory optimization techniques on 24 GB GPUs (attention slicing, CPU offload)
- For production deployment with lower VRAM, consider the FP8 quantized variants

## Usage Examples

### Basic Image-to-Video Generation

The single-file checkpoint in this repository contains only the diffusion transformer; the VAE, text encoder, and image encoder are loaded from a diffusers-format WAN 2.1 base repository (the repository ID below assumes the official Wan-AI release; substitute your own mirror if needed).

```python
import torch
from diffusers import WanImageToVideoPipeline, WanTransformer3DModel
from diffusers.utils import export_to_video
from PIL import Image

# Load the WAN 2.1 I2V 480p FP16 transformer from the single-file checkpoint
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan21-fp16-480p/diffusion_models/wan/wan21-i2v-480p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
)

# Assemble the full pipeline around it (VAE and encoders come from the base repo)
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="smooth camera movement, cinematic lighting",
    num_frames=25,  # Wan expects 4k+1 frame counts (25, 49, 81); other values may be rounded
    num_inference_steps=50,
    guidance_scale=7.5,
).frames[0]

# Export video
export_to_video(video, "output_video.mp4", fps=8)
```

### With Memory Optimization (for lower VRAM)

```python
import torch
from diffusers import WanImageToVideoPipeline, WanTransformer3DModel
from PIL import Image

# Load the model as in the basic example
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan21-fp16-480p/diffusion_models/wan/wan21-i2v-480p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
)
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    transformer=transformer,
    torch_dtype=torch.float16,
)

# Enable memory-efficient attention and VAE slicing
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()  # or pipe.vae.enable_slicing() if your version lacks the pipeline-level helper

# For even lower VRAM usage; this manages device placement itself,
# so do NOT also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

input_image = Image.open("path/to/your/image.jpg")

# Generate video with reduced settings
video = pipe(
    image=input_image,
    prompt="your prompt here",
    num_frames=17,           # fewer frames for lower memory
    num_inference_steps=30,  # fewer steps for faster generation
    guidance_scale=7.5,
).frames[0]
```
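Before committing to a full 31 GB load, it can be worth sanity-checking the download and estimating its weight footprint; the safetensors header can be read without loading any tensor data. A minimal sketch using the `safetensors` library (the path mirrors the repository layout above; adjust it to your local copy):

```python
from math import prod

from safetensors import safe_open

path = "E:/huggingface/wan21-fp16-480p/diffusion_models/wan/wan21-i2v-480p-14b-fp16.safetensors"

total_params = 0
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # get_slice() reads only header metadata, not the tensor data itself
        shape = f.get_slice(key).get_shape()
        total_params += prod(shape)

# FP16 stores 2 bytes per parameter, so the weights alone need about
# 2x the parameter count in bytes, before activations and the VAE/text
# encoder components are accounted for.
print(f"Parameters: {total_params / 1e9:.2f}B")
print(f"Approx. weight memory: {total_params * 2 / 1024**3:.1f} GB")
```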
### With Camera Control LoRAs

```python
import torch
from diffusers import WanImageToVideoPipeline, WanTransformer3DModel
from diffusers.utils import export_to_video
from PIL import Image

# Load the base model (see "Basic Image-to-Video Generation" above)
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan21-fp16-480p/diffusion_models/wan/wan21-i2v-480p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
)
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    transformer=transformer,
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Load camera control LoRA (requires separate download)
# Example: rotation, arc shot, or drone camera movements
pipe.load_lora_weights(
    "path/to/loras",
    weight_name="wan21-camera-rotation-rank16-v1.safetensors",
)

input_image = Image.open("path/to/your/image.jpg")

# Generate with camera control
video = pipe(
    image=input_image,
    prompt="rotating camera around the subject, cinematic",
    num_frames=25,
    num_inference_steps=50,
    guidance_scale=7.5,
).frames[0]

export_to_video(video, "output_rotating.mp4", fps=8)
```

## Model Specifications

| Specification | Value |
|--------------|-------|
| **Architecture** | Transformer-based image-to-video diffusion model |
| **Parameters** | 14 billion |
| **Precision** | FP16 (16-bit floating point) |
| **Resolution** | 480p (video output) |
| **Format** | SafeTensors |
| **Model Size** | 31.0 GB |
| **Task** | Image-to-video generation |
| **Library** | diffusers |
| **Compatible LoRAs** | WAN 2.1 camera control LoRAs (rotation, arc shot, drone) |

### Technical Details

- **FP16 Format**: 1 sign bit, 5-bit exponent, 10-bit mantissa
- **Numerical Range**: ±65,504 (maximum representable value)
- **Precision**: ~3.3 significant decimal digits
- **Quality**: Full precision, no quantization artifacts
- **Compatibility**: All modern PyTorch versions with CUDA support

## Installation

```bash
# Install required dependencies
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install diffusers transformers accelerate safetensors pillow

# For video export
pip install opencv-python imageio imageio-ffmpeg
```

### Requirements

- Python 3.8+
- PyTorch 2.0+
- diffusers >= 0.33 (first release with the Wan pipelines)
- transformers
- accelerate
- safetensors
- PIL/Pillow
- CUDA 11.8+ (or a compatible version)

## Performance Tips

1. **Memory Optimization**
   - Enable `enable_attention_slicing()` and `enable_vae_slicing()` for lower VRAM usage
   - Use `enable_model_cpu_offload()` on 24 GB GPUs
   - Reduce `num_frames` and `num_inference_steps` for faster generation

2. **Quality Optimization**
   - Use a `guidance_scale` between 7.0 and 9.0 for best results
   - Higher `num_inference_steps` (50-75) improves quality but increases generation time
   - Experiment with different sampling schedulers (DDIM, DPM++, Euler); see the sketch after this list

3. **Speed Optimization**
   - Use fewer inference steps (25-30) for faster generation
   - Reduce the frame count for shorter videos
   - Consider the FP8 quantized variants for production deployment

4. **Prompt Engineering**
   - Include motion descriptions: "smooth movement", "slow pan", "camera tracking"
   - Specify lighting: "cinematic lighting", "natural light", "dramatic shadows"
   - Add quality tokens: "high quality", "detailed", "professional"
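As a concrete example of the scheduler experimentation suggested in tip 2, the sketch below swaps in `UniPCMultistepScheduler`, which the diffusers Wan documentation pairs with a `flow_shift` value tuned to the output resolution; treat the exact value as a starting point, not a fixed recommendation. It assumes `pipe` is a pipeline loaded as in the usage examples:

```python
from diffusers import UniPCMultistepScheduler

# Rebuild the scheduler from the pipeline's existing config, overriding
# flow_shift: ~3.0 is commonly suggested for 480p output, ~5.0 for 720p.
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config,
    flow_shift=3.0,
)
```

The same `from_config` pattern applies to the other schedulers mentioned above.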
## Version Comparison

### WAN 2.1 Variants

| Variant | Precision | Size | VRAM | Use Case |
|---------|-----------|------|------|----------|
| **FP16 480p** (this) | FP16 | 31 GB | 32 GB+ | Research, archival quality |
| FP16 720p | FP16 | 31 GB | 40 GB+ | Maximum quality output |
| FP8 480p | FP8 | ~16 GB | 18 GB+ | Production, deployment |
| FP8 720p | FP8 | ~16 GB | 24 GB+ | Production, high quality |

### Precision Trade-offs

**FP16 Advantages**:

- Maximum generation quality
- Full numerical precision
- No quantization artifacts
- Research standard

**FP16 Disadvantages**:

- Higher VRAM requirements (2x vs. FP8)
- Larger file size (2x vs. FP8)
- Slower inference than FP8 on GPUs with FP8 tensor-core support (Ada, Hopper)
- Higher deployment costs

### When to Use FP16 480p

- Research and development
- Quality benchmarking
- Archival/professional production
- A GPU with 32 GB+ VRAM is available
- Maximum quality requirements

### When to Consider Alternatives

- **FP8 variants**: Production deployment, VRAM constraints, batch processing
- **720p variants**: Higher resolution requirements
- **WAN 2.2**: Enhanced camera controls, quality improvements

## Compatibility

### Compatible Components

- **VAE**: WAN 2.1 VAE (separate download required)
- **LoRAs**: WAN 2.1 camera control LoRAs
  - Camera rotation (rank-16)
  - Arc shot (rank-16)
  - Drone shot (rank-16)
- **Frameworks**: diffusers, ComfyUI (with appropriate nodes)

### Camera Control LoRAs

This model is compatible with WAN 2.1 camera control LoRAs for cinematic effects:

- **Rotation**: Orbital camera movements around subjects
- **Arc Shot**: Smooth curved dolly movements
- **Drone**: Aerial and elevated perspectives

*Note: LoRAs are not included and must be downloaded separately.*
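When switching between or blending these LoRAs, diffusers' named-adapter API keeps each one independently weighted. A sketch, assuming `pipe` is loaded as in the usage examples and the files have been downloaded locally (the arc-shot file name is illustrative):

```python
# Register each camera LoRA under its own adapter name
pipe.load_lora_weights(
    "path/to/loras",
    weight_name="wan21-camera-rotation-rank16-v1.safetensors",
    adapter_name="rotation",
)
pipe.load_lora_weights(
    "path/to/loras",
    weight_name="wan21-camera-arcshot-rank16-v1.safetensors",  # illustrative file name
    adapter_name="arc_shot",
)

# Activate one adapter (or blend several) with per-adapter weights
pipe.set_adapters(["rotation"], adapter_weights=[0.9])

# Later, drop all LoRA influence without reloading the base model
pipe.unload_lora_weights()
```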
## License

This model is distributed under a custom WAN license (`wan-license`). Review the official WAN license terms before use; they may differ from standard open-source licenses and may include restrictions on commercial use, redistribution, or specific applications.

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@software{wan21_i2v_480p_fp16,
  title={WAN 2.1 Image-to-Video 480p FP16},
  year={2024},
  note={14B parameter image-to-video diffusion model in full FP16 precision},
  url={https://huggingface.co/wan21-fp16-480p}
}
```

## Related Resources

### WAN Model Family

- **WAN 2.1 FP16 720p** - Higher-resolution variant (31 GB, 40 GB+ VRAM)
- **WAN 2.1 FP8** - Quantized variants for efficient deployment (~50% smaller)
- **WAN 2.2** - Enhanced camera controls and quality improvements
- **WAN LightX2V** - CFG step-distillation adapters for faster generation

### Additional Components

- **WAN 2.1 VAE** - Video variational autoencoder (243 MB, separate download)
- **Camera Control LoRAs** - Cinematic camera movement adapters (343 MB each)
- **Enhancement LoRAs** - Lighting, face quality, and action improvements (WAN 2.2)

### Documentation

- [WAN Official Documentation](https://huggingface.co/docs/diffusers/api/pipelines/wan)
- [diffusers Library Documentation](https://huggingface.co/docs/diffusers)
- [Camera Control LoRA Guide](https://huggingface.co/wan-models)

## Troubleshooting

### Common Issues

**Out of Memory Errors**:

```python
# Enable all memory optimizations
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

# ...and pass reduced generation parameters to the pipeline call:
# num_frames=17 (instead of 25), num_inference_steps=30 (instead of 50)
```

**Slow Generation**:

- Reduce `num_inference_steps`
- Use fewer frames
- Disable CPU offload if you have sufficient VRAM
- Consider the FP8 variants for faster inference

**Quality Issues**:

- Increase `num_inference_steps` (50-75)
- Adjust `guidance_scale` (try 7.0-9.0)
- Improve prompt quality and specificity
- Ensure the input image is high quality

## Best Practices

1. **Image Input**: Use high-quality input images (1024x1024 or higher)
2. **Prompts**: Be specific about motion, lighting, and camera movement
3. **Memory Management**: Monitor VRAM usage and enable optimizations as needed
4. **Experimentation**: Test different schedulers and parameters for your use case
5. **Responsible Use**: Follow ethical AI guidelines and the license terms

## Technical Notes

### FP16 Precision Benefits

- **Numerical Accuracy**: Full 16-bit floating-point precision
- **Quality**: No quantization artifacts or edge cases
- **Compatibility**: Broad GPU and software ecosystem support
- **Research Standard**: Industry standard for development and benchmarking

### VRAM Optimization Techniques

```python
# Technique 1: Attention slicing (roughly 5-10% VRAM reduction)
pipe.enable_attention_slicing()

# Technique 2: VAE slicing (additional 5-10% VRAM reduction)
pipe.enable_vae_slicing()

# Technique 3: Model CPU offload (significant VRAM reduction, slower)
pipe.enable_model_cpu_offload()

# Technique 4: Sequential CPU offload (maximum VRAM reduction, slowest)
# Use either technique 3 or 4, not both; neither should be combined
# with an explicit pipe.to("cuda")
pipe.enable_sequential_cpu_offload()
```
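Before applying any of these techniques, a quick pre-flight check can confirm the GPU meets the minimum requirements listed above. A sketch using standard PyTorch device queries (the 32 GB threshold follows this card's minimum spec):

```python
import torch

# Fail early if no CUDA device is present
if not torch.cuda.is_available():
    raise RuntimeError("A CUDA-capable NVIDIA GPU is required for this model.")

props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB VRAM")

# 32 GB is the minimum from the Hardware Requirements section
if total_gb < 32:
    print("Below the 32 GB minimum: enable CPU offload or consider an FP8 variant.")
```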
## Changelog

### v1.0 (Current)

- Initial release of the WAN 2.1 I2V 480p FP16 model
- 14 billion parameters
- Full FP16 precision
- 480p resolution output
- Compatible with WAN 2.1 camera control LoRAs

---

**Model Version**: v1.0
**Last Updated**: 2024-08-12
**Maintained By**: WAN Model Team

For questions, issues, or contributions, please refer to the official WAN model repositories and community forums.

---

⚠️ **Important**: This is a high-precision model requiring significant computational resources. Ensure your hardware meets the minimum requirements before attempting to load and run it. For production deployment or resource-constrained environments, consider the FP8 quantized variants.