---
language:
- en
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- text-to-video
- diffusion
- video-diffusion
- camera-control
- lora
- wan
- wan22
- fp8
- quantized
- gguf
base_model: wan22
base_model_relation: quantized
inference: true
model-index:
- name: WAN 2.2 FP8/GGUF - I2V/T2V Models
results:
- task:
type: image-to-video
name: Image-to-Video Generation
metrics:
- name: Inference Steps
type: steps
value: 50
verified: false
- name: VRAM Usage (FP8)
type: memory_gb
value: 16
verified: false
- task:
type: text-to-video
name: Text-to-Video Generation
metrics:
- name: Inference Steps
type: steps
value: 50
verified: false
- name: VRAM Usage (FP8)
type: memory_gb
value: 16
verified: false
---
# WAN 2.2 FP8 - Image-to-Video and Text-to-Video Models
High-quality image-to-video (I2V) and text-to-video (T2V) generation models in FP8 and GGUF quantized formats for memory-efficient deployment, together with advanced camera control (v2) and visual enhancement LoRAs.
## Model Description
WAN 2.2 FP8 is a 14-billion-parameter, diffusion-based video generation model, packaged here with FP8 and GGUF quantization for efficient deployment on consumer-grade hardware. This repository contains the FP8 and GGUF quantized variants, which provide excellent quality with significantly lower VRAM requirements than the FP16 models.
**Key Features**:
- 14B parameter diffusion-based architecture
- FP8 and GGUF quantized formats for memory efficiency (~50% smaller than FP16)
- Dedicated VAE for video latent encoding/decoding
- Extensive LoRA ecosystem for camera control (v2) and visual enhancement
- Support for both high-noise (creative) and low-noise (faithful) generation modes
- Text-to-video and image-to-video capabilities
**Model Statistics**:
- **Total Repository Size**: ~89GB
- **Model Architecture**: Diffusion-based video generation
- **Supported Formats**: `.safetensors` (FP8), `.gguf` (Q4/Q8 quantized)
- **Parameters**: 14 billion
- **Precision**: FP8 E4M3FN and GGUF Q4/Q8 quantization
- **Input**: Text prompts and/or images
- **Output**: Video sequences (typically 16-24 frames)
## How to Get Started with the Model
Quick start example for image-to-video generation with FP8:
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load your input image
input_image = Image.open("your_image.jpg")

# Load pipeline with FP8 support ("base-model-path" is a placeholder for the base pipeline)
pipe = DiffusionPipeline.from_pretrained(
    "base-model-path",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

# Load I2V model weights (FP8 for balanced performance); .safetensors files are
# read with safetensors, not torch.load. Depending on the pipeline, the denoiser
# may be exposed as pipe.transformer instead of pipe.unet.
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

pipe.to("cuda")

# Generate video
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames
```
For detailed usage examples including camera control, GGUF models, and LoRA combinations, see the [Usage](#usage) section below.
## Directory Structure
```
wan22-fp8/
├── diffusion_models/wan/ # FP8 and GGUF quantized I2V and T2V models
├── loras/wan/ # Camera control (v2), action, and enhancement LoRAs
└── vae/wan/ # Video VAE for latent encoding/decoding
```
## Models
### Base Diffusion Models
Located in `diffusion_models/wan/`
#### Text-to-Video (T2V) Models (FP8)
| Model | Precision | Size | VRAM Required | Use Case |
|-------|-----------|------|---------------|----------|
| `wan22-t2v-high-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, high noise schedule |
| `wan22-t2v-low-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, low noise schedule |
#### Image-to-Video (I2V) Models
**FP8 Precision** (Balanced Quality/Performance):
| Model | Size | VRAM Required | Description |
|-------|------|---------------|-------------|
| `wan22-i2v-high-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Creative generation, higher variance |
| `wan22-i2v-low-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Faithful reproduction, consistent results |
**GGUF Quantized** (Memory Efficient):
| Model | Size | VRAM Required | Quantization | Description |
|-------|------|---------------|--------------|-------------|
| `wan22-i2v-a14b-highnoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, high-noise |
| `wan22-i2v-a14b-lownoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, low-noise |
| `wan22-i2v-a14b-gguf-a14b-high.gguf` | 15GB | 16GB+ | Q8 | Higher precision quantization |
### Video VAE
Located in `vae/wan/`
- **File**: `wan22-vae.safetensors`
- **Size**: 1.4GB
- **Purpose**: Video latent encoder/decoder for compressing video frames
### Enhancement LoRAs
Located in `loras/wan/`
#### Camera Control LoRAs (v2 - Enhanced)
| LoRA | Size | Description | Prompt Examples |
|------|------|-------------|-----------------|
| `wan22-camera-rotation-rank16-v2.safetensors` | 293MB | Rotating camera movements | "rotating camera", "camera circles around subject" |
| `wan22-camera-arcshot-rank16-v2-high.safetensors` | 293MB | Cinematic arc shots | "arc shot", "curved camera movement" |
| `wan22-camera-drone-rank16-v2.safetensors` | 293MB | Aerial drone perspectives | "aerial view", "drone shot", "bird's eye view" |
| `wan22-camera-adr1a-v1.safetensors` | 293MB | Advanced camera control | Custom camera trajectories |
| `wan22-camera-earthzoomout.safetensors` | 293MB | Earth zoom-out effects | "zooming out from earth", "planet zoom" |
#### Visual Enhancement LoRAs
| LoRA | Size | Purpose | Effect |
|------|------|---------|--------|
| `wan22-face-naturalizer.safetensors` | 586MB | Face enhancement | More natural-looking facial movements |
| `wan22-light-volumetric.safetensors` | 293MB | Lighting effects | Volumetric lighting, god rays, atmospheric effects |
| `wan22-light-cinematicflare-i2v-low.safetensors` | 293MB | Lens flare effects | Cinematic lens flares and light blooms for I2V |
| `wan22-upscale-realismboost-t2v-14b.safetensors` | 293MB | Quality boost | Enhanced realism for T2V generation |
#### Action LoRAs
| LoRA | Size | Action Type | Application |
|------|------|-------------|-------------|
| `wan22-action-wink-i2v-v1-low-noise.safetensors` | 147MB | Facial actions | Controlled winking animations |
## Usage
### Basic Image-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import export_to_video
from safetensors.torch import load_file
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support ("path-to-base-model" is a placeholder)
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights into the denoiser (.safetensors files are read
# with safetensors, not torch.load)
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames

# Save video
export_to_video(video, "output.mp4", fps=8)
```
### Text-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch

# Load T2V pipeline ("path-to-base-model" is a placeholder)
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights into the denoiser (.safetensors files are read
# with safetensors, not torch.load)
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-t2v-low-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from text
video = pipe(
    prompt="a cat walking through a garden, high quality, cinematic",
    num_inference_steps=50,
    num_frames=16
).frames
```
### Using Camera Control LoRAs
```python
# After loading base pipeline, add camera control
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-camera-rotation-rank16-v2.safetensors"
)
# Generate with camera movement
video = pipe(
image=input_image,
prompt="rotating camera around a sculpture",
num_inference_steps=50
).frames
```
### Combining Multiple LoRAs
```python
# Load multiple LoRAs with different weights
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors",
adapter_name="camera_drone"
)
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-light-volumetric.safetensors",
adapter_name="volumetric_light"
)
# Set LoRA weights
pipe.set_adapters(["camera_drone", "volumetric_light"], adapter_weights=[0.8, 0.6])
# Generate with combined effects
video = pipe(
image=input_image,
prompt="aerial drone shot with volumetric lighting at sunset",
num_inference_steps=50
).frames
```
### Using Cinematic Flare LoRA
```python
# Load cinematic flare LoRA for I2V
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-light-cinematicflare-i2v-low.safetensors"
)
# Generate with lens flare effects
video = pipe(
image=input_image,
prompt="cinematic lens flare, light bloom, professional cinematography",
num_inference_steps=50
).frames
```
## Model Selection Guide
### Precision Trade-offs (This Repository)
**FP8 Models** (Available in this repo):
- 50% smaller than FP16 (14GB vs 27GB)
- Minimal quality loss compared to FP16
- Faster inference on GPUs with tensor cores
- Balanced quality/performance
- Requires 16GB+ VRAM
- 🎯 Use for: Production deployment, most users, balanced quality
**GGUF Q4_K_S** (Available in this repo):
- Smallest size (8.2GB)
- Works on 12GB VRAM GPUs
- Fastest inference
- More quality degradation than FP8
- 🎯 Use for: Memory-constrained systems, rapid prototyping, testing
**GGUF Q8** (Available in this repo):
- Medium size (15GB)
- Better quality than Q4
- Works on 16GB VRAM GPUs
- 🎯 Use for: Balance between Q4 and FP8 quality
**FP16 Models** (Not in this repo):
- See separate wan22-fp16 repository for full precision variants
- 27GB per model, requires 24GB+ VRAM
- Maximum quality for research and archival use
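The VRAM thresholds above can be expressed as a small helper. This is an illustrative sketch only: the `pick_i2v_variant` function is hypothetical, and the returned file names come from the tables in this README, not from any library API.
```python
import torch

def pick_i2v_variant(device_index: int = 0) -> str:
    """Suggest a WAN 2.2 I2V file from this repo based on total GPU memory."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA GPU is required for these models")
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    if total_gb >= 16:
        # FP8 offers the best quality in this repository; GGUF Q8 is the alternative at 16GB
        return "wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
    if total_gb >= 12:
        # GGUF Q4_K_S is the most memory-efficient option
        return "wan22-i2v-a14b-highnoise-q4-k-s.gguf"
    raise RuntimeError(f"{total_gb:.1f}GB VRAM is below the 12GB minimum for this repository")

print(pick_i2v_variant())
```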
### Noise Schedule Selection
**High-Noise Models**:
- More creative interpretation
- Better for abstract or stylized content
- Higher variance in outputs
**Low-Noise Models**:
- More faithful to input/prompt
- Better for realistic content
- More consistent results
## Hardware Requirements
| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|------------|--------------|------------------|--------------|
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V GGUF Q4 | 12GB | 16GB+ | RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB |
| I2V GGUF Q8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
**System Requirements**:
- CUDA 11.8+ or 12.1+
- PyTorch 2.1+ (with FP8 support)
- diffusers library 0.20+
- 89GB free disk space (full repository)
- 32GB+ system RAM recommended
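A quick environment check against the requirements above (compare the printed versions by hand; this sketch does not enforce any pins):
```python
import torch
import diffusers

print("PyTorch:", torch.__version__)              # 2.1+ recommended for FP8 dtypes
print("diffusers:", diffusers.__version__)        # 0.20+ per the requirements above
print("CUDA available:", torch.cuda.is_available())
print("FP8 dtype present:", hasattr(torch, "float8_e4m3fn"))
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f}GB VRAM")
```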
## Performance Tips
1. **Memory Optimization** (see the combined sketch after this list):
- Start with GGUF Q4 models on 12GB GPUs
- Use FP8 models for 16GB+ GPUs (best quality/VRAM balance)
- Enable `torch.cuda.amp` for mixed precision if needed
- Use gradient checkpointing if fine-tuning
2. **Quality Optimization**:
- FP8 provides best quality in this repository
- Combine multiple LoRAs at weights 0.6-0.8
- Experiment with both high and low noise variants
- For maximum quality, use FP16 models from wan22-fp16 repository
3. **Speed Optimization**:
- Use GGUF Q4 quantized models for rapid prototyping (fastest)
- FP8 models perform well on RTX 40 series with tensor cores
- Reduce num_inference_steps to 20-30 for testing
- Enable xformers attention: `pipe.enable_xformers_memory_efficient_attention()`
4. **GPU-Specific Tips**:
- **RTX 40 series**: FP8 models perform excellently with native support
- **RTX 30 series**: FP8 still faster than FP16, use for 16GB+ cards
- **12GB GPUs**: Use GGUF Q4 models exclusively
- **16GB GPUs**: Choose between FP8 or GGUF Q8 based on quality needs
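Several of the tips above can be combined on a single pipeline. The calls below are standard diffusers memory/speed helpers, but availability depends on the pipeline class and diffusers version, so treat this as a sketch rather than a guaranteed recipe:
```python
# Assumes `pipe` and `input_image` were set up as in the Usage section above.
pipe.enable_xformers_memory_efficient_attention()  # faster, leaner attention (needs xformers)
pipe.enable_attention_slicing()                    # further reduce attention memory
pipe.enable_model_cpu_offload()                    # lower peak VRAM (skip pipe.to("cuda") when using this)

# Fewer steps and frames for quick test renders
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=25,  # 20-30 is enough for testing
    num_frames=12,           # drop from 16 if memory is tight
).frames
```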
## Prompting Guidelines
### Camera Movement Prompts
**Rotation**: "rotating camera", "camera circles around", "360-degree view", "orbital camera"
**Arc Shot**: "arc shot", "curved camera movement", "sweeping motion", "cinematic arc"
**Drone**: "aerial view", "drone shot", "bird's eye view", "flying camera", "overhead shot"
**Zoom**: "zooming out", "zoom in on subject", "dolly zoom"
### Enhancement Prompts
**Volumetric Lighting**: "volumetric light rays", "god rays", "atmospheric lighting", "light shafts"
**Cinematic Flare**: "lens flare", "cinematic bloom", "light bloom", "flare effects"
**Face Natural**: Use with portrait videos for more realistic facial expressions and movements
## File Formats
- **`.safetensors`**: Secure tensor format, recommended for most use cases
- **`.gguf`**: Quantized format for memory-constrained environments
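To inspect what a `.safetensors` checkpoint actually contains (for example, to confirm the FP8 files carry quantized weights plus their scale tensors), the file can be opened lazily with the `safetensors` library instead of loading it in full:
```python
from safetensors import safe_open

path = "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"

with safe_open(path, framework="pt") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors")
    for name in keys[:5]:          # peek at the first few entries
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```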
## Intended Uses
### Direct Use
WAN 2.2 is designed for:
- **Content Creation**: Generate videos from text descriptions or images for creative projects, advertising, and entertainment
- **Prototyping**: Rapid video concept visualization for storyboarding and pre-production
- **Research**: Academic research in video generation, diffusion models, and controllable video synthesis
- **Application Development**: Building video generation features in applications and services
### Downstream Use
- Fine-tuning on domain-specific video datasets
- Integration with video editing pipelines
- Custom LoRA development for specialized camera movements or visual effects
- Video dataset augmentation and synthetic data generation
### Out-of-Scope Use
The model should NOT be used for:
- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content that violates copyright or intellectual property rights
- Generating content intended to harass, abuse, or discriminate
- Creating videos for illegal purposes or activities
## Bias, Risks, and Limitations
### Known Limitations
**Technical Limitations**:
- **Temporal Consistency**: May produce flickering or inconsistent motion in long sequences
- **Fine Details**: Small objects or intricate textures may lack detail or consistency
- **Physical Realism**: Generated physics may not always follow real-world rules (gravity, momentum, etc.)
- **Text Rendering**: Cannot reliably render readable text within generated videos
- **Face Quality**: Faces may show artifacts or unnatural movements (mitigated by face-naturalizer LoRA)
- **Memory Requirements**: High VRAM requirements limit accessibility (12-32GB depending on precision)
**Content Limitations**:
- Training data biases may affect representation of diverse demographics, cultures, and scenarios
- May struggle with uncommon objects, rare scenarios, or niche content
- Camera control may not always precisely match intended movements
- Generated content may reflect biases present in training data
### Risks and Mitigations
**Misuse Risks**:
- **Deepfakes and Misinformation**: Model could be used to create deceptive content
- *Mitigation*: Implement content authentication, watermarking, and usage monitoring
- **Copyright Infringement**: May generate content similar to copyrighted material
- *Mitigation*: Avoid training on copyrighted data, implement content filtering
- **Harmful Content**: Could generate disturbing or inappropriate content
- *Mitigation*: Implement safety filters, content moderation, and responsible use guidelines
**Ethical Considerations**:
- Users should obtain appropriate permissions before generating videos of identifiable individuals
- Generated content should be clearly labeled as AI-generated to prevent deception
- Consider environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights
### Recommendations
- Implement content moderation and safety filters in production deployments
- Add visible/invisible watermarks to identify AI-generated content
- Provide clear disclaimers that content is AI-generated
- Monitor for misuse and implement usage policies
- Consider accessibility trade-offs when selecting model precision
- Validate outputs for unintended biases or harmful content before distribution
## Training Details
### Training Data
Training data details are not publicly available. Typical video diffusion models are trained on:
- Large-scale video datasets with diverse content
- Text-video pairs for caption conditioning
- Image-video pairs for image-to-video tasks
**Note**: Specific training dataset information should be obtained from the original model authors.
### Training Procedure
**Model and Training Configuration** (typical for models of this scale):
- Architecture: Diffusion transformer with 14B parameters
- Precision formats: FP16, FP8, GGUF quantization
- Video VAE: Separate encoder/decoder for latent compression
- LoRA adapters: Rank-16 to rank-64 for camera control
**Noise Schedules**:
- **High-noise models**: Greater noise variance for creative generation
- **Low-noise models**: Lower noise variance for faithful reproduction
### Compute Infrastructure
**Inference Requirements (This Repository)**:
- **FP8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090, RTX 4070 Ti Super)
- **GGUF Q4**: 12-16GB VRAM (NVIDIA RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB)
- **GGUF Q8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090)
## Environmental Impact
Video generation models require significant computational resources. This FP8/GGUF repository provides more efficient alternatives:
- **Model Size**: 89GB total (FP8 + GGUF variants + LoRAs)
- **Inference Power**: 100-350W depending on GPU and model precision
- **Carbon Footprint**: Varies by energy source and usage patterns
- **Efficiency**: ~40% VRAM reduction vs FP16, enabling use on consumer GPUs
**Recommendations for Reducing Impact**:
- Use GGUF Q4 quantized models for maximum efficiency (8.2GB vs 27GB FP16)
- FP8 models provide excellent quality/efficiency balance
- Batch process multiple requests to amortize overhead
- Use energy-efficient hardware (RTX 40 series with tensor cores)
- Use renewable energy sources when possible
- Consider carbon offset for production deployments
## License
Please check the original WAN 2.2 model repository for specific license terms and usage restrictions. This repository uses the "other" license tag pending clarification of the original license.
## Citation
If you use WAN 2.2 in your research or applications, please cite the original model repository.
**BibTeX** (template):
```bibtex
@misc{wan22,
title={WAN 2.2: Image-to-Video and Text-to-Video Generation},
author={[Original Authors]},
year={2024},
howpublished={\url{https://huggingface.co/[original-repo]}},
}
```
## Model Card Authors
This model card was created by the repository maintainer based on available model information and standard Hugging Face model card guidelines.
## Model Card Contact
For questions about this model card or repository, please open an issue in the repository or contact the original model authors.
## Troubleshooting
**Out of Memory Errors**:
- Switch from FP8 to GGUF Q4 quantized models (12GB VRAM)
- Switch from GGUF Q8 to Q4 if still out of memory
- Reduce `num_frames` (try 8 or 12 instead of 16)
- Reduce batch size to 1
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential CPU offload: `pipe.enable_sequential_cpu_offload()`
**Quality Issues**:
- Try both high-noise and low-noise variants
- If using GGUF Q4, try FP8 for better quality (requires 16GB+ VRAM)
- If using FP8 and need maximum quality, see wan22-fp16 repository
- Adjust LoRA weights (0.5-1.0 range)
- Increase `num_inference_steps` (50-100)
**Slow Generation**:
- GGUF Q4 models are fastest for rapid iteration
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 20-30 for testing
- FP8 performs best on RTX 40 series GPUs with native support
**GGUF Model Loading Issues**:
- Ensure you're using a GGUF-compatible loader
- GGUF models may require specific diffusers versions
- Check llama.cpp or GGUF-specific loading documentation (see the sketch below)
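For loading GGUF files with diffusers, recent releases (roughly 0.32+) provide `GGUFQuantizationConfig` together with `from_single_file`. The sketch below assumes that support and uses `WanTransformer3DModel` as the denoiser class; if your diffusers version names the WAN model class differently, adjust accordingly:
```python
import torch
# Class name assumed; verify against your installed diffusers version
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel

gguf_path = "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-a14b-highnoise-q4-k-s.gguf"

transformer = WanTransformer3DModel.from_single_file(
    gguf_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# Attach to an existing pipeline before moving it to the GPU, e.g.:
# pipe.transformer = transformer
```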
## Support
For issues, questions, or contributions, please refer to the main Hugging Face model repository.
## Related Repositories
- **wan22-fp16**: Full precision FP16 variants (27GB per model, maximum quality)
- **wan21-fp8**: WAN 2.1 FP8 models (camera control v1, I2V only)
- **wan21-fp16**: WAN 2.1 FP16 models (camera control v1, I2V only)
## Summary
This repository contains WAN 2.2 models optimized for deployment on consumer-grade hardware through FP8 and GGUF quantization:
- **89GB total** (vs 142GB for full precision variants)
- **FP8 models**: 14GB each, excellent quality/VRAM balance
- **GGUF Q4 models**: 8.2GB each, maximum memory efficiency
- **Camera Control v2**: Enhanced camera LoRAs vs v1 in WAN 2.1
- **10 Enhancement LoRAs**: Camera control (5), lighting (2), face enhancement (1), quality boost (1), actions (1)
- **Both I2V and T2V**: Image-to-video and text-to-video capabilities
**Recommended for**: Production deployment, consumer GPUs (12GB+), balanced quality/performance needs
---
**Last Updated**: October 2024
**Model Version**: WAN 2.2 FP8/GGUF
**Repository Type**: Quantized Model Weights Storage
**Repository Size**: ~89GB