---
language:
- en
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- text-to-video
- diffusion
- video-diffusion
- camera-control
- lora
- wan
- wan22
- fp8
- quantized
- gguf
base_model: wan22
base_model_relation: quantized
inference: true
model-index:
- name: WAN 2.2 FP8/GGUF - I2V/T2V Models
results:
- task:
type: image-to-video
name: Image-to-Video Generation
metrics:
- name: Inference Steps
type: steps
value: 50
verified: false
- name: VRAM Usage (FP8)
type: memory_gb
value: 16
verified: false
- task:
type: text-to-video
name: Text-to-Video Generation
metrics:
- name: Inference Steps
type: steps
value: 50
verified: false
- name: VRAM Usage (FP8)
type: memory_gb
value: 16
verified: false
---
# WAN 2.2 FP8 - Image-to-Video and Text-to-Video Models
High-quality image-to-video (I2V) and text-to-video (T2V) generation models in FP8 and GGUF quantized formats for memory-efficient deployment, together with advanced camera control (v2) and visual enhancement LoRAs.
## Model Description
WAN 2.2 FP8 is a 14-billion-parameter, diffusion-based video generation model, packaged here with FP8 and GGUF quantization for efficient deployment on consumer-grade hardware. This repository contains the FP8 and GGUF quantized variants, which provide excellent quality with significantly lower VRAM requirements than the FP16 models.
**Key Features**:
- 14B parameter diffusion-based architecture
- FP8 and GGUF quantized formats for memory efficiency (~50% smaller than FP16)
- Dedicated VAE for video latent encoding/decoding
- Extensive LoRA ecosystem for camera control (v2) and visual enhancement
- Support for both high-noise (creative) and low-noise (faithful) generation modes
- Text-to-video and image-to-video capabilities
**Model Statistics**:
- **Total Repository Size**: ~89GB
- **Model Architecture**: Diffusion-based video generation
- **Supported Formats**: `.safetensors` (FP8), `.gguf` (Q4/Q8 quantized)
- **Parameters**: 14 billion
- **Precision**: FP8 E4M3FN and GGUF Q4/Q8 quantization
- **Input**: Text prompts and/or images
- **Output**: Video sequences (typically 16-24 frames)
## How to Get Started with the Model
Quick start example for image-to-video generation with FP8:
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load your input image
input_image = Image.open("your_image.jpg")

# Load pipeline with FP8 support ("base-model-path" is a placeholder for the base pipeline)
pipe = DiffusionPipeline.from_pretrained(
    "base-model-path",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

# Load I2V model weights (FP8 for balanced performance); .safetensors files are
# read with safetensors, not torch.load. Depending on the pipeline, the denoiser
# may be exposed as pipe.transformer instead of pipe.unet.
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

pipe.to("cuda")

# Generate video
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames
```
For detailed usage examples including camera control, GGUF models, and LoRA combinations, see the [Usage](#usage) section below.
## Directory Structure
```
wan22-fp8/
├── diffusion_models/wan/ # FP8 and GGUF quantized I2V and T2V models
├── loras/wan/ # Camera control (v2), action, and enhancement LoRAs
└── vae/wan/ # Video VAE for latent encoding/decoding
```
## Models
### Base Diffusion Models
Located in `diffusion_models/wan/`
#### Text-to-Video (T2V) Models (FP8)
| Model | Precision | Size | VRAM Required | Use Case |
|-------|-----------|------|---------------|----------|
| `wan22-t2v-high-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, high noise schedule |
| `wan22-t2v-low-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, low noise schedule |
#### Image-to-Video (I2V) Models
**FP8 Precision** (Balanced Quality/Performance):
| Model | Size | VRAM Required | Description |
|-------|------|---------------|-------------|
| `wan22-i2v-high-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Creative generation, higher variance |
| `wan22-i2v-low-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Faithful reproduction, consistent results |
**GGUF Quantized** (Memory Efficient):
| Model | Size | VRAM Required | Quantization | Description |
|-------|------|---------------|--------------|-------------|
| `wan22-i2v-a14b-highnoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, high-noise |
| `wan22-i2v-a14b-lownoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, low-noise |
| `wan22-i2v-a14b-gguf-a14b-high.gguf` | 15GB | 16GB+ | Q8 | Higher precision quantization |
### Video VAE
Located in `vae/wan/`
- **File**: `wan22-vae.safetensors`
- **Size**: 1.4GB
- **Purpose**: Video latent encoder/decoder for compressing video frames
### Enhancement LoRAs
Located in `loras/wan/`
#### Camera Control LoRAs (v2 - Enhanced)
| LoRA | Size | Description | Prompt Examples |
|------|------|-------------|-----------------|
| `wan22-camera-rotation-rank16-v2.safetensors` | 293MB | Rotating camera movements | "rotating camera", "camera circles around subject" |
| `wan22-camera-arcshot-rank16-v2-high.safetensors` | 293MB | Cinematic arc shots | "arc shot", "curved camera movement" |
| `wan22-camera-drone-rank16-v2.safetensors` | 293MB | Aerial drone perspectives | "aerial view", "drone shot", "bird's eye view" |
| `wan22-camera-adr1a-v1.safetensors` | 293MB | Advanced camera control | Custom camera trajectories |
| `wan22-camera-earthzoomout.safetensors` | 293MB | Earth zoom-out effects | "zooming out from earth", "planet zoom" |
#### Visual Enhancement LoRAs
| LoRA | Size | Purpose | Effect |
|------|------|---------|--------|
| `wan22-face-naturalizer.safetensors` | 586MB | Face enhancement | More natural-looking facial movements |
| `wan22-light-volumetric.safetensors` | 293MB | Lighting effects | Volumetric lighting, god rays, atmospheric effects |
| `wan22-light-cinematicflare-i2v-low.safetensors` | 293MB | Lens flare effects | Cinematic lens flares and light blooms for I2V |
| `wan22-upscale-realismboost-t2v-14b.safetensors` | 293MB | Quality boost | Enhanced realism for T2V generation |
#### Action LoRAs
| LoRA | Size | Action Type | Application |
|------|------|-------------|-------------|
| `wan22-action-wink-i2v-v1-low-noise.safetensors` | 147MB | Facial actions | Controlled winking animations |
## Usage
### Basic Image-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import export_to_video
from safetensors.torch import load_file
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support ("path-to-base-model" is a placeholder)
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights into the denoiser (.safetensors files are read
# with safetensors, not torch.load)
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames

# Save video
export_to_video(video, "output.mp4", fps=8)
```
### Text-to-Video Generation (FP8)
```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch

# Load T2V pipeline ("path-to-base-model" is a placeholder)
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights into the denoiser (.safetensors files are read
# with safetensors, not torch.load)
state_dict = load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-t2v-low-noise-14b-fp8-scaled.safetensors"
)
pipe.unet.load_state_dict(state_dict)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from text
video = pipe(
    prompt="a cat walking through a garden, high quality, cinematic",
    num_inference_steps=50,
    num_frames=16
).frames
```
### Using Camera Control LoRAs
```python
# After loading base pipeline, add camera control
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-camera-rotation-rank16-v2.safetensors"
)
# Generate with camera movement
video = pipe(
image=input_image,
prompt="rotating camera around a sculpture",
num_inference_steps=50
).frames
```
### Combining Multiple LoRAs
```python
# Load multiple LoRAs with different weights
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors",
adapter_name="camera_drone"
)
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-light-volumetric.safetensors",
adapter_name="volumetric_light"
)
# Set LoRA weights
pipe.set_adapters(["camera_drone", "volumetric_light"], adapter_weights=[0.8, 0.6])
# Generate with combined effects
video = pipe(
image=input_image,
prompt="aerial drone shot with volumetric lighting at sunset",
num_inference_steps=50
).frames
```
### Using Cinematic Flare LoRA
```python
# Load cinematic flare LoRA for I2V
pipe.load_lora_weights(
"E:/huggingface/wan22-fp8/loras/wan/wan22-light-cinematicflare-i2v-low.safetensors"
)
# Generate with lens flare effects
video = pipe(
image=input_image,
prompt="cinematic lens flare, light bloom, professional cinematography",
num_inference_steps=50
).frames
```
## Model Selection Guide
### Precision Trade-offs (This Repository)
**FP8 Models** (Available in this repo):
- 50% smaller than FP16 (14GB vs 27GB)
- Minimal quality loss compared to FP16
- Faster inference on GPUs with tensor cores
- Balanced quality/performance
- Requires 16GB+ VRAM
- 🎯 Use for: Production deployment, most users, balanced quality
**GGUF Q4_K_S** (Available in this repo):
- Smallest size (8.2GB)
- Works on 12GB VRAM GPUs
- Fastest inference
- More quality degradation than FP8
- 🎯 Use for: Memory-constrained systems, rapid prototyping, testing
**GGUF Q8** (Available in this repo):
- Medium size (15GB)
- Better quality than Q4
- Works on 16GB VRAM GPUs
- 🎯 Use for: Balance between Q4 and FP8 quality
**FP16 Models** (Not in this repo):
- See separate wan22-fp16 repository for full precision variants
- 27GB per model, requires 24GB+ VRAM
- Maximum quality for research and archival use
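The VRAM thresholds above can be expressed as a small helper. This is an illustrative sketch only: the `pick_i2v_variant` function is hypothetical, and the returned file names come from the tables in this README, not from any library API.
```python
import torch

def pick_i2v_variant(device_index: int = 0) -> str:
    """Suggest a WAN 2.2 I2V file from this repo based on total GPU memory."""
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA GPU is required for these models")
    total_gb = torch.cuda.get_device_properties(device_index).total_memory / 1024**3
    if total_gb >= 16:
        # FP8 offers the best quality in this repository; GGUF Q8 is the alternative at 16GB
        return "wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
    if total_gb >= 12:
        # GGUF Q4_K_S is the most memory-efficient option
        return "wan22-i2v-a14b-highnoise-q4-k-s.gguf"
    raise RuntimeError(f"{total_gb:.1f}GB VRAM is below the 12GB minimum for this repository")

print(pick_i2v_variant())
```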
### Noise Schedule Selection
**High-Noise Models**:
- More creative interpretation
- Better for abstract or stylized content
- Higher variance in outputs
**Low-Noise Models**:
- More faithful to input/prompt
- Better for realistic content
- More consistent results
## Hardware Requirements
| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|------------|--------------|------------------|--------------|
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V GGUF Q4 | 12GB | 16GB+ | RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB |
| I2V GGUF Q8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
**System Requirements**:
- CUDA 11.8+ or 12.1+
- PyTorch 2.1+ (with FP8 support)
- diffusers library 0.20+
- 89GB free disk space (full repository)
- 32GB+ system RAM recommended
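A quick environment check against the requirements above (compare the printed versions by hand; this sketch does not enforce any pins):
```python
import torch
import diffusers

print("PyTorch:", torch.__version__)              # 2.1+ recommended for FP8 dtypes
print("diffusers:", diffusers.__version__)        # 0.20+ per the requirements above
print("CUDA available:", torch.cuda.is_available())
print("FP8 dtype present:", hasattr(torch, "float8_e4m3fn"))
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f}GB VRAM")
```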
## Performance Tips
1. **Memory Optimization** (see the combined sketch after this list):
- Start with GGUF Q4 models on 12GB GPUs
- Use FP8 models for 16GB+ GPUs (best quality/VRAM balance)
- Enable `torch.cuda.amp` for mixed precision if needed
- Use gradient checkpointing if fine-tuning
2. **Quality Optimization**:
- FP8 provides best quality in this repository
- Combine multiple LoRAs at weights 0.6-0.8
- Experiment with both high and low noise variants
- For maximum quality, use FP16 models from wan22-fp16 repository
3. **Speed Optimization**:
- Use GGUF Q4 quantized models for rapid prototyping (fastest)
- FP8 models perform well on RTX 40 series with tensor cores
- Reduce num_inference_steps to 20-30 for testing
- Enable xformers attention: `pipe.enable_xformers_memory_efficient_attention()`
4. **GPU-Specific Tips**:
- **RTX 40 series**: FP8 models perform excellently with native support
- **RTX 30 series**: FP8 still faster than FP16, use for 16GB+ cards
- **12GB GPUs**: Use GGUF Q4 models exclusively
- **16GB GPUs**: Choose between FP8 or GGUF Q8 based on quality needs
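Several of the tips above can be combined on a single pipeline. The calls below are standard diffusers memory/speed helpers, but availability depends on the pipeline class and diffusers version, so treat this as a sketch rather than a guaranteed recipe:
```python
# Assumes `pipe` and `input_image` were set up as in the Usage section above.
pipe.enable_xformers_memory_efficient_attention()  # faster, leaner attention (needs xformers)
pipe.enable_attention_slicing()                    # further reduce attention memory
pipe.enable_model_cpu_offload()                    # lower peak VRAM (skip pipe.to("cuda") when using this)

# Fewer steps and frames for quick test renders
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=25,  # 20-30 is enough for testing
    num_frames=12,           # drop from 16 if memory is tight
).frames
```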
## Prompting Guidelines
### Camera Movement Prompts
**Rotation**: "rotating camera", "camera circles around", "360-degree view", "orbital camera"
**Arc Shot**: "arc shot", "curved camera movement", "sweeping motion", "cinematic arc"
**Drone**: "aerial view", "drone shot", "bird's eye view", "flying camera", "overhead shot"
**Zoom**: "zooming out", "zoom in on subject", "dolly zoom"
### Enhancement Prompts
**Volumetric Lighting**: "volumetric light rays", "god rays", "atmospheric lighting", "light shafts"
**Cinematic Flare**: "lens flare", "cinematic bloom", "light bloom", "flare effects"
**Face Natural**: Use with portrait videos for more realistic facial expressions and movements
## File Formats
- **`.safetensors`**: Secure tensor format, recommended for most use cases
- **`.gguf`**: Quantized format for memory-constrained environments
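To inspect what a `.safetensors` checkpoint actually contains (for example, to confirm the FP8 files carry quantized weights plus their scale tensors), the file can be opened lazily with the `safetensors` library instead of loading it in full:
```python
from safetensors import safe_open

path = "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"

with safe_open(path, framework="pt") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors")
    for name in keys[:5]:          # peek at the first few entries
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```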
## Intended Uses
### Direct Use
WAN 2.2 is designed for:
- **Content Creation**: Generate videos from text descriptions or images for creative projects, advertising, and entertainment
- **Prototyping**: Rapid video concept visualization for storyboarding and pre-production
- **Research**: Academic research in video generation, diffusion models, and controllable video synthesis
- **Application Development**: Building video generation features in applications and services
### Downstream Use
- Fine-tuning on domain-specific video datasets
- Integration with video editing pipelines
- Custom LoRA development for specialized camera movements or visual effects
- Video dataset augmentation and synthetic data generation
### Out-of-Scope Use
The model should NOT be used for:
- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content that violates copyright or intellectual property rights
- Generating content intended to harass, abuse, or discriminate
- Creating videos for illegal purposes or activities
## Bias, Risks, and Limitations
### Known Limitations
**Technical Limitations**:
- **Temporal Consistency**: May produce flickering or inconsistent motion in long sequences
- **Fine Details**: Small objects or intricate textures may lack detail or consistency
- **Physical Realism**: Generated physics may not always follow real-world rules (gravity, momentum, etc.)
- **Text Rendering**: Cannot reliably render readable text within generated videos
- **Face Quality**: Faces may show artifacts or unnatural movements (mitigated by face-naturalizer LoRA)
- **Memory Requirements**: High VRAM requirements limit accessibility (12-32GB depending on precision)
**Content Limitations**:
- Training data biases may affect representation of diverse demographics, cultures, and scenarios
- May struggle with uncommon objects, rare scenarios, or niche content
- Camera control may not always precisely match intended movements
- Generated content may reflect biases present in training data
### Risks and Mitigations
**Misuse Risks**:
- **Deepfakes and Misinformation**: Model could be used to create deceptive content
- *Mitigation*: Implement content authentication, watermarking, and usage monitoring
- **Copyright Infringement**: May generate content similar to copyrighted material
- *Mitigation*: Avoid training on copyrighted data, implement content filtering
- **Harmful Content**: Could generate disturbing or inappropriate content
- *Mitigation*: Implement safety filters, content moderation, and responsible use guidelines
**Ethical Considerations**:
- Users should obtain appropriate permissions before generating videos of identifiable individuals
- Generated content should be clearly labeled as AI-generated to prevent deception
- Consider environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights
### Recommendations
- Implement content moderation and safety filters in production deployments
- Add visible/invisible watermarks to identify AI-generated content
- Provide clear disclaimers that content is AI-generated
- Monitor for misuse and implement usage policies
- Consider accessibility trade-offs when selecting model precision
- Validate outputs for unintended biases or harmful content before distribution
## Training Details
### Training Data
Training data details are not publicly available. Typical video diffusion models are trained on:
- Large-scale video datasets with diverse content
- Text-video pairs for caption conditioning
- Image-video pairs for image-to-video tasks
**Note**: Specific training dataset information should be obtained from the original model authors.
### Training Procedure
**Model and Training Configuration** (typical for models of this scale):
- Architecture: Diffusion transformer with 14B parameters
- Precision formats: FP16, FP8, GGUF quantization
- Video VAE: Separate encoder/decoder for latent compression
- LoRA adapters: Rank-16 to rank-64 for camera control
**Noise Schedules**:
- **High-noise models**: Greater noise variance for creative generation
- **Low-noise models**: Lower noise variance for faithful reproduction
### Compute Infrastructure
**Inference Requirements (This Repository)**:
- **FP8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090, RTX 4070 Ti Super)
- **GGUF Q4**: 12-16GB VRAM (NVIDIA RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB)
- **GGUF Q8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090)
## Environmental Impact
Video generation models require significant computational resources. This FP8/GGUF repository provides more efficient alternatives:
- **Model Size**: 89GB total (FP8 + GGUF variants + LoRAs)
- **Inference Power**: 100-350W depending on GPU and model precision
- **Carbon Footprint**: Varies by energy source and usage patterns
- **Efficiency**: ~40% VRAM reduction vs FP16, enabling use on consumer GPUs
**Recommendations for Reducing Impact**:
- Use GGUF Q4 quantized models for maximum efficiency (8.2GB vs 27GB FP16)
- FP8 models provide excellent quality/efficiency balance
- Batch process multiple requests to amortize overhead
- Use energy-efficient hardware (RTX 40 series with tensor cores)
- Use renewable energy sources when possible
- Consider carbon offset for production deployments
## License
Please check the original WAN 2.2 model repository for specific license terms and usage restrictions. This repository uses the "other" license tag pending clarification of the original license.
## Citation
If you use WAN 2.2 in your research or applications, please cite the original model repository.
**BibTeX** (template):
```bibtex
@misc{wan22,
title={WAN 2.2: Image-to-Video and Text-to-Video Generation},
author={[Original Authors]},
year={2024},
howpublished={\url{https://huggingface.co/[original-repo]}},
}
```
## Model Card Authors
This model card was created by the repository maintainer based on available model information and standard Hugging Face model card guidelines.
## Model Card Contact
For questions about this model card or repository, please open an issue in the repository or contact the original model authors.
## Troubleshooting
**Out of Memory Errors**:
- Switch from FP8 to GGUF Q4 quantized models (12GB VRAM)
- Switch from GGUF Q8 to Q4 if still out of memory
- Reduce `num_frames` (try 8 or 12 instead of 16)
- Reduce batch size to 1
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential CPU offload: `pipe.enable_sequential_cpu_offload()`
**Quality Issues**:
- Try both high-noise and low-noise variants
- If using GGUF Q4, try FP8 for better quality (requires 16GB+ VRAM)
- If using FP8 and need maximum quality, see wan22-fp16 repository
- Adjust LoRA weights (0.5-1.0 range)
- Increase `num_inference_steps` (50-100)
**Slow Generation**:
- GGUF Q4 models are fastest for rapid iteration
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 20-30 for testing
- FP8 performs best on RTX 40 series GPUs with native support
**GGUF Model Loading Issues**:
- Ensure you're using a GGUF-compatible loader
- GGUF models may require specific diffusers versions
- Check llama.cpp or GGUF-specific loading documentation (see the sketch below)
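For loading GGUF files with diffusers, recent releases (roughly 0.32+) provide `GGUFQuantizationConfig` together with `from_single_file`. The sketch below assumes that support and uses `WanTransformer3DModel` as the denoiser class; if your diffusers version names the WAN model class differently, adjust accordingly:
```python
import torch
# Class name assumed; verify against your installed diffusers version
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel

gguf_path = "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-a14b-highnoise-q4-k-s.gguf"

transformer = WanTransformer3DModel.from_single_file(
    gguf_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# Attach to an existing pipeline before moving it to the GPU, e.g.:
# pipe.transformer = transformer
```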
## Support
For issues, questions, or contributions, please refer to the main Hugging Face model repository.
## Related Repositories
- **wan22-fp16**: Full precision FP16 variants (27GB per model, maximum quality)
- **wan21-fp8**: WAN 2.1 FP8 models (camera control v1, I2V only)
- **wan21-fp16**: WAN 2.1 FP16 models (camera control v1, I2V only)
## Summary
This repository contains WAN 2.2 models optimized for deployment on consumer-grade hardware through FP8 and GGUF quantization:
- **89GB total** (vs 142GB for full precision variants)
- **FP8 models**: 14GB each, excellent quality/VRAM balance
- **GGUF Q4 models**: 8.2GB each, maximum memory efficiency
- **Camera Control v2**: Enhanced camera LoRAs vs v1 in WAN 2.1
- **10 Enhancement LoRAs**: Camera control (5), lighting (2), face enhancement (1), quality boost (1), actions (1)
- **Both I2V and T2V**: Image-to-video and text-to-video capabilities
**Recommended for**: Production deployment, consumer GPUs (12GB+), balanced quality/performance needs
---
**Last Updated**: October 2024
**Model Version**: WAN 2.2 FP8/GGUF
**Repository Type**: Quantized Model Weights Storage
**Repository Size**: ~89GB