---
license: mit
library_name: diffusers
pipeline_tag: text-to-image
tags:
- sdxl
- text-to-image
- image-generation
---

<!-- README Version: v1.4 -->

# SDXL VAE v1.0 - Improved Variational Autoencoder

A high-quality Variational Autoencoder (VAE) for Stable Diffusion XL (SDXL) models, featuring improved reconstruction quality and detail preservation.

## Model Description

The SDXL VAE is the variational autoencoder component of Stable Diffusion XL. It encodes images into latents and decodes latents back into images. Stability AI retrained the autoencoder architecture used by the original Stable Diffusion with a larger batch size and EMA weight tracking to improve local, high-frequency detail in generated images.

**Key Improvements:**
- **Enhanced Training**: Trained with a larger batch size (256 vs 9) for better convergence
- **Exponential Moving Average (EMA)**: Weights tracked with an EMA for improved stability
- **Superior Reconstruction**: Outperforms the original SD VAE on all evaluated reconstruction metrics
- **Detail Preservation**: Better at preserving fine details and textures
- **Face Quality**: Trained on LAION-Aesthetics and LAION-Humans for improved human subject rendering

This VAE is compatible with SDXL-based models and can be used as a drop-in replacement for the standard VAE to improve output quality.

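The EMA weight tracking listed among the key improvements can be illustrated with a minimal sketch. The function name `ema_update` and the decay value are assumptions chosen for illustration; the decay actually used to train this VAE is not stated here.

```python
def ema_update(ema_weights, new_weights, decay=0.999):
    """Blend the latest weights into the running average, in place.

    `decay` is a hypothetical value; larger values average over more steps.
    """
    for name, value in new_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value
    return ema_weights


# Toy example: one scalar "weight" that sits at 1.0 for three training steps.
ema = {"w": 0.0}
for step_value in [1.0, 1.0, 1.0]:
    ema_update(ema, {"w": step_value}, decay=0.5)

print(ema["w"])  # the average moves toward 1.0 without jumping there
```

The averaged weights change more smoothly than the raw weights, which is why EMA checkpoints are often the ones released for inference.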
## Repository Contents

```
E:\huggingface\sdxl-vae\
├── vae\
│   └── sdxl\
│       └── sdxl-vae.safetensors    # 320 MB - SDXL VAE weights
├── .cache\                         # Cache directory
└── README.md                       # This file

Total Repository Size: ~320 MB
```

### Model Files

| File | Size | Format | Description |
|------|------|--------|-------------|
| `sdxl-vae.safetensors` | 320 MB | SafeTensors | SDXL VAE model weights |

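To check what the weights file contains without loading it, the SafeTensors header can be read with the standard library alone. The sketch below relies on the published file layout (an 8-byte little-endian header length followed by a JSON header); the helper names `read_safetensors_header` and `summarize` are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header):
    """Return the tensor count and the sorted set of dtypes in the header."""
    dtypes = sorted({entry["dtype"] for entry in header.values()})
    return len(header), dtypes


# Usage (path taken from the tree above):
# header = read_safetensors_header(r"E:\huggingface\sdxl-vae\vae\sdxl\sdxl-vae.safetensors")
# print(summarize(header))
```

The dtype list shows whether the checkpoint is stored in FP32 or FP16, which also determines how much memory the weights need once loaded.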
## Hardware Requirements

### Minimum Requirements
- **VRAM**: 4 GB (with image generation model)
- **Disk Space**: 400 MB
- **System RAM**: 8 GB

### Recommended Requirements
- **VRAM**: 8+ GB for optimal performance
- **Disk Space**: 500 MB with cache
- **System RAM**: 16 GB

### Performance Notes
- The VAE itself uses minimal VRAM (~500 MB)
- Total VRAM depends on the main diffusion model used
- Encoding/decoding is fast and adds minimal overhead

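As a rough cross-check of the VRAM figure above, weight memory can be estimated from the file size. This is a back-of-the-envelope sketch that assumes the 320 MB file stores FP32 values with a negligible header; the function name is hypothetical, and activations during encoding or decoding come on top of the weights.

```python
def weight_memory_mb(file_size_mb, stored_bytes_per_param=4, runtime_bytes_per_param=2):
    """Estimate weight memory after casting, given the on-disk size in MB."""
    param_count = file_size_mb * 1e6 / stored_bytes_per_param
    return param_count * runtime_bytes_per_param / 1e6


fp32_mb = weight_memory_mb(320, runtime_bytes_per_param=4)
fp16_mb = weight_memory_mb(320, runtime_bytes_per_param=2)
print(f"FP32 weights: ~{fp32_mb:.0f} MB, FP16 weights: ~{fp16_mb:.0f} MB")
```

The gap between these estimates and the ~500 MB figure is plausibly activation memory, which grows with image resolution and batch size.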
## Usage Examples

### With Diffusers Library (Recommended)

```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch

# Load the improved SDXL VAE. The local folder holds a single checkpoint file
# (no config.json), so use from_single_file rather than from_pretrained.
vae = AutoencoderKL.from_single_file(
    "E:/huggingface/sdxl-vae/vae/sdxl/sdxl-vae.safetensors",
    torch_dtype=torch.float16
)

# Load SDXL pipeline with custom VAE
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)

pipe = pipe.to("cuda")

# Generate image with improved VAE
prompt = "A majestic mountain landscape at sunset, highly detailed"
image = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("output.png")
```

### Loading from Hugging Face Hub

```python
from diffusers import AutoencoderKL
import torch

# Load directly from Hugging Face
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sdxl-vae",
    torch_dtype=torch.float16
)
```

### Replace VAE in Existing Pipeline

```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL
import torch

# Load your existing SDXL pipeline
pipe = StableDiffusionXLPipeline.from_pretrained(
    "your-sdxl-model-path",
    torch_dtype=torch.float16
)

# Replace with improved VAE (single checkpoint file, so from_single_file)
improved_vae = AutoencoderKL.from_single_file(
    "E:/huggingface/sdxl-vae/vae/sdxl/sdxl-vae.safetensors",
    torch_dtype=torch.float16
)

pipe.vae = improved_vae
pipe = pipe.to("cuda")

# Generate with improved quality
image = pipe("detailed portrait photograph").images[0]
```

### Manual Encoding/Decoding

```python
from diffusers import AutoencoderKL
from PIL import Image
import torch
from torchvision import transforms

# Load the VAE from the single checkpoint file. float32 is used here because
# the original SDXL VAE can overflow to NaN when run standalone in float16.
vae = AutoencoderKL.from_single_file(
    "E:/huggingface/sdxl-vae/vae/sdxl/sdxl-vae.safetensors",
    torch_dtype=torch.float32
).to("cuda")

# Load and preprocess image: resize, then scale pixels from [0, 1] to [-1, 1]
image = Image.open("input.png").convert("RGB")
transform = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5])
])
image_tensor = transform(image).unsqueeze(0).to("cuda", dtype=vae.dtype)

# Encode to latent space
with torch.no_grad():
    latents = vae.encode(image_tensor).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

# Decode back to image space
with torch.no_grad():
    latents = latents / vae.config.scaling_factor
    reconstructed = vae.decode(latents).sample

# Convert to PIL image: map [-1, 1] back to [0, 1], then to 8-bit HWC
reconstructed = (reconstructed / 2 + 0.5).clamp(0, 1)
reconstructed = reconstructed.float().cpu().permute(0, 2, 3, 1).numpy()[0]
output_image = Image.fromarray((reconstructed * 255).astype("uint8"))
output_image.save("reconstructed.png")
```

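The pixel scaling in the manual encode/decode example above is easy to get wrong, so here is a dependency-free sketch of the same arithmetic on single values. The function names are hypothetical and exist only for illustration; the formulas mirror `Normalize([0.5], [0.5])` on the way in and `(x / 2 + 0.5).clamp(0, 1)` on the way out.

```python
def to_model_range(pixel):
    """Map a pixel value in [0, 1] to [-1, 1], like Normalize([0.5], [0.5])."""
    return (pixel - 0.5) / 0.5


def to_image_range(value):
    """Map a decoder output in [-1, 1] back to [0, 1], clamping overshoot."""
    return min(max(value / 2 + 0.5, 0.0), 1.0)


def to_uint8(value):
    """Convert a decoder output to an 8-bit channel value (truncating)."""
    return int(to_image_range(value) * 255)


for pixel in (0.0, 0.25, 1.0):
    scaled = to_model_range(pixel)
    print(pixel, "->", scaled, "->", to_uint8(scaled))
```

The clamp matters because decoder outputs can slightly overshoot the [-1, 1] range, and unclamped values would wrap around when cast to `uint8`.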
### ComfyUI Integration

```python
# In ComfyUI, use the "Load VAE" node
# Point to: E:\huggingface\sdxl-vae\vae\sdxl\sdxl-vae.safetensors
# Connect to your SDXL model's VAE input
```

## Model Specifications

### Architecture
- **Type**: Variational Autoencoder (VAE)
- **Architecture**: SDXL AutoencoderKL
- **Latent Channels**: 4
- **Latent Dimension**: 8× compression (1024×1024 → 128×128 latent)
- **Training Batch Size**: 256 (vs 9 in original SD VAE)
- **Weight Tracking**: Exponential Moving Average (EMA)

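The latent dimensions listed above follow directly from the 8× spatial downsampling and the 4 latent channels. A small sketch of that arithmetic, with hypothetical helper names:

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, downsample=8):
    """Return the (N, C, H, W) latent shape for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError(f"height and width must be multiples of {downsample}")
    return (batch_size, latent_channels, height // downsample, width // downsample)


def compression_factor(height, width, image_channels=3):
    """Ratio of image values to latent values for one image."""
    _, c, h, w = latent_shape(height, width)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(1024, 1024))        # (1, 4, 128, 128)
print(compression_factor(1024, 1024))  # 48.0
```

Although each spatial dimension shrinks by 8×, the overall reduction in stored values is 48× because the latent has 4 channels against 3 in the RGB input.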
### Technical Details
- **Format**: SafeTensors (secure, efficient)
- **Precision**: FP32 (native). FP16 works inside diffusers SDXL pipelines, which upcast the VAE for decoding when `force_upcast` is set in its config; standalone FP16 use can overflow to NaN
- **Input Resolution**: 1024×1024 native, supports variable sizes
- **Latent Space**: 4-channel continuous latent representation
- **Compression Ratio**: 8× spatial compression per dimension

### Training Data
- **Datasets**: LAION-Aesthetics + LAION-Humans
- **Focus**: High-quality face and human subject reconstruction
- **Evaluation**: COCO 2017 validation (256×256 images)

### Performance Metrics
Compared to the original SD VAE, the SDXL VAE achieves:
- **rFID**: Lower (better reconstruction quality)
- **PSNR**: Higher (better signal quality)
- **SSIM**: Higher (better structural similarity)
- **PSIM**: Lower (better perceptual quality)

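Of the metrics above, PSNR is simple enough to compute by hand. The sketch below uses the standard definition, 10·log10(MAX² / MSE), on flat lists of 8-bit pixel values. It is an illustration only, not the evaluation code behind the reported comparison.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one intensity level gives an MSE of 1.
print(round(psnr([10, 20, 30], [11, 21, 31]), 2))
```

Higher values mean the reconstruction is closer to the input, which is why the list marks "Higher" as better for PSNR.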
## Performance Tips

### Optimization Strategies

1. **Use FP16 for Speed**
   ```python
   vae = AutoencoderKL.from_single_file(
       "E:/huggingface/sdxl-vae/vae/sdxl/sdxl-vae.safetensors",
       torch_dtype=torch.float16  # halves weight memory; standalone FP16 decoding can overflow to NaN, so check outputs
   )
   ```

2. **Enable Slicing to Reduce Memory**
   ```python
   pipe.enable_attention_slicing()  # Compute attention in steps to lower peak memory
   pipe.enable_vae_slicing()  # Decode batched latents one image at a time
   ```

3. **Batch Processing**
   ```python
   # Process multiple images efficiently
   # batch_images: float tensor of shape (N, 3, H, W), scaled to [-1, 1]
   with torch.no_grad():
       latents = vae.encode(batch_images).latent_dist.sample()
   ```

4. **Compile for Speed (PyTorch 2.0+)**
   ```python
   vae = torch.compile(vae, mode="reduce-overhead")
   ```

### Quality Improvements

- **Always use this VAE with SDXL models** for best quality
- Particularly improves fine details, textures, and faces
- Most noticeable in high-resolution outputs (1024×1024+)
- Reduces artifacts and improves color accuracy

### Memory Management

- VAE decoding is memory-intensive for large batches
- Use `enable_vae_slicing()` for memory-constrained systems
- Consider tiled VAE decoding (`enable_vae_tiling()` in diffusers) for resolutions >1024×1024

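To make the idea of tiled decoding concrete, the sketch below computes overlapping tile positions over a latent grid. It is a simplified illustration with hypothetical function names and tile sizes, not the diffusers implementation, which also blends the overlapping regions to hide seams.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` that cover `length` with overlap."""
    if tile <= overlap:
        raise ValueError("tile must be larger than overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the edge
    return starts


def tile_grid(height, width, tile=64, overlap=16):
    """Top-left (y, x) corners of all tiles over a height x width latent."""
    return [(y, x)
            for y in tile_starts(height, tile, overlap)
            for x in tile_starts(width, tile, overlap)]


# A 2048x2048 image corresponds to a 256x256 latent at 8x compression.
grid = tile_grid(256, 256, tile=64, overlap=16)
print(len(grid), grid[:3])
```

Each tile is decoded on its own, so peak memory scales with the tile size rather than the full image. In practice, calling `pipe.enable_vae_tiling()` is the simpler route.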
## License

**MIT License**

Copyright (c) 2023 Stability AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

## Citation

If you use this VAE in your research or projects, please cite:

```bibtex
@misc{sdxl-vae-2023,
  title={SDXL: Improved Variational Autoencoder},
  author={Stability AI},
  year={2023},
  howpublished={\url{https://huggingface.co/stabilityai/sdxl-vae}},
}

@inproceedings{rombach2022high,
  title={High-resolution image synthesis with latent diffusion models},
  author={Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj{\"o}rn},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10684--10695},
  year={2022}
}
```

## Resources

- **Official Repository**: [stabilityai/sdxl-vae](https://huggingface.co/stabilityai/sdxl-vae)
- **SDXL Base Model**: [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
- **Diffusers Documentation**: [huggingface.co/docs/diffusers](https://huggingface.co/docs/diffusers)
- **Stable Diffusion Research**: [stability.ai/research](https://stability.ai/research)

## Technical Support

For issues specific to this VAE:
- Check the [official model card](https://huggingface.co/stabilityai/sdxl-vae)
- Review [Diffusers VAE documentation](https://huggingface.co/docs/diffusers/api/models/autoencoderkl)
- Visit [Stability AI Community](https://stability.ai/community)

---

**Version**: v1.0
**Last Updated**: October 2025
**Model Version**: SDXL VAE 1.0
**Maintained By**: Local Hugging Face Repository