# Modular SDXL Upscale

Tiled image upscaling for Stable Diffusion XL using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile-boundary artifacts.

Built with Modular Diffusers, composing reusable SDXL blocks into a tiled upscaling workflow with optional ControlNet Tile conditioning.
## Install

```bash
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors
```

Requires diffusers from `main` (modular diffusers support).
## Quick start

```python
from diffusers import ModularPipeline, ControlNetModel
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-sdxl-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.float16)

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-tile-sdxl-1.0", torch_dtype=torch.float16
)
pipe.update_components(controlnet=controlnet)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")
```
## How it works

1. The input image is upscaled to the target resolution using Lanczos interpolation
2. The upscaled image is encoded to latent space via the SDXL VAE
3. Noise is added to the latents based on `strength`
4. At each denoising timestep, the UNet runs on overlapping latent tiles; noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
5. One scheduler step is taken on the full blended prediction
6. After all timesteps, the denoised latents are decoded back to pixel space
7. For upscale factors above 2x with `progressive=True`, steps 1-6 repeat as multiple 2x passes
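The blending in step 4 can be sketched as follows. This is a minimal illustration of MultiDiffusion-style cosine-weighted tile averaging, assuming the per-tile noise predictions are already computed; `cosine_tile_weights` and `blend_tiles` are hypothetical helpers for illustration, not the pipeline's actual internals:

```python
import torch

def cosine_tile_weights(tile_size: int, overlap: int) -> torch.Tensor:
    # Cosine ramp over the overlap band, flat in the tile interior.
    # Endpoints are kept strictly positive so the weight sum never hits zero.
    ramp = 0.5 * (1 - torch.cos(torch.linspace(0, torch.pi, overlap + 2)[1:-1]))
    profile = torch.ones(tile_size)
    profile[:overlap] = ramp
    profile[-overlap:] = ramp.flip(0)
    return profile[:, None] * profile[None, :]  # separable 2D weight map

def blend_tiles(noise_preds, offsets, full_shape, tile=64, overlap=16):
    # Weighted average of overlapping per-tile noise predictions.
    acc = torch.zeros(full_shape)
    wsum = torch.zeros(full_shape[-2:])
    w = cosine_tile_weights(tile, overlap)
    for pred, (y, x) in zip(noise_preds, offsets):
        acc[..., y:y + tile, x:x + tile] += pred * w
        wsum[y:y + tile, x:x + tile] += w
    return acc / wsum
```

Dividing by the accumulated weight sum is what removes seams: each latent position ends up as a weighted average over every tile that covers it, with contributions fading smoothly toward tile boundaries.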
ControlNet Tile is optional but recommended. Without it, the model hallucinates new content instead of enhancing existing detail.
## Examples

### 2x upscale with ControlNet Tile

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
### 4x progressive upscale

Automatically splits into two 2x passes. Auto-strength scales the denoising strength per pass.

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=4.0,
    progressive=True,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
To disable progressive mode:

```python
result = pipe(..., upscale_factor=4.0, progressive=False, strength=0.2)
```
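The progressive split can be pictured with a small sketch. `plan_passes` is a hypothetical helper written for this README; the pipeline's real pass and strength schedule may differ:

```python
def plan_passes(upscale_factor: float, progressive: bool = True) -> list[float]:
    # Split a large upscale factor into repeated 2x passes plus a remainder.
    if not progressive or upscale_factor <= 2.0:
        return [upscale_factor]
    passes = []
    remaining = upscale_factor
    while remaining > 2.0:
        passes.append(2.0)
        remaining /= 2.0
    passes.append(remaining)
    return passes

print(plan_passes(4.0))  # -> [2.0, 2.0]
```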
### Without ControlNet

For cases where you want the model to add creative detail. Use a lower strength.

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    upscale_factor=2.0,
    strength=0.15,
    auto_strength=False,
    num_inference_steps=20,
    output="images",
)
```
### Scheduler selection

```python
result = pipe(..., scheduler_name="DPM++ 2M Karras")
result = pipe(..., scheduler_name="Euler")
result = pipe(..., scheduler_name="DPM++ 2M")
```
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `image` | required | Input image (PIL) |
| `prompt` | `""` | Text prompt |
| `upscale_factor` | `2.0` | Scale multiplier |
| `strength` | `0.3` | Denoise strength. Lower = closer to input. Ignored when `auto_strength=True` |
| `num_inference_steps` | `20` | Denoising steps |
| `guidance_scale` | `7.5` | CFG scale |
| `latent_tile_size` | `64` | Tile size in latent pixels (64 = 512px) |
| `latent_overlap` | `16` | Tile overlap in latent pixels (16 = 128px) |
| `control_image` | `None` | ControlNet conditioning image. Pass the input image for Tile mode |
| `controlnet_conditioning_scale` | `1.0` | ControlNet strength |
| `negative_prompt` | auto | Defaults to `"blurry, low quality, artifacts, noise, jpeg compression"` |
| `progressive` | `True` | Split `upscale_factor > 2` into multiple 2x passes |
| `auto_strength` | `True` | Auto-scale strength based on upscale factor and pass index |
| `use_default_negative` | `True` | Apply the default negative prompt when none is provided |
| `scheduler_name` | `None` | Switch scheduler: `"Euler"`, `"DPM++ 2M"`, `"DPM++ 2M Karras"` |
| `generator` | `None` | Torch generator for reproducibility |
| `output` | `"images"` | Output key |
## Tuning guide

`strength` controls how much the model changes the image.

- 0.15-0.25: minimal changes, mostly sharpening
- 0.25-0.35: balanced enhancement (default with `auto_strength`)
- 0.4+: significant changes, risk of drift
`latent_tile_size` sets the tile size for MultiDiffusion.

- 64 (512px): works on most GPUs. Recommended
- 96 (768px): smoother, needs 24GB+ VRAM
- Below 64: may produce artifacts due to insufficient context
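For intuition about how `latent_tile_size` and `latent_overlap` interact, here is a sketch of 1D tile placement with stride `tile - overlap`. `plan_tile_offsets` is a hypothetical helper, not the pipeline's actual tile planner:

```python
def plan_tile_offsets(length: int, tile: int = 64, overlap: int = 16) -> list[int]:
    # 1D start offsets for overlapping tiles; the final tile is pinned to the
    # edge so the whole latent is covered.
    if length <= tile:
        return [0]
    stride = tile - overlap
    offsets = list(range(0, length - tile + 1, stride))
    if offsets[-1] + tile < length:
        offsets.append(length - tile)
    return offsets
```

For example, a 1024px output is 128 latent pixels; with the defaults this yields tiles starting at 0, 48, and 64, each sharing at least 16 latent pixels with its neighbor.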
`controlnet_conditioning_scale` sets the ControlNet influence.

- 1.0: very faithful to input. Recommended
- 0.7-0.8: slight creative freedom
- Below 0.5: too weak, causes hallucination
`guidance_scale` sets the CFG strength.

- 3-5: softer, more natural
- 7.5: standard
- 10-12: more contrast
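As background, `guidance_scale` enters the standard classifier-free guidance combination used by SDXL-style models. The snippet below is a generic illustration of that formula, not code from this pipeline:

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    # Push the noise prediction away from the unconditional branch,
    # scaled by guidance_scale; scale=1.0 reduces to the conditional prediction.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```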
## Limitations

- SDXL is trained at 1024x1024. Tiles smaller than 512px (`latent_tile_size < 64`) may produce artifacts
- 4x from very small inputs (below 256px) produces distortion. Use progressive mode and start from at least 256px
- ControlNet Tile is required for faithful upscaling. Without it, the model hallucinates new content
- Parameters like `guidance_scale`, `strength`, and `negative_prompt` have subtle visual effects when ControlNet is at scale 1.0. This is by design: the upscaler prioritizes faithfulness
- VRAM: a 2x upscale from 512 to 1024 needs ~10GB; 4x progressive needs ~14GB peak. fp16 and VAE tiling are used automatically
- Not suitable for upscaling text, line art, or pixel art. Use dedicated upscalers for those
## Architecture

```
MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
├─ text_encoder    SDXL TextEncoderStep (reused)
├─ upscale         Lanczos upscale step
├─ tile_plan       Tile planning step
├─ input           SDXL InputStep (reused)
├─ set_timesteps   SDXL Img2Img SetTimestepsStep (reused)
└─ multidiffusion  MultiDiffusion step
   ├─ VAE encode full image
   ├─ per timestep: UNet on each latent tile, cosine-weighted blend
   └─ VAE decode full latents
```

8 SDXL blocks are reused via the public interface; 3 custom blocks are added.
## Models

- Base: `stabilityai/stable-diffusion-xl-base-1.0`
- ControlNet (optional): `xinsir/controlnet-tile-sdxl-1.0`
## References

- MultiDiffusion (Bar-Tal et al., 2023): the tiled latent-space blending algorithm
- Ultimate Upscale for A1111: the A1111 extension that inspired tiled upscaling workflows
- Tiled Diffusion for A1111: a MultiDiffusion implementation for A1111
- Modular Diffusers: the HuggingFace framework this pipeline is built on
- ControlNet Tile: tile-conditioned ControlNet for structure-preserving generation
## Tested on

- Google Colab T4 (16GB VRAM, fp16)
  - 2x: 512x512 to 1024x1024
  - 4x progressive: 256x256 to 1024x1024