Modular SDXL Upscale

Tiled image upscaling for Stable Diffusion XL using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile boundary artifacts.

Built with Modular Diffusers, composing reusable SDXL blocks into a tiled upscaling workflow with optional ControlNet Tile conditioning.


Install

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors

Requires diffusers from main (modular diffusers support).

Quick start

from diffusers import ModularPipeline, ControlNetModel
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-sdxl-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.float16)

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-tile-sdxl-1.0", torch_dtype=torch.float16
)
pipe.update_components(controlnet=controlnet)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")

How it works

  1. Input image is upscaled to the target resolution using Lanczos interpolation
  2. Upscaled image is encoded to latent space via the SDXL VAE
  3. Noise is added to the latents based on strength
  4. At each denoising timestep, the UNet runs on overlapping latent tiles. Noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
  5. One scheduler step is taken on the full blended prediction
  6. After all timesteps, denoised latents are decoded back to pixel space
  7. For upscale factors above 2x with progressive=True, steps 1-6 repeat as multiple 2x passes

ControlNet Tile is optional but recommended. Without it, the model hallucinates new content instead of enhancing existing detail.
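The tile blending in step 4 can be sketched as follows. This is an illustrative numpy reimplementation of MultiDiffusion-style cosine blending, not the pipeline's actual code; function names are hypothetical.

```python
# Illustrative sketch of MultiDiffusion-style tile blending (not the
# pipeline's actual code): per-tile noise predictions are accumulated
# with boundary-aware cosine weights, then normalized, so overlapping
# tiles agree and no seams appear.
import numpy as np

def cosine_weight(size, overlap):
    """1D weight profile: cosine ramp over each overlap band, flat middle."""
    w = np.ones(size)
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, overlap)))
    w[:overlap] = ramp
    w[size - overlap:] = ramp[::-1]
    return w

def blend_tiles(full_shape, tile_preds, positions, tile_size, overlap):
    """Blend overlapping square tile predictions into one full-size map."""
    acc = np.zeros(full_shape)
    norm = np.zeros(full_shape)
    w1d = cosine_weight(tile_size, overlap)
    w2d = np.outer(w1d, w1d)
    for pred, (y, x) in zip(tile_preds, positions):
        acc[y:y + tile_size, x:x + tile_size] += pred * w2d
        norm[y:y + tile_size, x:x + tile_size] += w2d
    return acc / np.maximum(norm, 1e-8)
```

Because every position is divided by the total weight that landed on it, regions covered by several tiles converge to a weighted average rather than a hard seam.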

Examples

2x upscale with ControlNet Tile

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

4x progressive upscale

Automatically splits the upscale into two 2x passes; auto-strength adjusts the denoising strength for each pass.

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=4.0,
    progressive=True,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

To disable progressive mode:

result = pipe(..., upscale_factor=4.0, progressive=False, strength=0.2)
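How the passes could be planned, as a hypothetical sketch (the pipeline's exact policy is internal; `plan_passes` is not part of its API): split the total factor into equal passes of at most 2x each, so 4x becomes two 2x passes.

```python
# Hypothetical pass-planning sketch: find the fewest passes that each
# stay at or below max_per_pass, then split the factor evenly among them.
import math

def plan_passes(upscale_factor, max_per_pass=2.0):
    n_passes = max(1, math.ceil(math.log(upscale_factor, max_per_pass)))
    per_pass = upscale_factor ** (1.0 / n_passes)
    return [per_pass] * n_passes
```

Under this scheme a 3x upscale would also run as two passes of about 1.73x each, keeping every pass within the range the model handles well.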

Without ControlNet

Use this when you want the model to add creative detail, and pair it with a lower strength.

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    upscale_factor=2.0,
    strength=0.15,
    auto_strength=False,
    num_inference_steps=20,
    output="images",
)

Scheduler selection

result = pipe(..., scheduler_name="DPM++ 2M Karras")
result = pipe(..., scheduler_name="Euler")
result = pipe(..., scheduler_name="DPM++ 2M")

Parameters

Parameter reference (defaults in parentheses):

  • image (required): Input image (PIL)
  • prompt (default ""): Text prompt
  • upscale_factor (default 2.0): Scale multiplier
  • strength (default 0.3): Denoise strength. Lower = closer to input. Ignored when auto_strength=True
  • num_inference_steps (default 20): Denoising steps
  • guidance_scale (default 7.5): CFG scale
  • latent_tile_size (default 64): Tile size in latent pixels (64 = 512px)
  • latent_overlap (default 16): Tile overlap in latent pixels (16 = 128px)
  • control_image (default None): ControlNet conditioning image. Pass the input image for Tile mode
  • controlnet_conditioning_scale (default 1.0): ControlNet strength
  • negative_prompt (default auto): Defaults to "blurry, low quality, artifacts, noise, jpeg compression"
  • progressive (default True): Split upscale_factor > 2 into multiple 2x passes
  • auto_strength (default True): Auto-scale strength based on upscale factor and pass index
  • use_default_negative (default True): Apply default negative prompt when none is provided
  • scheduler_name (default None): Switch scheduler: "Euler", "DPM++ 2M", "DPM++ 2M Karras"
  • generator (default None): Torch generator for reproducibility
  • output (default "images"): Output key
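The latent-to-pixel relationship behind latent_tile_size and latent_overlap follows from the SDXL VAE's 8x downscale. A quick sketch of how tile counts fall out of these parameters (`tiles_along_axis` is a hypothetical helper, not part of the pipeline):

```python
# SDXL's VAE downsamples by 8x, so latent sizes map to pixels as
# latent_tile_size=64 -> 512 px and latent_overlap=16 -> 128 px.
import math

VAE_SCALE_FACTOR = 8

def tiles_along_axis(latent_dim, tile=64, overlap=16):
    """Tile positions needed to cover one latent axis at the given stride."""
    stride = tile - overlap
    if latent_dim <= tile:
        return 1
    return math.ceil((latent_dim - tile) / stride) + 1
```

For example, a 2048px output side is 256 latent pixels, which needs 5 tile positions per axis at the defaults (25 tiles total per UNet pass).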

Tuning guide

strength — how much the model changes the image.

  • 0.15-0.25: minimal changes, mostly sharpening
  • 0.25-0.35: balanced enhancement (default with auto_strength)
  • 0.4+: significant changes, risk of drift
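Why strength behaves this way: in the usual diffusers img2img convention (a sketch, assuming this pipeline follows it), strength sets how far into the noise schedule the latents start, so only the final fraction of the timesteps is actually denoised.

```python
# Standard img2img step accounting (sketch): strength determines both
# how much noise is added and how many scheduler steps run.
def denoising_steps(num_inference_steps, strength):
    """Number of steps actually executed for a given strength."""
    init_timestep = min(int(num_inference_steps * strength),
                        num_inference_steps)
    return init_timestep
```

With num_inference_steps=20 and strength=0.25, only 5 denoising steps run; raising strength adds more noise and more steps at once, which is why high values drift away from the input.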

latent_tile_size — tile size for MultiDiffusion.

  • 64 (512px): works on most GPUs. Recommended
  • 96 (768px): smoother, needs 24GB+ VRAM
  • Below 64: may produce artifacts due to insufficient context

controlnet_conditioning_scale — ControlNet influence.

  • 1.0: very faithful to input. Recommended
  • 0.7-0.8: slight creative freedom
  • Below 0.5: too weak, causes hallucination

guidance_scale — CFG strength.

  • 3-5: softer, more natural
  • 7.5: standard
  • 10-12: more contrast
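guidance_scale controls classifier-free guidance. The standard formulation (as used by SDXL pipelines) extrapolates from the unconditional prediction toward the prompt-conditioned one:

```python
# Classifier-free guidance: scale 1.0 means "just follow the prompt
# prediction"; higher scales exaggerate the prompt's influence.
import numpy as np

def cfg(noise_uncond, noise_cond, guidance_scale):
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

At scale 1.0 the unconditional term cancels entirely, which is why low scales look softer and high scales push contrast.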

Limitations

  • SDXL is trained on 1024x1024. Tiles smaller than 512px (latent_tile_size < 64) may produce artifacts
  • 4x from very small inputs (below 256px) produces distortion. Use progressive mode and start from at least 256px
  • ControlNet Tile is required for faithful upscaling. Without it, the model hallucinates new content
  • Parameters like guidance_scale, strength, and negative_prompt have subtle visual effects when ControlNet is at scale 1.0. This is by design β€” the upscaler prioritizes faithfulness
  • VRAM: 2x upscale of 512 to 1024 needs ~10GB. 4x progressive needs ~14GB peak. Uses fp16 and VAE tiling automatically
  • Not suitable for upscaling text, line art, or pixel art. Use dedicated upscalers for those

Architecture

MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
  text_encoder      SDXL TextEncoderStep (reused)
  upscale           Lanczos upscale step
  tile_plan         Tile planning step
  input             SDXL InputStep (reused)
  set_timesteps     SDXL Img2Img SetTimestepsStep (reused)
  multidiffusion    MultiDiffusion step
                    - VAE encode full image
                    - Per timestep: UNet on each latent tile, cosine-weighted blend
                    - VAE decode full latents

8 SDXL blocks reused via the public interface; 3 custom blocks added.

Tested on

  • Google Colab T4 (16GB VRAM, fp16)
  • 2x: 512x512 to 1024x1024
  • 4x progressive: 256x256 to 1024x1024