Modular SDXL Upscale

Tiled image upscaling for Stable Diffusion XL using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile boundary artifacts.

Built with Modular Diffusers, composing reusable SDXL blocks into a tiled upscaling workflow with optional ControlNet Tile conditioning.


Install

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors

Requires diffusers from main (modular diffusers support).

Quick start

from diffusers import ModularPipeline, ControlNetModel
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-sdxl-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.float16)

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-tile-sdxl-1.0", torch_dtype=torch.float16
)
pipe.update_components(controlnet=controlnet)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")

How it works

  1. Input image is upscaled to the target resolution using Lanczos interpolation
  2. Upscaled image is encoded to latent space via the SDXL VAE
  3. Noise is added to the latents based on strength
  4. At each denoising timestep, the UNet runs on overlapping latent tiles. Noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
  5. One scheduler step is taken on the full blended prediction
  6. After all timesteps, denoised latents are decoded back to pixel space
  7. For upscale factors above 2x with progressive=True, steps 1-6 repeat as multiple 2x passes

ControlNet Tile is optional but recommended. Without it, the model hallucinates new content instead of enhancing existing detail.
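The tile blending in step 4 can be sketched as follows. This is an illustrative numpy reimplementation of MultiDiffusion-style cosine blending, not the pipeline's actual code; function names are hypothetical.

```python
# Illustrative sketch of MultiDiffusion-style tile blending (not the
# pipeline's actual code): per-tile noise predictions are accumulated
# with boundary-aware cosine weights, then normalized, so overlapping
# tiles agree and no seams appear.
import numpy as np

def cosine_weight(size, overlap):
    """1D weight profile: cosine ramp over each overlap band, flat middle."""
    w = np.ones(size)
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, overlap)))
    w[:overlap] = ramp
    w[size - overlap:] = ramp[::-1]
    return w

def blend_tiles(full_shape, tile_preds, positions, tile_size, overlap):
    """Blend overlapping square tile predictions into one full-size map."""
    acc = np.zeros(full_shape)
    norm = np.zeros(full_shape)
    w1d = cosine_weight(tile_size, overlap)
    w2d = np.outer(w1d, w1d)
    for pred, (y, x) in zip(tile_preds, positions):
        acc[y:y + tile_size, x:x + tile_size] += pred * w2d
        norm[y:y + tile_size, x:x + tile_size] += w2d
    return acc / np.maximum(norm, 1e-8)
```

Because every position is divided by the total weight that landed on it, regions covered by several tiles converge to a weighted average rather than a hard seam.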

Examples

2x upscale with ControlNet Tile

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=2.0,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

4x progressive upscale

Automatically splits the upscale into two 2x passes; auto-strength adjusts the denoising strength for each pass.

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=1.0,
    upscale_factor=4.0,
    progressive=True,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

To disable progressive mode:

result = pipe(..., upscale_factor=4.0, progressive=False, strength=0.2)
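How the passes could be planned, as a hypothetical sketch (the pipeline's exact policy is internal; `plan_passes` is not part of its API): split the total factor into equal passes of at most 2x each, so 4x becomes two 2x passes.

```python
# Hypothetical pass-planning sketch: find the fewest passes that each
# stay at or below max_per_pass, then split the factor evenly among them.
import math

def plan_passes(upscale_factor, max_per_pass=2.0):
    n_passes = max(1, math.ceil(math.log(upscale_factor, max_per_pass)))
    per_pass = upscale_factor ** (1.0 / n_passes)
    return [per_pass] * n_passes
```

Under this scheme a 3x upscale would also run as two passes of about 1.73x each, keeping every pass within the range the model handles well.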

Without ControlNet

Use this when you want the model to add creative detail, and pair it with a lower strength.

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    upscale_factor=2.0,
    strength=0.15,
    auto_strength=False,
    num_inference_steps=20,
    output="images",
)

Scheduler selection

result = pipe(..., scheduler_name="DPM++ 2M Karras")
result = pipe(..., scheduler_name="Euler")
result = pipe(..., scheduler_name="DPM++ 2M")

Parameters

Parameter reference (defaults in parentheses):

  • image (required): Input image (PIL)
  • prompt (default ""): Text prompt
  • upscale_factor (default 2.0): Scale multiplier
  • strength (default 0.3): Denoise strength. Lower = closer to input. Ignored when auto_strength=True
  • num_inference_steps (default 20): Denoising steps
  • guidance_scale (default 7.5): CFG scale
  • latent_tile_size (default 64): Tile size in latent pixels (64 = 512px)
  • latent_overlap (default 16): Tile overlap in latent pixels (16 = 128px)
  • control_image (default None): ControlNet conditioning image. Pass the input image for Tile mode
  • controlnet_conditioning_scale (default 1.0): ControlNet strength
  • negative_prompt (default auto): Defaults to "blurry, low quality, artifacts, noise, jpeg compression"
  • progressive (default True): Split upscale_factor > 2 into multiple 2x passes
  • auto_strength (default True): Auto-scale strength based on upscale factor and pass index
  • use_default_negative (default True): Apply default negative prompt when none is provided
  • scheduler_name (default None): Switch scheduler: "Euler", "DPM++ 2M", "DPM++ 2M Karras"
  • generator (default None): Torch generator for reproducibility
  • output (default "images"): Output key
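The latent-to-pixel relationship behind latent_tile_size and latent_overlap follows from the SDXL VAE's 8x downscale. A quick sketch of how tile counts fall out of these parameters (`tiles_along_axis` is a hypothetical helper, not part of the pipeline):

```python
# SDXL's VAE downsamples by 8x, so latent sizes map to pixels as
# latent_tile_size=64 -> 512 px and latent_overlap=16 -> 128 px.
import math

VAE_SCALE_FACTOR = 8

def tiles_along_axis(latent_dim, tile=64, overlap=16):
    """Tile positions needed to cover one latent axis at the given stride."""
    stride = tile - overlap
    if latent_dim <= tile:
        return 1
    return math.ceil((latent_dim - tile) / stride) + 1
```

For example, a 2048px output side is 256 latent pixels, which needs 5 tile positions per axis at the defaults (25 tiles total per UNet pass).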

Tuning guide

strength — how much the model changes the image.

  • 0.15-0.25: minimal changes, mostly sharpening
  • 0.25-0.35: balanced enhancement (default with auto_strength)
  • 0.4+: significant changes, risk of drift
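Why strength behaves this way: in the usual diffusers img2img convention (a sketch, assuming this pipeline follows it), strength sets how far into the noise schedule the latents start, so only the final fraction of the timesteps is actually denoised.

```python
# Standard img2img step accounting (sketch): strength determines both
# how much noise is added and how many scheduler steps run.
def denoising_steps(num_inference_steps, strength):
    """Number of steps actually executed for a given strength."""
    init_timestep = min(int(num_inference_steps * strength),
                        num_inference_steps)
    return init_timestep
```

With num_inference_steps=20 and strength=0.25, only 5 denoising steps run; raising strength adds more noise and more steps at once, which is why high values drift away from the input.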

latent_tile_size — tile size for MultiDiffusion.

  • 64 (512px): works on most GPUs. Recommended
  • 96 (768px): smoother, needs 24GB+ VRAM
  • Below 64: may produce artifacts due to insufficient context

controlnet_conditioning_scale — ControlNet influence.

  • 1.0: very faithful to input. Recommended
  • 0.7-0.8: slight creative freedom
  • Below 0.5: too weak, causes hallucination

guidance_scale — CFG strength.

  • 3-5: softer, more natural
  • 7.5: standard
  • 10-12: more contrast
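guidance_scale controls classifier-free guidance. The standard formulation (as used by SDXL pipelines) extrapolates from the unconditional prediction toward the prompt-conditioned one:

```python
# Classifier-free guidance: scale 1.0 means "just follow the prompt
# prediction"; higher scales exaggerate the prompt's influence.
import numpy as np

def cfg(noise_uncond, noise_cond, guidance_scale):
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```

At scale 1.0 the unconditional term cancels entirely, which is why low scales look softer and high scales push contrast.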

Limitations

  • SDXL is trained on 1024x1024. Tiles smaller than 512px (latent_tile_size < 64) may produce artifacts
  • 4x from very small inputs (below 256px) produces distortion. Use progressive mode and start from at least 256px
  • ControlNet Tile is required for faithful upscaling. Without it, the model hallucinates new content
  • Parameters like guidance_scale, strength, and negative_prompt have subtle visual effects when ControlNet is at scale 1.0. This is by design β€” the upscaler prioritizes faithfulness
  • VRAM: 2x upscale of 512 to 1024 needs ~10GB. 4x progressive needs ~14GB peak. Uses fp16 and VAE tiling automatically
  • Not suitable for upscaling text, line art, or pixel art. Use dedicated upscalers for those

Architecture

MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
  text_encoder      SDXL TextEncoderStep (reused)
  upscale           Lanczos upscale step
  tile_plan         Tile planning step
  input             SDXL InputStep (reused)
  set_timesteps     SDXL Img2Img SetTimestepsStep (reused)
  multidiffusion    MultiDiffusion step
                    - VAE encode full image
                    - Per timestep: UNet on each latent tile, cosine-weighted blend
                    - VAE decode full latents

8 SDXL blocks reused via the public interface; 3 custom blocks added.

Tested on

  • Google Colab T4 (16GB VRAM, fp16)
  • 2x: 512x512 to 1024x1024
  • 4x progressive: 256x256 to 1024x1024