Paper: MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (arXiv 2302.08113)
Tiled image upscaling for Z-Image using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile boundary artifacts.
Built with Modular Diffusers, composing reusable Z-Image blocks into a tiled upscaling workflow.
Requires `diffusers` from `main` (modular diffusers support):

```
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors
```
```python
import torch
from diffusers import ModularPipeline

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-zimage-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # your PIL image
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")
```
For Z-Image, a single 4x pass tends to be more faithful than progressive multi-pass upscaling:

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=4.0,
    progressive=False,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
Z-Image's ControlNet Union is general-purpose rather than tile-specific. In testing, results without ControlNet were comparable, and often closer to the input. ControlNet is therefore supported but not required.
```python
from diffusers.models.controlnets import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
controlnet = ZImageControlNetModel.from_transformer(controlnet, pipe.transformer)
pipe.update_components(controlnet=controlnet)

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=0.75,
    scale_factor=2.0,
    output="images",
)
```
| Parameter | Default | Description |
|---|---|---|
| `image` | required | Input image (PIL) |
| `prompt` | `""` | Text prompt |
| `negative_prompt` | `None` | Negative text prompt; limited effect since CFG is disabled by default |
| `scale_factor` | `2.0` | Scale multiplier |
| `strength` | `0.4` | Denoise strength; lower = closer to input |
| `num_inference_steps` | `8` | Denoising steps; Z-Image Turbo converges quickly, 4-8 is sufficient |
| `tile_size` | `64` | Tile size in latent pixels |
| `tile_overlap` | `8` | Tile overlap in latent pixels |
| `control_image` | `None` | ControlNet conditioning image (optional) |
| `controlnet_conditioning_scale` | `0.75` | ControlNet strength |
| `progressive` | `True` | Split upscale factors > 2 into multiple passes; for Z-Image, `False` often works better |
| `auto_strength` | `True` | Auto-scale strength based on upscale factor and pass index |
| `generator` | `None` | Torch generator for reproducibility |
| `output` | `"images"` | Output key |
The workflow is composed as:

```
MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
├── text_encoder      ZImageTextEncoderStep (reused)
├── upscale           ZImageUpscaleStep (Lanczos)
└── multidiffusion    ZImageMultiDiffusionStep
      - VAE encode the full image
      - per timestep: run the transformer on each latent tile (+ optional ControlNet), cosine-weighted blend
      - VAE decode the full latents
```
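The cosine-weighted blend in the last step can be sketched as follows. This is a minimal illustration under my own assumptions (the `cosine_weight` and `blend_tiles` helpers are hypothetical, not the block's actual code): each tile's denoised latents are accumulated under a raised-cosine weight that peaks at the tile center and fades toward the borders, then the sum is normalized by the accumulated weights, so overlapping tiles transition smoothly instead of leaving seams.

```python
import torch

# Hypothetical sketch of cosine-weighted tile blending in latent space;
# the actual ZImageMultiDiffusionStep implementation may differ.
def cosine_weight(tile_h: int, tile_w: int) -> torch.Tensor:
    # Separable raised-cosine window: near-zero at the borders, 1 at the center.
    wy = 0.5 - 0.5 * torch.cos(2 * torch.pi * (torch.arange(tile_h) + 0.5) / tile_h)
    wx = 0.5 - 0.5 * torch.cos(2 * torch.pi * (torch.arange(tile_w) + 0.5) / tile_w)
    return wy[:, None] * wx[None, :]  # shape (tile_h, tile_w)

def blend_tiles(tiles, coords, latent_shape):
    """Merge per-tile latents (each B,C,h,w) at (y, x) offsets into one tensor."""
    out = torch.zeros(latent_shape)
    weight = torch.zeros(latent_shape[-2:])
    for tile, (y, x) in zip(tiles, coords):
        w = cosine_weight(tile.shape[-2], tile.shape[-1])
        out[..., y:y + tile.shape[-2], x:x + tile.shape[-1]] += tile * w
        weight[y:y + tile.shape[-2], x:x + tile.shape[-1]] += w
    # Normalize by total weight so overlapping regions average smoothly.
    return out / weight.clamp_min(1e-8)
```

Because every covered latent pixel is divided by its summed weight, two tiles that agree in their overlap reproduce their shared value exactly, while disagreements fade in proportionally to each tile's window.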