---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- z-image
- upscale
- tiling
- multidiffusion
---
# Modular Z-Image Upscale

Tiled image upscaling for Z-Image using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile-boundary artifacts.

Built with Modular Diffusers, composing reusable Z-Image blocks into a tiled upscaling workflow.
## Install

```bash
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors
```

Requires diffusers from `main` (modular diffusers support).
## Quick start

```python
from diffusers import ModularPipeline
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-zimage-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")
```
## How it works

- The input image is upscaled to the target resolution using Lanczos interpolation
- The upscaled image is encoded to latent space via the Z-Image VAE
- Noise is added to the latents based on `strength`
- At each denoising timestep, the transformer runs on overlapping latent tiles; noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
- One scheduler step is taken on the full blended prediction
- After all timesteps, the denoised latents are decoded back to pixel space
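The boundary-aware blending step above can be sketched in a few lines. This is a minimal illustration of cosine-weighted tile blending in the spirit of MultiDiffusion, not the pipeline's actual internals; the function names and the exact ramp shape are assumptions.

```python
import numpy as np

def cosine_tile_weights(tile_size: int, overlap: int) -> np.ndarray:
    """2D weight map that ramps 0 -> 1 over the overlap region with a
    half-cosine, so adjacent tiles cross-fade smoothly where they meet."""
    ramp = np.ones(tile_size)
    t = (np.arange(overlap) + 0.5) / overlap      # samples in (0, 1)
    edge = 0.5 * (1.0 - np.cos(np.pi * t))        # half-cosine 0 -> 1
    ramp[:overlap] = edge
    ramp[-overlap:] = edge[::-1]
    return np.outer(ramp, ramp)                   # separable 2D map

def blend_tiles(tiles, coords, shape, tile_size, overlap):
    """Accumulate per-tile predictions weighted by the cosine map, then
    normalize by the summed weights (the MultiDiffusion blend)."""
    acc = np.zeros(shape)
    wsum = np.zeros(shape)
    w = cosine_tile_weights(tile_size, overlap)
    for tile, (y, x) in zip(tiles, coords):
        acc[y:y + tile_size, x:x + tile_size] += tile * w
        wsum[y:y + tile_size, x:x + tile_size] += w
    return acc / np.maximum(wsum, 1e-8)
```

Because the accumulated weights are normalized, regions covered by a single tile pass through unchanged, while overlap regions become a smooth cross-fade — which is why no hard seams appear at tile boundaries.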
## Examples

### 2x upscale

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
### 4x upscale

For Z-Image, single-pass 4x tends to be more faithful than progressive multi-pass.

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=4.0,
    progressive=False,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
### With ControlNet (optional)

Z-Image's ControlNet Union is general-purpose and not tile-specific. In testing, results without ControlNet were comparable or closer to the input. ControlNet is supported but not required.

```python
from diffusers.models.controlnets import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
controlnet = ZImageControlNetModel.from_transformer(controlnet, pipe.transformer)
pipe.update_components(controlnet=controlnet)

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=0.75,
    scale_factor=2.0,
    output="images",
)
```
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `image` | required | Input image (PIL) |
| `prompt` | `""` | Text prompt |
| `negative_prompt` | `None` | Negative text prompt. Limited effect since CFG is disabled by default |
| `scale_factor` | `2.0` | Scale multiplier |
| `strength` | `0.4` | Denoise strength. Lower = closer to input |
| `num_inference_steps` | `8` | Denoising steps. Z-Image Turbo converges quickly; 4-8 is sufficient |
| `tile_size` | `64` | Tile size in latent pixels |
| `tile_overlap` | `8` | Tile overlap in latent pixels |
| `control_image` | `None` | ControlNet conditioning image (optional) |
| `controlnet_conditioning_scale` | `0.75` | ControlNet strength |
| `progressive` | `True` | Split upscales with `scale_factor` > 2 into multiple passes. For Z-Image, `False` often works better |
| `auto_strength` | `True` | Auto-scale strength based on upscale factor and pass index |
| `generator` | `None` | Torch generator for reproducibility |
| `output` | `"images"` | Output key |
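To make the interaction between `tile_size` and `tile_overlap` concrete, here is one plausible way overlapping tile origins could be laid out over a latent grid. The helper is hypothetical, written only to illustrate the geometry, not taken from the pipeline:

```python
def tile_starts(extent: int, tile_size: int, overlap: int) -> list[int]:
    """Start offsets so consecutive tiles overlap by `overlap` latent
    pixels, with the last tile clamped flush to the latent boundary."""
    if extent <= tile_size:
        return [0]                      # one tile covers everything
    stride = tile_size - overlap
    starts = list(range(0, extent - tile_size, stride))
    starts.append(extent - tile_size)   # final tile flush with the edge
    return starts
```

For example, a 1024x1024 output maps to a 128x128 latent (assuming the usual 8x VAE downsampling), and with the defaults (`tile_size=64`, `tile_overlap=8`) this scheme yields origins `[0, 56, 64]` along each axis.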
## Observations

- Z-Image Turbo is distilled for fast inference. Results converge by 4 steps, and 4 vs 16 steps produce nearly identical output
- Strength, ControlNet scale, and negative prompt have subtle effects with this model; the distillation means it converges quickly regardless of parameter tuning
- Z-Image's ControlNet Union is general-purpose, not tile-specific like SDXL's. In testing, running without ControlNet produced results as close or closer to the input
- Progressive mode compounds drift across passes; single-pass 4x is more faithful for Z-Image
- No visible tile seams at any tested tile size; MultiDiffusion blending works cleanly
- A gradient stress test passes with no banding or artifacts
## Limitations

- Z-Image Turbo (6B) needs a GPU with sufficient VRAM
- Z-Image's ControlNet Union is general-purpose and may not improve upscaling faithfulness compared to running without ControlNet. For faithful results, use lower strength values
- Progressive mode compounds drift across passes. Use `progressive=False` for more faithful results
- Tiles smaller than 32 latent pixels may produce artifacts
- 4x from very small inputs (below 256px) produces distortion; start from at least 256px
- Z-Image Turbo's CFG is disabled by default, so negative prompts have limited effect
- Not suitable for upscaling text, line art, or pixel art
## Architecture

```
MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
├── text_encoder     ZImageTextEncoderStep (reused)
├── upscale          ZImageUpscaleStep (Lanczos)
└── multidiffusion   ZImageMultiDiffusionStep
      - VAE encode full image
      - per timestep: transformer on each latent tile (+ optional ControlNet), cosine-weighted blend
      - VAE decode full latents
```
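The control flow of the multidiffusion step can be sketched as hedged pseudocode. Every callable below (`vae_encode`, `transformer`, `scheduler_step`, `blend`, `add_noise`, `vae_decode`) is a stand-in for the real component, so this shows only the loop structure described above, not the block's actual code:

```python
def tiled_upscale(image, timesteps, tiles, vae_encode, transformer,
                  scheduler_step, blend, add_noise, vae_decode):
    # encode once, then noise to the starting timestep chosen by strength
    latents = add_noise(vae_encode(image), timesteps[0])
    for t in timesteps:
        # run the transformer on each overlapping latent tile
        preds = [transformer(latents[..., y0:y1, x0:x1], t)
                 for (y0, y1, x0, x1) in tiles]
        # cosine-weighted MultiDiffusion blend into one full prediction
        noise_pred = blend(preds, tiles)
        # one scheduler step on the full blended prediction
        latents = scheduler_step(noise_pred, t, latents)
    return vae_decode(latents)
```

The key design point is that tiling happens only inside the transformer call: the scheduler always steps on a single full-resolution latent, which keeps the trajectory globally consistent across tiles.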
## Models

- Base: Tongyi-MAI/Z-Image-Turbo
- ControlNet (optional): alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union
## References

- MultiDiffusion (Bar-Tal et al., 2023) - tiled latent-space blending algorithm
- Modular Diffusers - the Hugging Face framework this pipeline is built on
- Modular Diffusers contribution call
## Tested on

- Google Colab A100 (80GB VRAM, bfloat16)
- 2x: 512x512 to 1024x1024
- 4x: 256x256 to 1024x1024