Modular Z-Image Upscale

Tiled image upscaling for Z-Image using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile boundary artifacts.

Built with Modular Diffusers, composing reusable Z-Image blocks into a tiled upscaling workflow.


Install

pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors

Requires diffusers installed from the main branch, which includes Modular Diffusers support.

Quick start

from diffusers import ModularPipeline
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-zimage-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")

How it works

  1. Input image is upscaled to the target resolution using Lanczos interpolation
  2. Upscaled image is encoded to latent space via the Z-Image VAE
  3. Noise is added to the latents based on strength
  4. At each denoising timestep, the transformer runs on overlapping latent tiles. Noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
  5. One scheduler step is taken on the full blended prediction
  6. After all timesteps, denoised latents are decoded back to pixel space
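Steps 4 and 5 can be sketched in plain NumPy. This is a minimal illustration of MultiDiffusion-style cosine-weighted blending, not the pipeline's actual implementation; the exact window shape and tile placement used here are assumptions:

```python
import numpy as np

def cosine_window(size: int) -> np.ndarray:
    # Raised-cosine 1D weight, peaked at the tile center, falling off at edges.
    x = (np.arange(size) + 0.5) / size
    return 0.5 - 0.5 * np.cos(2 * np.pi * x)

def blend_tiles(latent: np.ndarray, denoise_fn, tile: int = 64, overlap: int = 8):
    """One MultiDiffusion step: run denoise_fn on overlapping latent tiles,
    accumulate cosine-weighted predictions, then normalize by total weight."""
    h, w = latent.shape[-2:]
    assert h >= tile and w >= tile, "latent must be at least one tile"
    stride = tile - overlap
    out = np.zeros_like(latent)
    weight = np.zeros((h, w))
    win = np.outer(cosine_window(tile), cosine_window(tile))
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            y0, x0 = min(y, h - tile), min(x, w - tile)  # clamp the last tile
            pred = denoise_fn(latent[..., y0:y0 + tile, x0:x0 + tile])
            out[..., y0:y0 + tile, x0:x0 + tile] += pred * win
            weight[y0:y0 + tile, x0:x0 + tile] += win
    return out / np.maximum(weight, 1e-8)  # boundary-aware weighted average
```

With an identity denoiser the blend returns the input unchanged, which is why overlapping tiles leave no seams: wherever tiles agree, the weighted average is exact.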

Examples

2x upscale

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

4x upscale

For Z-Image, single-pass 4x tends to be more faithful than progressive multi-pass.

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=4.0,
    progressive=False,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)

With ControlNet (optional)

Z-Image's ControlNet Union is general-purpose and not tile-specific. In testing, results without ControlNet were comparable or closer to the input. ControlNet is supported but not required.

from diffusers.models.controlnets import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
controlnet = ZImageControlNetModel.from_transformer(controlnet, pipe.transformer)
pipe.update_components(controlnet=controlnet)

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=0.75,
    scale_factor=2.0,
    output="images",
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| image | required | Input image (PIL) |
| prompt | "" | Text prompt |
| negative_prompt | None | Negative text prompt. Limited effect since CFG is disabled by default |
| scale_factor | 2.0 | Scale multiplier |
| strength | 0.4 | Denoise strength. Lower = closer to input |
| num_inference_steps | 8 | Denoising steps. Z-Image Turbo converges quickly; 4-8 is sufficient |
| tile_size | 64 | Tile size in latent pixels |
| tile_overlap | 8 | Tile overlap in latent pixels |
| control_image | None | ControlNet conditioning image (optional) |
| controlnet_conditioning_scale | 0.75 | ControlNet strength |
| progressive | True | Split scale_factor > 2 into multiple passes. For Z-Image, False often works better |
| auto_strength | True | Auto-scale strength based on upscale factor and pass index |
| generator | None | Torch generator for reproducibility |
| output | "images" | Output key |
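As a worked example of how tile_size and tile_overlap interact: assuming an 8x-downsampling VAE (an assumption here, typical for latent diffusion models), a 2x upscale of a 512px image yields 128x128 latents, which a 64-pixel tile with 8-pixel overlap covers in a 3x3 grid.

```python
import math

def tile_count(latent_size: int, tile: int = 64, overlap: int = 8) -> int:
    # Number of overlapping tiles needed to cover one latent axis.
    if latent_size <= tile:
        return 1
    stride = tile - overlap
    return math.ceil((latent_size - tile) / stride) + 1

# 512px input, 2x upscale, assumed 8x VAE downsampling -> 128x128 latents
lat = (512 * 2) // 8
n = tile_count(lat)
print(n * n)  # 9 transformer calls per denoising timestep
```

Larger overlaps smooth transitions at the cost of more transformer calls per step.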

Observations

  • Z-Image Turbo is distilled for fast inference. Results converge by 4 steps, and 4 vs 16 steps produce nearly identical output
  • Strength, ControlNet scale, and negative prompt have subtle effects with this model. The distilled nature means it converges quickly regardless of parameter tuning
  • Z-Image's ControlNet Union is general-purpose, not tile-specific like SDXL's. In testing, running without ControlNet produced results as close or closer to the input
  • Progressive mode compounds drift across passes. Single-pass 4x is more faithful for Z-Image
  • No visible tile seams at any tile size. MultiDiffusion blending works cleanly
  • Gradient stress test passes with no banding or artifacts

Limitations

  • Z-Image Turbo is a 6B-parameter model and needs a GPU with substantial VRAM (tested on an 80GB A100 in bfloat16)
  • Z-Image's ControlNet Union is general-purpose and may not improve upscaling faithfulness compared to running without ControlNet. For faithful results, use lower strength values
  • Progressive mode compounds drift across passes. Use progressive=False for more faithful results
  • Tiles smaller than 32 latent pixels may produce artifacts
  • 4x from very small inputs (below 256px) produces distortion. Start from at least 256px
  • Z-Image Turbo's CFG is disabled by default. Negative prompts have limited effect
  • Not suitable for upscaling text, line art, or pixel art
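Several of these limits can be caught before running the pipeline. The helper below is hypothetical (not part of this repository) and simply encodes the thresholds listed above:

```python
def check_inputs(width: int, height: int, scale_factor: float,
                 tile_size: int = 64) -> None:
    # Hypothetical pre-flight guard for the limitations listed above.
    if tile_size < 32:
        raise ValueError("tiles below 32 latent pixels may produce artifacts")
    if scale_factor >= 4.0 and min(width, height) < 256:
        raise ValueError("4x from inputs below 256px produces distortion; "
                         "start from at least 256px")

check_inputs(512, 512, 2.0)  # passes silently
```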

Architecture

MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
  text_encoder    ZImageTextEncoderStep (reused)
  upscale         ZImageUpscaleStep (Lanczos)
  multidiffusion  ZImageMultiDiffusionStep
                  - VAE encode full image
                  - Per timestep: transformer on each latent tile (+optional ControlNet), cosine-weighted blend
                  - VAE decode full latents
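The composition pattern can be illustrated with a toy sequential runner. This is a schematic only: the real implementation uses Modular Diffusers' SequentialPipelineBlocks, and the stand-in steps below are placeholders, not the actual block classes:

```python
from typing import Callable

class SequentialBlocks:
    """Toy stand-in for SequentialPipelineBlocks: blocks run in order,
    each reading inputs from and writing outputs to a shared state dict."""
    def __init__(self, blocks: list[Callable[[dict], dict]]):
        self.blocks = blocks

    def __call__(self, state: dict) -> dict:
        for block in self.blocks:
            state = block(state)
        return state

def upscale_step(state: dict) -> dict:         # placeholder for ZImageUpscaleStep
    state["size"] = int(state["size"] * state["scale_factor"])
    return state

def multidiffusion_step(state: dict) -> dict:  # placeholder for ZImageMultiDiffusionStep
    state["tiled"] = True
    return state

pipe = SequentialBlocks([upscale_step, multidiffusion_step])
out = pipe({"size": 512, "scale_factor": 2.0})
print(out["size"], out["tiled"])  # 1024 True
```

Because blocks communicate only through shared state, the same text_encoder and upscale steps can be reused across different workflows.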


Tested on

  • Google Colab A100 (80GB VRAM, bfloat16)
  • 2x: 512x512 to 1024x1024
  • 4x: 256x256 to 1024x1024