---
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- z-image
- upscale
- tiling
- multidiffusion
---
# Modular Z-Image Upscale

Tiled image upscaling for Z-Image using MultiDiffusion latent-space blending. Produces seamless upscaled output without tile-boundary artifacts.

Built with Modular Diffusers, composing reusable Z-Image blocks into a tiled upscaling workflow.
## Install

```bash
pip install git+https://github.com/huggingface/diffusers.git transformers accelerate safetensors
```

Requires diffusers from `main` (modular diffusers support).
## Quick start

```python
from diffusers import ModularPipeline
import torch

pipe = ModularPipeline.from_pretrained(
    "akshan-main/modular-zimage-upscale",
    trust_remote_code=True,
)
pipe.load_components(torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = ...  # your PIL image

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
result[0].save("upscaled.png")
```
## How it works

- The input image is upscaled to the target resolution using Lanczos interpolation
- The upscaled image is encoded to latent space via the Z-Image VAE
- Noise is added to the latents based on `strength`
- At each denoising timestep, the transformer runs on overlapping latent tiles; noise predictions from all tiles are blended using boundary-aware cosine weights (MultiDiffusion)
- One scheduler step is taken on the full blended prediction
- After all timesteps, the denoised latents are decoded back to pixel space
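The boundary-aware blending step above can be sketched in a few lines. This is a minimal illustration of cosine-weighted tile blending in the spirit of MultiDiffusion, not the pipeline's actual internals; the function names and the exact ramp shape are assumptions.

```python
import numpy as np

def cosine_tile_weights(tile_size: int, overlap: int) -> np.ndarray:
    """2D weight map that ramps 0 -> 1 over the overlap region with a
    half-cosine, so adjacent tiles cross-fade smoothly where they meet."""
    ramp = np.ones(tile_size)
    t = (np.arange(overlap) + 0.5) / overlap      # samples in (0, 1)
    edge = 0.5 * (1.0 - np.cos(np.pi * t))        # half-cosine 0 -> 1
    ramp[:overlap] = edge
    ramp[-overlap:] = edge[::-1]
    return np.outer(ramp, ramp)                   # separable 2D map

def blend_tiles(tiles, coords, shape, tile_size, overlap):
    """Accumulate per-tile predictions weighted by the cosine map, then
    normalize by the summed weights (the MultiDiffusion blend)."""
    acc = np.zeros(shape)
    wsum = np.zeros(shape)
    w = cosine_tile_weights(tile_size, overlap)
    for tile, (y, x) in zip(tiles, coords):
        acc[y:y + tile_size, x:x + tile_size] += tile * w
        wsum[y:y + tile_size, x:x + tile_size] += w
    return acc / np.maximum(wsum, 1e-8)
```

Because the accumulated weights are normalized, regions covered by a single tile pass through unchanged, while overlap regions become a smooth cross-fade — which is why no hard seams appear at tile boundaries.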
## Examples

### 2x upscale

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=2.0,
    num_inference_steps=8,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
### 4x upscale

For Z-Image, single-pass 4x tends to be more faithful than progressive multi-pass.

```python
result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    scale_factor=4.0,
    progressive=False,
    generator=torch.Generator("cuda").manual_seed(42),
    output="images",
)
```
### With ControlNet (optional)

Z-Image's ControlNet Union is general-purpose and not tile-specific. In testing, results without ControlNet were comparable or closer to the input. ControlNet is supported but not required.

```python
from diffusers.models.controlnets import ZImageControlNetModel
from huggingface_hub import hf_hub_download

controlnet = ZImageControlNetModel.from_single_file(
    hf_hub_download(
        "alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union",
        filename="Z-Image-Turbo-Fun-Controlnet-Union.safetensors",
    ),
    torch_dtype=torch.bfloat16,
)
controlnet = ZImageControlNetModel.from_transformer(controlnet, pipe.transformer)
pipe.update_components(controlnet=controlnet)

result = pipe(
    prompt="high quality, detailed, sharp",
    image=image,
    control_image=image,
    controlnet_conditioning_scale=0.75,
    scale_factor=2.0,
    output="images",
)
```
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `image` | required | Input image (PIL) |
| `prompt` | `""` | Text prompt |
| `negative_prompt` | `None` | Negative text prompt. Limited effect since CFG is disabled by default |
| `scale_factor` | `2.0` | Scale multiplier |
| `strength` | `0.4` | Denoise strength. Lower = closer to input |
| `num_inference_steps` | `8` | Denoising steps. Z-Image Turbo converges quickly; 4-8 is sufficient |
| `tile_size` | `64` | Tile size in latent pixels |
| `tile_overlap` | `8` | Tile overlap in latent pixels |
| `control_image` | `None` | ControlNet conditioning image (optional) |
| `controlnet_conditioning_scale` | `0.75` | ControlNet strength |
| `progressive` | `True` | Split upscales with `scale_factor` > 2 into multiple passes. For Z-Image, `False` often works better |
| `auto_strength` | `True` | Auto-scale strength based on upscale factor and pass index |
| `generator` | `None` | Torch generator for reproducibility |
| `output` | `"images"` | Output key |
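To make the interaction between `tile_size` and `tile_overlap` concrete, here is one plausible way overlapping tile origins could be laid out over a latent grid. The helper is hypothetical, written only to illustrate the geometry, not taken from the pipeline:

```python
def tile_starts(extent: int, tile_size: int, overlap: int) -> list[int]:
    """Start offsets so consecutive tiles overlap by `overlap` latent
    pixels, with the last tile clamped flush to the latent boundary."""
    if extent <= tile_size:
        return [0]                      # one tile covers everything
    stride = tile_size - overlap
    starts = list(range(0, extent - tile_size, stride))
    starts.append(extent - tile_size)   # final tile flush with the edge
    return starts
```

For example, a 1024x1024 output maps to a 128x128 latent (assuming the usual 8x VAE downsampling), and with the defaults (`tile_size=64`, `tile_overlap=8`) this scheme yields origins `[0, 56, 64]` along each axis.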
## Observations

- Z-Image Turbo is distilled for fast inference. Results converge by 4 steps, and 4 vs 16 steps produce nearly identical output
- Strength, ControlNet scale, and negative prompt have subtle effects with this model; the distillation means it converges quickly regardless of parameter tuning
- Z-Image's ControlNet Union is general-purpose, not tile-specific like SDXL's. In testing, running without ControlNet produced results as close or closer to the input
- Progressive mode compounds drift across passes; single-pass 4x is more faithful for Z-Image
- No visible tile seams at any tested tile size; MultiDiffusion blending works cleanly
- A gradient stress test passes with no banding or artifacts
## Limitations

- Z-Image Turbo (6B) needs a GPU with sufficient VRAM
- Z-Image's ControlNet Union is general-purpose and may not improve upscaling faithfulness compared to running without ControlNet. For faithful results, use lower strength values
- Progressive mode compounds drift across passes. Use `progressive=False` for more faithful results
- Tiles smaller than 32 latent pixels may produce artifacts
- 4x from very small inputs (below 256px) produces distortion; start from at least 256px
- Z-Image Turbo's CFG is disabled by default, so negative prompts have limited effect
- Not suitable for upscaling text, line art, or pixel art
## Architecture

```
MultiDiffusionUpscaleBlocks (SequentialPipelineBlocks)
├── text_encoder     ZImageTextEncoderStep (reused)
├── upscale          ZImageUpscaleStep (Lanczos)
└── multidiffusion   ZImageMultiDiffusionStep
      - VAE encode full image
      - per timestep: transformer on each latent tile (+ optional ControlNet), cosine-weighted blend
      - VAE decode full latents
```
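The control flow of the multidiffusion step can be sketched as hedged pseudocode. Every callable below (`vae_encode`, `transformer`, `scheduler_step`, `blend`, `add_noise`, `vae_decode`) is a stand-in for the real component, so this shows only the loop structure described above, not the block's actual code:

```python
def tiled_upscale(image, timesteps, tiles, vae_encode, transformer,
                  scheduler_step, blend, add_noise, vae_decode):
    # encode once, then noise to the starting timestep chosen by strength
    latents = add_noise(vae_encode(image), timesteps[0])
    for t in timesteps:
        # run the transformer on each overlapping latent tile
        preds = [transformer(latents[..., y0:y1, x0:x1], t)
                 for (y0, y1, x0, x1) in tiles]
        # cosine-weighted MultiDiffusion blend into one full prediction
        noise_pred = blend(preds, tiles)
        # one scheduler step on the full blended prediction
        latents = scheduler_step(noise_pred, t, latents)
    return vae_decode(latents)
```

The key design point is that tiling happens only inside the transformer call: the scheduler always steps on a single full-resolution latent, which keeps the trajectory globally consistent across tiles.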
## Models

- Base: Tongyi-MAI/Z-Image-Turbo
- ControlNet (optional): alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union
## References

- MultiDiffusion (Bar-Tal et al., 2023) - tiled latent-space blending algorithm
- Modular Diffusers - the Hugging Face framework this pipeline is built on
- Modular Diffusers contribution call
## Tested on

- Google Colab A100 (80GB VRAM, bfloat16)
- 2x: 512x512 to 1024x1024
- 4x: 256x256 to 1024x1024