Try this for 25% faster generation

#16
by ykarout - opened

Tested on Blackwell (RTX 5080), 25% faster than native SDPA:
┌───────────────────────┬────────────┬──────────┐
│ Backend               │ Total time │ Per step │
├───────────────────────┼────────────┼──────────┤
│ Native SDPA (default) │ 208.49s    │ ~4.17s   │
├───────────────────────┼────────────┼──────────┤
│ Flash SDPA            │ 156.67s    │ ~3.13s   │
└───────────────────────┴────────────┴──────────┘
Flash SDPA is ~25% faster — saved about 52 seconds on a 50-step Full HD generation.

Use this code:
```python
import torch
from diffusers import ZImagePipeline
from diffusers.models.attention_dispatch import attention_backend

# Load the pipeline

pipe = ZImagePipeline.from_pretrained(
"Tongyi-MAI/Z-Image",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
)
pipe.enable_model_cpu_offload()

# Generate image

prompt = "Two young Asian women stand close together against a backdrop of a plain gray textured wall, possibly an indoor carpeted floor. The woman on the left has long, curly hair, wears a navy blue sweater with cream-colored ruffles on the left sleeve, a white stand-up collar shirt underneath, and white trousers; she wears small gold earrings"
negative_prompt = ""  # Optional; helpful when you want to suppress unwanted content

with attention_backend("_native_flash"):
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=1920,
        width=1088,
        cfg_normalization=False,
        num_inference_steps=50,
        guidance_scale=4,
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]

image.save("example.png")
```
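To reproduce the per-backend numbers on your own hardware, you can wrap each run in a small timing helper. This is a minimal sketch; `time_call` is a hypothetical name, not part of diffusers, and for accurate GPU timings you should call `torch.cuda.synchronize()` before reading the clock so pending kernels are counted.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Cheap stand-in callable; with the pipeline you would do something like:
#   with attention_backend("_native_flash"):
#       output, secs = time_call(pipe, prompt=prompt, num_inference_steps=50)
result, secs = time_call(sum, range(1_000_000))
print(result, f"{secs:.4f}s")
```

Run each backend a couple of times and discard the first run, since compilation and memory-allocator warm-up inflate it.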

For the base model, bf16 works better than fp16: generated images come out more accurate. From my testing (I'm not fully certain, going from memory), fp16 seems fine for the Turbo model.
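The bf16-vs-fp16 gap comes down to exponent bits: bfloat16 keeps float32's 8-bit exponent (range up to ~3.4e38) while float16 has only 5 (max 65504), so large activations overflow in fp16 long before they trouble bf16. A pure-Python sketch of that range difference; the `to_bf16` helper here is an illustration that truncates a float32 bit pattern (no rounding), not anything from torch or diffusers:

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate bfloat16: keep the top 16 bits of the float32 bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 70000.0               # above fp16's maximum of 65504
print(to_bf16(value))         # representable in bf16, just coarser: 69632.0
try:
    struct.pack("<e", value)  # "e" is IEEE half precision (fp16)
except OverflowError:
    print("fp16 overflows")
```

The trade-off is precision: bf16 has only 7 mantissa bits to fp16's 10, which is why fp16 can still look fine on models whose activations stay in a small range.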
