Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
MLX conversion of Tongyi-MAI/Z-Image-Turbo for Apple Silicon.
This is the full-precision float16 MLX conversion.
Model size: 20.54 GB
| Variant | Size | Quantization | Link |
|---|---|---|---|
| Full Precision (fp16) | 20.54 GB | None | andrevp/Z-Image-Turbo-MLX |
| 8-bit | 11.37 GB | 8-bit, group_size=64 | andrevp/Z-Image-Turbo-MLX-8bit |
| 4-bit | 6.48 GB | 4-bit, group_size=64 | andrevp/Z-Image-Turbo-MLX-4bit |
| 2-bit | 4.04 GB | 2-bit, group_size=64 | andrevp/Z-Image-Turbo-MLX-2bit |
Z-Image is an efficient 6B-parameter image generation foundation model using a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. Z-Image-Turbo is the distilled variant with only 8 NFEs (Number of Function Evaluations), achieving sub-second inference latency.
| Component | Architecture | Size (fp16) |
|---|---|---|
| Text Encoder | Qwen3 (36 layers, hidden_size=2560, GQA with 32/8 heads) | ~7.8 GB |
| Transformer | ZImageTransformer2DModel (30 layers, dim=3840, 30 heads) | ~12.3 GB |
| VAE | AutoencoderKL (from Flux, 16 latent channels) | ~160 MB |
| Tokenizer | Qwen2Tokenizer (vocab_size=151,936) | — |
| Scheduler | FlowMatchEulerDiscreteScheduler | — |
The S3-DiT architecture concatenates text tokens, visual semantic tokens, and image VAE tokens at the sequence level as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
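Schematically, that unified stream is a plain sequence-level concatenation. A toy sketch with made-up sequence lengths (only `dim=3840` comes from the transformer table above; this is not the actual S3-DiT code):

```python
import numpy as np

# Toy shapes: batch=1, model width matching the transformer's dim=3840.
dim = 3840
text_tokens = np.zeros((1, 77, dim))      # projected text-encoder states
semantic_tokens = np.zeros((1, 32, dim))  # visual semantic tokens (length made up)
vae_tokens = np.zeros((1, 4096, dim))     # e.g. 64x64 patchified VAE latents

# Single-stream: one concatenated sequence feeds every transformer layer,
# instead of separate text/image streams joined via cross-attention.
stream = np.concatenate([text_tokens, semantic_tokens, vae_tokens], axis=1)
print(stream.shape)  # (1, 4205, 3840)
```

Because all token types share one set of transformer weights, no parameters are spent on a second stream or on dedicated cross-attention blocks, which is where the parameter-efficiency claim comes from.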
| Component | Original (bf16) | This Variant (fp16) |
|---|---|---|
| Text Encoder | 7.8 GB | ~8.0 GB |
| Transformer | 24.6 GB | ~12.3 GB |
| VAE | 160 MB | 160 MB |
| Total | ~32.6 GB | 20.54 GB |
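The quantized sizes in the variant table are roughly predictable from the fp16 total. Assuming group-wise affine quantization with an fp16 scale and bias per group of 64 weights (the usual MLX scheme; an assumption here, not stated in this repo), each weight costs about `bits + 2*16/64 = bits + 0.5` bits. A back-of-the-envelope check:

```python
FP16_TOTAL_GB = 20.54           # full-precision size from the table above
params_b = FP16_TOTAL_GB / 2    # ~10.3B parameters at 2 bytes each

def estimated_gb(bits: int, group_size: int = 64) -> float:
    # bits per weight, plus an fp16 scale and bias shared by each group
    bits_per_weight = bits + 2 * 16 / group_size
    return params_b * bits_per_weight / 8

for bits in (8, 4, 2):
    print(f"{bits}-bit: ~{estimated_gb(bits):.1f} GB")
```

The estimates come in a little under the actual file sizes (e.g. ~10.9 GB estimated vs 11.37 GB for 8-bit), consistent with some layers (typically norms and embeddings) being left unquantized.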
Reference usage with the original diffusers pipeline (CUDA; this snippet targets the upstream `Tongyi-MAI/Z-Image-Turbo` weights, not the MLX conversion):

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "Young Chinese woman in red Hanfu, intricate embroidery, ancient temple backdrop"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    num_inference_steps=9,  # Results in 8 DiT forwards
    guidance_scale=0.0,     # No CFG for Turbo models
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("example.png")
```
```bibtex
@article{z-image2025,
  title={Z-Image: An Efficient Image Generation Foundation Model with Scalable Single Stream Diffusion Transformer},
  author={Tongyi MAI Team},
  journal={arXiv preprint arXiv:2511.22699},
  year={2025}
}

@article{decoupled-dmd2025,
  title={Decoupled Consistency Model Distillation},
  author={Liu et al.},
  journal={arXiv preprint arXiv:2511.22677},
  year={2025}
}

@article{dmdr2025,
  title={DMDR: Fusing DMD with Reinforcement Learning},
  author={Jiang et al.},
  journal={arXiv preprint arXiv:2511.13649},
  year={2025}
}
```