andrevp committed on
Commit 32b4e9c · verified · 1 Parent(s): 8ef4235

Add comprehensive model card README

Files changed (1)
  1. README.md +150 -0
README.md ADDED
@@ -0,0 +1,150 @@
1
+ ---
2
+ library_name: mlx
3
+ license: apache-2.0
4
+ base_model: Tongyi-MAI/Z-Image-Turbo
5
+ tags:
6
+ - mlx
7
+ - diffusers
8
+ - safetensors
9
+ - text-to-image
10
+ - apple-silicon
11
+ - image-generation
12
+ pipeline_tag: text-to-image
13
+ language:
14
+ - en
15
+ - zh
16
+ ---
17
+
18
+ # Z-Image-Turbo — MLX (2-bit Quantized)
19
+
20
+ > MLX conversion of [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) for Apple Silicon.
21
+
22
+ This is the **2-bit quantized** MLX conversion. Linear-layer weights are quantized to 2 bits with `group_size=64`, while the VAE remains in float16 to preserve image quality. Note that 2-bit quantization may cause noticeable quality degradation compared with the higher-precision variants.
23
+
24
+ **Model size: 4.04 GB**
25
+
26
+ ## All Available MLX Variants
27
+
28
+ | Variant | Size | Quantization | Link |
29
+ |---------|------|-------------|------|
30
+ | **Full Precision (fp16)** | 20.54 GB | None | [andrevp/Z-Image-Turbo-MLX](https://huggingface.co/andrevp/Z-Image-Turbo-MLX) |
31
+ | **8-bit** | 11.37 GB | 8-bit, group_size=64 | [andrevp/Z-Image-Turbo-MLX-8bit](https://huggingface.co/andrevp/Z-Image-Turbo-MLX-8bit) |
32
+ | **4-bit** | 6.48 GB | 4-bit, group_size=64 | [andrevp/Z-Image-Turbo-MLX-4bit](https://huggingface.co/andrevp/Z-Image-Turbo-MLX-4bit) |
33
+ | **2-bit** | 4.04 GB | 2-bit, group_size=64 | [andrevp/Z-Image-Turbo-MLX-2bit](https://huggingface.co/andrevp/Z-Image-Turbo-MLX-2bit) |
34
+
35
+ ## About Z-Image-Turbo
36
+
37
+ Z-Image is an efficient 6B-parameter image generation foundation model built on a **Scalable Single-Stream Diffusion Transformer (S3-DiT)** architecture. Z-Image-Turbo is the distilled variant that needs only **8 NFEs** (number of function evaluations). The original model card reports sub-second inference latency on enterprise-grade H800 GPUs, so that figure should not be assumed for this MLX conversion.
38
+
39
+ ### Key Features
40
+
41
+ - **Photorealistic image generation** with state-of-the-art quality
42
+ - **Bilingual text rendering** (English & Chinese)
43
+ - **Strong instruction adherence**
44
+ - **8-step inference** — distilled via Decoupled-DMD + Reinforcement Learning (DMDR)
45
+ - **No CFG required** — guidance_scale=0.0
46
+
47
+ ## Architecture
48
+
49
+ | Component | Architecture | Weight Size |
50
+ |-----------|-------------|------------|
51
+ | **Text Encoder** | Qwen3 (36 layers, hidden_size=2560, GQA with 32/8 heads) | ~7.8 GB (fp16) |
52
+ | **Transformer** | ZImageTransformer2DModel (30 layers, dim=3840, 30 heads) | ~12.3 GB (fp16) |
53
+ | **VAE** | AutoencoderKL (from Flux, 16 latent channels) | ~160 MB (fp16) |
54
+ | **Tokenizer** | Qwen2Tokenizer (vocab_size=151,936) | — |
55
+ | **Scheduler** | FlowMatchEulerDiscreteScheduler | — |
56
+
57
+ The S3-DiT architecture concatenates text tokens, visual semantic tokens, and image VAE tokens at the sequence level as a unified input stream, maximizing parameter efficiency compared to dual-stream approaches.
58
+
59
+ ## Quantization Details
60
+
61
+ | Parameter | Value |
62
+ |-----------|-------|
63
+ | Bits | 2 |
64
+ | Group Size | 64 |
65
+ | Quantized Components | Text Encoder (Qwen3), Transformer (ZImageTransformer2DModel) |
66
+ | Non-Quantized Components | VAE (AutoencoderKL) — kept at float16 for image quality |
67
+ | Quantized Tensors | 526 Linear layer weight tensors |
68
+ | Method | MLX group quantization (`mlx.core.quantize`) |
69
+
70
+ Only 2D weight tensors from Linear layers are quantized. Normalization layers, biases, embeddings,
71
+ and position encodings remain in float16.
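To make the idea of group quantization concrete, here is a minimal pure-Python sketch of affine quantization applied per group of 64 weights. It follows the min/max scheme that `mlx.core.quantize` documents (scale = (max - min) / (2^bits - 1), bias = min), but the helper names `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical, and this is not the packed kernel or the conversion script used for this repository.

```python
def quantize_group(values, bits=2):
    """Affine-quantize one group of floats to `bits`-bit integer codes."""
    levels = (1 << bits) - 1            # 2-bit -> codes 0..3
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                    # constant group: every code is 0
        return [0] * len(values), 0.0, lo
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~= scale * code + bias."""
    return [scale * c + bias for c in codes]


def quantize_row(row, group_size=64, bits=2):
    """Quantize one weight row group by group.

    Returns a list of (codes, scale, bias) tuples, one per group.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]
```

With 2 bits, each group of 64 weights is reduced to four representable values, and the reconstruction error per weight is at most half the group's scale. That coarse grid is why the 2-bit variant is the most likely to show visible quality loss.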
72
+
73
+ ## Component Sizes
74
+
75
+ | Component | Original Checkpoint | This Variant (2-bit Quantized) |
76
+ |-----------|----------------|----------------------------------|
77
+ | Text Encoder | 7.8 GB | ~1.9 GB |
78
+ | Transformer | 24.6 GB | ~1.9 GB |
79
+ | VAE | 160 MB | 160 MB |
80
+ | **Total** | **~32.6 GB** | **4.04 GB** |
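The 2-bit figures can be roughly sanity-checked with a back-of-the-envelope estimate. The sketch below starts from the fp16 sizes listed in the Architecture table and makes several assumptions that the card does not state: each group of 64 weights stores one fp16 scale and one fp16 bias (32 extra bits per group), 1 GB means 10^9 bytes, and the text encoder's token embedding table is the only large tensor left unquantized. The function names are illustrative only.

```python
GB = 1e9  # assumption: sizes are reported in decimal gigabytes


def bits_per_weight(bits, group_size=64, overhead_bits=32):
    """Effective storage per weight: the packed code plus its share of
    the group's fp16 scale and fp16 bias (assumed 16 + 16 bits)."""
    return bits + overhead_bits / group_size


def quantized_size_gb(fp16_size_gb, bits, group_size=64, unquantized_gb=0.0):
    """Estimate a component's size when everything except
    `unquantized_gb` of fp16 tensors is group-quantized."""
    quantizable = fp16_size_gb - unquantized_gb
    return quantizable * bits_per_weight(bits, group_size) / 16 + unquantized_gb


# Token embedding table kept in float16: vocab_size x hidden_size x 2 bytes.
embedding_gb = 151_936 * 2560 * 2 / GB

text_encoder_gb = quantized_size_gb(7.8, bits=2, unquantized_gb=embedding_gb)
transformer_gb = quantized_size_gb(12.3, bits=2)
total_gb = text_encoder_gb + transformer_gb + 0.16  # VAE stays at ~160 MB
```

Under these assumptions the estimate comes to about 1.88 GB for the text encoder and 1.92 GB for the transformer, which is close to the ~1.9 GB listed for each. The estimated total of roughly 3.96 GB falls slightly short of the reported 4.04 GB, plausibly because normalization layers, biases, and position encodings also stay in float16.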
81
+
82
+ ## Original Model
83
+
84
+ - **Source**: [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo)
85
+ - **Authors**: Tongyi MAI Team (Alibaba)
86
+ - **License**: Apache 2.0
87
+ - **Papers**:
88
+   - Z-Image: [arXiv:2511.22699](https://arxiv.org/abs/2511.22699)
+   - Decoupled-DMD: [arXiv:2511.22677](https://arxiv.org/abs/2511.22677)
+   - DMDR: [arXiv:2511.13649](https://arxiv.org/abs/2511.13649)
91
+
92
+ ## Original Usage (PyTorch/CUDA)
93
+
94
+ ```python
+ import torch
+ from diffusers import ZImagePipeline
+
+ pipe = ZImagePipeline.from_pretrained(
+     "Tongyi-MAI/Z-Image-Turbo",
+     torch_dtype=torch.bfloat16,
+ )
+ pipe.to("cuda")
+
+ prompt = "Young Chinese woman in red Hanfu, intricate embroidery, ancient temple backdrop"
+
+ image = pipe(
+     prompt=prompt,
+     height=1024,
+     width=1024,
+     num_inference_steps=9,  # Results in 8 DiT forwards
+     guidance_scale=0.0,  # No CFG for Turbo models
+     generator=torch.Generator("cuda").manual_seed(42),
+ ).images[0]
+
+ image.save("example.png")
+ ```
117
+
118
+ ## Conversion Details
119
+
120
+ - Converted using MLX 0.30.6 on Apple Silicon
121
+ - Weights converted from bfloat16 to float16
122
+ - SafeTensors format (MLX-compatible)
123
+ - All weight keys preserved and verified
124
+ - VAE kept at float16 across all quantization levels
125
+ - Verified: no NaN/Inf values, all shapes consistent, all index files valid
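Casting bfloat16 to float16 trades exponent range for mantissa precision, so any bfloat16 value with magnitude above 65504 would become Inf in float16. The following is a minimal sketch of the kind of range check that the "no NaN/Inf" verification implies. It operates on raw 16-bit bfloat16 words using only the standard library. The function names are hypothetical, and this is not the verification script used for this conversion.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite float16 value


def bf16_bits_to_float(bits):
    """Decode a 16-bit bfloat16 pattern; bfloat16 is the top half of a float32."""
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]


def fits_in_fp16(value):
    """True if `value` is finite and within the finite float16 range."""
    return math.isfinite(value) and abs(value) <= FP16_MAX


def find_unsafe(bf16_words):
    """Return indices of bfloat16 words that are NaN/Inf or too large for float16."""
    return [i for i, word in enumerate(bf16_words)
            if not fits_in_fp16(bf16_bits_to_float(word))]
```

For example, `find_unsafe([0x3F80, 0x7F80, 0x4780])` flags the second word (+Inf) and the third (65536.0, just past the float16 range) but not the first (1.0). The sketch only covers overflow; very small bfloat16 values that underflow to zero or subnormals in float16 are not detected.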
126
+
127
+ ## Citation
128
+
129
+ ```bibtex
+ @article{z-image2025,
+   title={Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer},
+   author={Tongyi MAI Team},
+   journal={arXiv preprint arXiv:2511.22699},
+   year={2025}
+ }
+
+ @article{decoupled-dmd2025,
+   title={Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield},
+   author={Liu et al.},
+   journal={arXiv preprint arXiv:2511.22677},
+   year={2025}
+ }
+
+ @article{dmdr2025,
+   title={Distribution Matching Distillation Meets Reinforcement Learning},
+   author={Jiang et al.},
+   journal={arXiv preprint arXiv:2511.13649},
+   year={2025}
+ }
+ ```