Update README.md
README.md
CHANGED
|
@@ -13,32 +13,44 @@ pipeline_tag: text-to-image
|
|
| 13 |
<img src=https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/logo.svg width="40%"/>
|
| 14 |
</div>
|
| 15 |
<p align="center">
|
| 16 |
-
Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/
|
| 17 |
<br>
|
| 18 |
Check out GLM-Image's <a href="https://z.ai/blog/glm-image" target="_blank">Technical Blog</a>
|
| 19 |
<br>
|
| 20 |
Use GLM-Image's <a href="https://docs.z.ai/guides/image/glm-image" target="_blank">API</a>
|
| 21 |
</p>
|
| 22 |
|
| 23 |
-
GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained details. In general image generation quality, it aligns with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios.
|
| 24 |
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
+ Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.
|
| 35 |
|
| 36 |
Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.
|
| 37 |
|
| 38 |
+ Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
|
| 39 |
-
+ Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures
|
| 40 |
|
| 41 |
-
GLM-Image supports both text-to-image and image-to-image generation within a single model
|
| 42 |
|
| 43 |
+ Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.
|
| 44 |
+ Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.
|
|
@@ -88,8 +100,8 @@ image = Image.open(image_path).convert("RGB")
|
|
| 88 |
image = pipe(
|
| 89 |
prompt=prompt,
|
| 90 |
image=[image], # can input multiple images for multi-image-to-image generation such as [image, image1]
|
| 91 |
-
height=33 * 32,
|
| 92 |
-
width=32 * 32,
|
| 93 |
num_inference_steps=30,
|
| 94 |
guidance_scale=1.5,
|
| 95 |
generator=torch.Generator(device="cuda").manual_seed(42),
|
|
@@ -98,45 +110,333 @@ image = pipe(
|
|
| 98 |
image.save("output_i2i.png")
|
| 99 |
```
|
| 100 |
|
| 101 |
-
|
| 102 |
|
| 103 |
-
###
|
| 104 |
|
| 105 |
-
We use GLM-4.7 to
|
| 106 |
|
| 107 |
## Model Performance
|
| 108 |
|
| 109 |
### Text Rendering
|
| 110 |
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
|
| 13 |
<img src=https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/logo.svg width="40%"/>
|
| 14 |
</div>
|
| 15 |
<p align="center">
|
| 16 |
+
Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/8KFjEec7" target="_blank">Discord</a> community
|
| 17 |
<br>
|
| 18 |
Check out GLM-Image's <a href="https://z.ai/blog/glm-image" target="_blank">Technical Blog</a>
|
| 19 |
<br>
|
| 20 |
Use GLM-Image's <a href="https://docs.z.ai/guides/image/glm-image" target="_blank">API</a>
|
| 21 |
</p>
|
| 22 |
|
|
|
|
| 23 |
|
| 24 |
+
## Case
|
| 25 |
+
|
| 26 |
+

|
| 27 |
+
|
| 28 |
+
### T2I with dense text and knowledge
|
| 29 |
+
|
| 30 |
+

|
| 31 |
+
|
| 32 |
+
### I2I
|
| 33 |
+
|
| 34 |
+

|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
## Introduction
|
| 38 |
|
| 39 |
+
GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture. In general image generation quality, GLM-Image is comparable to mainstream latent diffusion approaches, and it shows significant advantages in text rendering and knowledge-intensive generation scenarios. It performs especially well in tasks requiring precise semantic understanding and complex information expression, while maintaining strong capabilities in high-fidelity and fine-grained detail generation. In addition to text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.
|
| 40 |
|
| 41 |
+
Model architecture: a hybrid autoregressive + diffusion decoder design.
|
| 42 |
+
|
| 43 |
+

|
| 44 |
+
|
| 45 |
+
+ Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands it to 1K-4K tokens, corresponding to 1K-2K high-resolution image outputs.
|
| 46 |
+ Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.
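
The token counts quoted above are consistent with one visual token per 32×32-pixel patch: a 1024×1024 image would give 32 × 32 = 1024 tokens, and a 2048×2048 image would give 64 × 64 = 4096. That patch size is inferred from the numbers in this README rather than stated by it, so the sketch below treats it as an assumption. `PATCH` and `visual_token_count` are illustrative names, not part of GLM-Image.

```python
PATCH = 32  # assumed pixels per token side; inferred from the README, not confirmed


def visual_token_count(height: int, width: int, patch: int = PATCH) -> int:
    """Number of visual tokens if each token covers a patch x patch pixel area."""
    if height % patch or width % patch:
        raise ValueError(f"height and width must be multiples of {patch}")
    return (height // patch) * (width // patch)


print(visual_token_count(1024, 1024))  # 1024
print(visual_token_count(2048, 2048))  # 4096
```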
|
| 47 |
|
| 48 |
Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.
|
| 49 |
|
| 50 |
+ Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
|
| 51 |
+
+ Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures as well as more precise text rendering.
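
As a rough illustration of the GRPO idea mentioned above: rewards for a group of samples generated from the same prompt are normalized against that group's own mean and standard deviation, so no separate value model is needed. The sketch below shows only this normalization step in plain Python. The reward values and the function name are made up for illustration, the use of the population standard deviation is a choice made here, and none of this is GLM-Image's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Under the decoupled scheme described above, one would presumably compute such advantages separately for each module, using an aesthetics or alignment reward for the autoregressive module and a detail or text-accuracy reward for the decoder.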
|
| 52 |
|
| 53 |
+
GLM-Image supports both text-to-image and image-to-image generation within a single model.
|
| 54 |
|
| 55 |
+ Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.
|
| 56 |
+ Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.
|
|
|
|
| 100 |
image = pipe(
|
| 101 |
prompt=prompt,
|
| 102 |
image=[image], # can input multiple images for multi-image-to-image generation such as [image, image1]
|
| 103 |
+
height=33 * 32, # height must be set explicitly, even if it matches the input image
|
| 104 |
+
width=32 * 32, # width must be set explicitly, even if it matches the input image
|
| 105 |
num_inference_steps=30,
|
| 106 |
guidance_scale=1.5,
|
| 107 |
generator=torch.Generator(device="cuda").manual_seed(42),
|
|
|
|
| 110 |
image.save("output_i2i.png")
|
| 111 |
```
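
The example above passes `height=33 * 32` and `width=32 * 32`, both multiples of 32, which matches the divisibility requirement listed in the Note section. When the target size comes from an arbitrary input image, one option is to round each side to the nearest multiple of 32 before calling the pipeline. `snap_to_multiple` below is a hypothetical helper written for this illustration; it is not part of diffusers or GLM-Image.

```python
def snap_to_multiple(value: int, base: int = 32) -> int:
    """Round value to the nearest multiple of base (ties round up), at least base."""
    snapped = (value + base // 2) // base * base
    return max(base, snapped)


# Example: an input image of 1050 x 1000 pixels (height x width).
height = snap_to_multiple(1050)
width = snap_to_multiple(1000)
print(height, width)  # 1056 992
```

The resulting values can then be passed as `height=height, width=width` in the call shown above.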
|
| 112 |
|
| 113 |
+
### SGLang Pipeline
|
| 114 |
+
|
| 115 |
+
Install SGLang (with the diffusion extra), transformers, and diffusers from source:
|
| 116 |
+
|
| 117 |
+
```
|
| 118 |
+
pip install "sglang[diffusion] @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
|
| 119 |
+
pip install git+https://github.com/huggingface/transformers.git
|
| 120 |
+
pip install git+https://github.com/huggingface/diffusers.git
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
+ Text to Image Generation
|
| 124 |
+
|
| 125 |
+
```
|
| 126 |
+
sglang serve --model-path zai-org/GLM-Image  # keeps running; send the request below from a second terminal
|
| 127 |
+
|
| 128 |
+
curl http://localhost:30000/v1/images/generations \
|
| 129 |
+
-H "Content-Type: application/json" \
|
| 130 |
+
-d '{
|
| 131 |
+
"model": "zai-org/GLM-Image",
|
| 132 |
+
"prompt": "Doraemon is flying in the sky.",
|
| 133 |
+
"n": 1,
|
| 134 |
+
"response_format": "b64_json",
|
| 135 |
+
"size": "1024x1024"
|
| 136 |
+
}' | python3 -c "import sys, json, base64; open('output_t2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
|
| 137 |
+
```
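
The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl command above: the endpoint, port, and JSON fields are copied from that command, while the function names are illustrative and the response shape (`data[0].b64_json`) is assumed from the decoding one-liner used there.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # address used in the curl example above


def build_payload(prompt: str, size: str = "1024x1024") -> dict:
    """Request body with the same fields as the curl example."""
    return {
        "model": "zai-org/GLM-Image",
        "prompt": prompt,
        "n": 1,
        "response_format": "b64_json",
        "size": size,
    }


def decode_first_image(body: dict) -> bytes:
    """Extract and decode the first base64-encoded image from a response body."""
    return base64.b64decode(body["data"][0]["b64_json"])


def generate(prompt: str, out_path: str = "output_t2i.png") -> None:
    """POST the prompt to the running server and write the image to out_path."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    with open(out_path, "wb") as f:
        f.write(decode_first_image(body))
```

With the server running, `generate("Doraemon is flying in the sky.")` should write `output_t2i.png`, matching the curl example.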
|
| 138 |
+
|
| 139 |
+
+ Image to Image Generation
|
| 140 |
+
|
| 141 |
+
```
|
| 142 |
+
sglang serve --model-path zai-org/GLM-Image  # keeps running; send the request below from a second terminal
|
| 143 |
+
|
| 144 |
+
curl -s -X POST "http://localhost:30000/v1/images/edits" \
|
| 145 |
+
-F "model=zai-org/GLM-Image" \
|
| 146 |
+
-F "image=@cond.jpg" \
|
| 147 |
+
-F "prompt=Replace the background of the snow forest with an underground station featuring an automatic escalator." \
|
| 148 |
+
-F "response_format=b64_json" | python3 -c "import sys, json, base64; open('output_i2i.png', 'wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"
|
| 149 |
+
```
|
| 150 |
|
| 151 |
+
### Note
|
| 152 |
|
| 153 |
+
+ We strongly recommend using GLM-4.7 to enhance prompts for higher image quality. See [our GitHub script](https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/examples/prompt_utils.py) for details.
|
| 154 |
+
+ The AR model used in GLM-Image is configured with `do_sample=True`, a temperature of `0.9`, and a `top_p` of `0.75` by default. A higher temperature produces more diverse and richer outputs, but it can also reduce output stability.
|
| 155 |
+
+ The target image height and width must each be divisible by 32; otherwise, the pipeline raises an error.
|
| 156 |
+
+ Because the inference optimizations for this architecture are currently limited, the runtime cost is still relatively high. It requires either a single GPU with more than 80GB of memory, or a multi-GPU setup.
|
| 157 |
+
+ vLLM-Omni and SGLang (with AR speedup) support is currently being integrated; stay tuned. For inference cost details, see our GitHub repository.
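
To make the sampling defaults in the note above concrete, here is a simplified sketch of temperature scaling followed by nucleus (top-p) sampling over a list of logits. It uses the default values quoted above (`0.9` and `0.75`), but it is a generic illustration in plain Python with made-up logits, not the code GLM-Image or transformers actually runs.

```python
import math
import random


def sample_top_p(logits, temperature=0.9, top_p=0.75, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose cumulative
    # probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens in proportion to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Hypothetical logits for a five-token vocabulary.
token = sample_top_p([2.0, 1.5, 0.3, -1.0, -2.0])
```

Raising `temperature` flattens the distribution so more tokens survive the top-p cut, which is the source of the diversity versus stability trade-off described in the note.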
|
| 158 |
|
| 159 |
## Model Performance
|
| 160 |
|
| 161 |
### Text Rendering
|
| 162 |
|
| 163 |
+
<div style="overflow-x: auto; margin-bottom: 16px;">
|
| 164 |
+
<table style="border-collapse: collapse; width: 100%;">
|
| 165 |
+
<thead>
|
| 166 |
+
<tr>
|
| 167 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;" rowspan="2">Model</th>
|
| 168 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;" rowspan="2">Open Source</th>
|
| 169 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;" colspan="3">CVTG-2K</th>
|
| 170 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;" colspan="3">LongText-Bench</th>
|
| 171 |
+
</tr>
|
| 172 |
+
<tr>
|
| 173 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">Word Accuracy</th>
|
| 174 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">NED</th>
|
| 175 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">CLIPScore</th>
|
| 176 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">AVG</th>
|
| 177 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">EN</th>
|
| 178 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">ZH</th>
|
| 179 |
+
</tr>
|
| 180 |
+
</thead>
|
| 181 |
+
<tbody>
|
| 182 |
+
<tr>
|
| 183 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Seedream 4.5</td>
|
| 184 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 185 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8990</td>
|
| 186 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9483</td>
|
| 187 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.8069</strong></td>
|
| 188 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.988</strong></td>
|
| 189 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.989</strong></td>
|
| 190 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.987</strong></td>
|
| 191 |
+
</tr>
|
| 192 |
+
<tr>
|
| 193 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Seedream 4.0</td>
|
| 194 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 195 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8451</td>
|
| 196 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9224</td>
|
| 197 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7975</td>
|
| 198 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.924</td>
|
| 199 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.921</td>
|
| 200 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.926</td>
|
| 201 |
+
</tr>
|
| 202 |
+
<tr>
|
| 203 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Nano Banana 2.0</td>
|
| 204 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 205 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7788</td>
|
| 206 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8754</td>
|
| 207 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7372</td>
|
| 208 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.965</td>
|
| 209 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.981</td>
|
| 210 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.949</td>
|
| 211 |
+
</tr>
|
| 212 |
+
<tr>
|
| 213 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">GPT Image 1 [High]</td>
|
| 214 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 215 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8569</td>
|
| 216 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9478</td>
|
| 217 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7982</td>
|
| 218 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.788</td>
|
| 219 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.956</td>
|
| 220 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.619</td>
|
| 221 |
+
</tr>
|
| 222 |
+
<tr>
|
| 223 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Qwen-Image</td>
|
| 224 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 225 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8288</td>
|
| 226 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9116</td>
|
| 227 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8017</td>
|
| 228 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.945</td>
|
| 229 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.943</td>
|
| 230 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.946</td>
|
| 231 |
+
</tr>
|
| 232 |
+
<tr>
|
| 233 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Qwen-Image-2512</td>
|
| 234 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 235 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8604</td>
|
| 236 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9290</td>
|
| 237 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7819</td>
|
| 238 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.961</td>
|
| 239 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.956</td>
|
| 240 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.965</td>
|
| 241 |
+
</tr>
|
| 242 |
+
<tr>
|
| 243 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Z-Image</td>
|
| 244 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 245 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8671</td>
|
| 246 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9367</td>
|
| 247 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7969</td>
|
| 248 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.936</td>
|
| 249 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.935</td>
|
| 250 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.936</td>
|
| 251 |
+
</tr>
|
| 252 |
+
<tr>
|
| 253 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;">Z-Image-Turbo</td>
|
| 254 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 255 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8585</td>
|
| 256 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.9281</td>
|
| 257 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.8048</td>
|
| 258 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.922</td>
|
| 259 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.917</td>
|
| 260 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.926</td>
|
| 261 |
+
</tr>
|
| 262 |
+
<tr>
|
| 263 |
+
<td style="padding: 8px; border: 1px solid #d0d7de;white-space:nowrap;"><strong>GLM-Image</strong></td>
|
| 264 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 265 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.9116</strong></td>
|
| 266 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.9557</strong></td>
|
| 267 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.7877</td>
|
| 268 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.966</td>
|
| 269 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.952</td>
|
| 270 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.979</td>
|
| 271 |
+
</tr>
|
| 272 |
+
</tbody>
|
| 273 |
+
</table>
|
| 274 |
+
</div>
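
For readers unfamiliar with the NED column: it refers to normalized edit distance between the text recognized in a generated image and the target text. The sketch below uses one common formulation, 1 minus the Levenshtein distance divided by the length of the longer string, so that higher is better, as in the table. The benchmark's exact definition and OCR pipeline may differ, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_similarity(predicted: str, target: str) -> float:
    """1 - normalized edit distance; 1.0 means the strings are identical."""
    longest = max(len(predicted), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, target) / longest


print(ned_similarity("GLM-Imaqe", "GLM-Image"))  # one wrong character out of nine
```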
|
| 275 |
+
|
| 276 |
+
### Text-to-Image
|
| 277 |
+
|
| 278 |
+
<div style="overflow-x: auto; margin-bottom: 16px;">
|
| 279 |
+
<table style="border-collapse: collapse; width: 100%;">
|
| 280 |
+
<thead>
|
| 281 |
+
<tr>
|
| 282 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;" rowspan="2">Model</th>
|
| 283 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;" rowspan="2">Open Source</th>
|
| 284 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;" colspan="2">OneIG-Bench</th>
|
| 285 |
+
<th style="padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;" colspan="2">TIIF-Bench</th>
|
| 286 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa;" rowspan="2">DPG-Bench</th>
|
| 287 |
+
</tr>
|
| 288 |
+
<tr>
|
| 289 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">EN</th>
|
| 290 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">ZH</th>
|
| 291 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">short</th>
|
| 292 |
+
<th style="white-space: nowrap; padding: 8px; border: 1px solid #d0d7de; background-color: #f6f8fa; text-align: center;">long</th>
|
| 293 |
+
</tr>
|
| 294 |
+
</thead>
|
| 295 |
+
<tbody>
|
| 296 |
+
<tr>
|
| 297 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Seedream 4.5</td>
|
| 298 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 299 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.576</td>
|
| 300 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.551</td>
|
| 301 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">90.49</td>
|
| 302 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>88.52</strong></td>
|
| 303 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>88.63</strong></td>
|
| 304 |
+
</tr>
|
| 305 |
+
<tr>
|
| 306 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Seedream 4.0</td>
|
| 307 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 308 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.576</td>
|
| 309 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.553</td>
|
| 310 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">90.45</td>
|
| 311 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.08</td>
|
| 312 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.54</td>
|
| 313 |
+
</tr>
|
| 314 |
+
<tr>
|
| 315 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Nano Banana 2.0</td>
|
| 316 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 317 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.578</strong></td>
|
| 318 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>0.567</strong></td>
|
| 319 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;"><strong>91.00</strong></td>
|
| 320 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.26</td>
|
| 321 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">87.16</td>
|
| 322 |
+
</tr>
|
| 323 |
+
<tr>
|
| 324 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">GPT Image 1 [High]</td>
|
| 325 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 326 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.533</td>
|
| 327 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.474</td>
|
| 328 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">89.15</td>
|
| 329 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.29</td>
|
| 330 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">85.15</td>
|
| 331 |
+
</tr>
|
| 332 |
+
<tr>
|
| 333 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">DALL-E 3</td>
|
| 334 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✗</td>
|
| 335 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 336 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 337 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">74.96</td>
|
| 338 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">70.81</td>
|
| 339 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">83.50</td>
|
| 340 |
+
</tr>
|
| 341 |
+
<tr>
|
| 342 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Qwen-Image</td>
|
| 343 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 344 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.539</td>
|
| 345 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.548</td>
|
| 346 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">86.14</td>
|
| 347 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">86.83</td>
|
| 348 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.32</td>
|
| 349 |
+
</tr>
|
| 350 |
+
<tr>
|
| 351 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Qwen-Image-2512</td>
|
| 352 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 353 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.530</td>
|
| 354 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.515</td>
|
| 355 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">83.24</td>
|
| 356 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">84.93</td>
|
| 357 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">87.20</td>
|
| 358 |
+
</tr>
|
| 359 |
+
<tr>
|
| 360 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Z-Image</td>
|
| 361 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 362 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.546</td>
|
| 363 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.535</td>
|
| 364 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">80.20</td>
|
| 365 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">83.01</td>
|
| 366 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">88.14</td>
|
| 367 |
+
</tr>
|
| 368 |
+
<tr>
|
| 369 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Z-Image-Turbo</td>
|
| 370 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 371 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.528</td>
|
| 372 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.507</td>
|
| 373 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">77.73</td>
|
| 374 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">80.05</td>
|
| 375 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">84.86</td>
|
| 376 |
+
</tr>
|
| 377 |
+
<tr>
|
| 378 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">FLUX.1 [Dev]</td>
|
| 379 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 380 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.434</td>
|
| 381 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 382 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">71.09</td>
|
| 383 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">71.78</td>
|
| 384 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">83.52</td>
|
| 385 |
+
</tr>
|
| 386 |
+
<tr>
|
| 387 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">SD3 Medium</td>
|
| 388 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 389 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 390 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 391 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">67.46</td>
|
| 392 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">66.09</td>
|
| 393 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">84.08</td>
|
| 394 |
+
</tr>
|
| 395 |
+
<tr>
|
| 396 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">SD XL</td>
|
| 397 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 398 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.316</td>
|
| 399 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 400 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">54.96</td>
|
| 401 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">42.13</td>
|
| 402 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">74.65</td>
|
| 403 |
+
</tr>
|
| 404 |
+
<tr>
|
| 405 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">BAGEL</td>
|
| 406 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 407 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.361</td>
|
| 408 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.370</td>
|
| 409 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">71.50</td>
|
| 410 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">71.70</td>
|
| 411 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 412 |
+
</tr>
|
| 413 |
+
<tr>
|
| 414 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Janus-Pro</td>
|
| 415 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 416 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.267</td>
|
| 417 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.240</td>
|
| 418 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">66.50</td>
|
| 419 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">65.01</td>
|
| 420 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">84.19</td>
|
| 421 |
+
</tr>
|
| 422 |
+
<tr>
|
| 423 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;">Show-o2</td>
|
| 424 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 425 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.308</td>
|
| 426 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 427 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">59.72</td>
|
| 428 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">58.86</td>
|
| 429 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">-</td>
|
| 430 |
+
</tr>
|
| 431 |
+
<tr>
|
| 432 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; white-space:nowrap;font-weight:bold;">GLM-Image</td>
|
| 433 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">✓</td>
|
| 434 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.528</td>
|
| 435 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">0.511</td>
|
| 436 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">81.01</td>
|
| 437 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">81.02</td>
|
| 438 |
+
<td style="padding: 8px; border: 1px solid #d0d7de; text-align: center;">84.78</td>
|
| 439 |
+
</tr>
|
| 440 |
+
</tbody>
|
| 441 |
+
</table>
|
| 442 |
+
</div>
|