File size: 11,154 Bytes
308469a
 
02be901
 
 
fca86e1
 
308469a
 
915614c
 
49eda47
02be901
 
 
d2f1f8f
54f8a75
d2f1f8f
f24064d
f0558bb
 
83adb9f
6016754
83adb9f
 
 
c66dbdc
83adb9f
 
02be901
915614c
 
02be901
49eda47
02be901
 
49eda47
915614c
 
 
 
 
 
 
 
 
 
dbf08d6
915614c
dbf08d6
915614c
 
 
49eda47
915614c
 
 
02be901
 
 
 
 
915614c
02be901
915614c
 
 
 
 
 
02be901
 
 
 
 
 
915614c
 
02be901
 
915614c
 
02be901
915614c
 
 
 
 
02be901
 
 
 
915614c
02be901
915614c
02be901
915614c
 
 
02be901
915614c
 
 
 
 
 
02be901
 
 
 
 
 
 
915614c
 
 
02be901
915614c
 
 
 
 
1589d2b
 
 
 
 
 
 
 
 
 
 
 
 
 
915614c
 
308469a
 
 
 
915614c
 
 
308469a
 
 
 
915614c
308469a
915614c
 
 
49eda47
 
 
308469a
 
915614c
 
 
 
 
 
1589d2b
 
 
915614c
 
d257b0f
 
 
 
02be901
915614c
 
 
 
 
 
 
 
 
bc68c81
915614c
 
49eda47
 
 
915614c
 
 
 
 
fca86e1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
---
license: apache-2.0
pipeline_tag: text-to-image
library_name: diffusers
tags:
- text-to-image
- 8B
---

# ERNIE-Image-Turbo


<p align="center">
  <a href="https://huggingface.co/Baidu/ERNIE-Image">🤗 ERNIE-Image</a> &nbsp;|&nbsp;
  <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">🤗 ERNIE-Image-Turbo</a> &nbsp;|&nbsp;
  <a href="https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image/summary">🤖 ERNIE-Image</a> &nbsp;|&nbsp;  
  <a href="https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image-Turbo/summary">🤖 ERNIE-Image-Turbo</a> &nbsp;
  <br/>
  <a href="https://huggingface.co/spaces/baidu/ERNIE-Image-Turbo">🖥️ Huggingface Demo1</a> &nbsp;|&nbsp;
  <a href="https://huggingface.co/spaces/akhaliq/ERNIE-Image-Turbo">🖥️ Huggingface Demo2(ZeroGPU)</a> &nbsp;|&nbsp;
  <a href="https://aistudio.baidu.com/ernieimage">🖥️ AI Studio Demo</a> &nbsp;&nbsp;  
  <br/>
  <a href="https://github.com/baidu/ernie-image">Github</a> &nbsp;|&nbsp;
  <a href="https://yiyan.baidu.com/blog/posts/ernie-image">📖 Blog</a> &nbsp;|&nbsp;
  <a href="https://ernieimageprompt.com/">🖼️ Art Gallery</a>
  <br/>
  <a href="https://github.com/baidu/ERNIE-Image/blob/main/assets/contacts/WeChat_small.jpg">💬 WeChat(微信)</a> &nbsp;|&nbsp;
  <a href="https://discord.gg/ByUTbjfG5k">🫨 Discord</a> &nbsp;|&nbsp;
  <a href="https://x.com/ErnieforDevs">🏷️ X</a>
</p>
ERNIE-Image-Turbo is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is the distilled release of ERNIE-Image, built on the same single-stream Diffusion Transformer (DiT) family and designed for fast generation with strong fidelity in only 8 inference steps. The model retains strong controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image-Turbo remains strong on complex instruction following, text rendering, and structured image generation, making it well suited for posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and efficiency. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and stylized aesthetic outputs.

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/QRt1mPSU9SCkcxxFWQje2.jpeg" alt="ERNIE-Image Mosaic" width="100%">
</p>


**Highlights:**
- **Fast and efficient**: As the distilled checkpoint of ERNIE-Image, ERNIE-Image-Turbo delivers strong generation quality with only 8 inference steps, making it suitable for latency-sensitive applications.
- **Text rendering**: ERNIE-Image-Turbo performs well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
- **Instruction following**: The model is able to follow complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions with strong reliability.
- **Structured generation**: ERNIE-Image-Turbo is effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
- **Style coverage**: In addition to clean and readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
- **Practical deployment**: Thanks to its compact size, ERNIE-Image-Turbo can run on consumer GPUs with 24G VRAM, which lowers the barrier for research, downstream use, and model adaptation.

## Released Versions

[ERNIE-Image](https://huggingface.co/Baidu/ERNIE-Image): Our **SFT model**, delivers stronger general-purpose capability and instruction fidelity in typically **50 inference steps**.

[ERNIE-Image-Turbo](https://huggingface.co/Baidu/ERNIE-Image-Turbo): Our **Turbo model**, optimized by **DMD and RL**, achieves faster speed and higher aesthetics in only **8 inference steps**.

## Benchmark

### GENEval

| Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
|---|---:|---:|---:|---:|---:|---:|---:|
| ERNIE-Image (w/o PE) | **1.0000** | 0.9596 | 0.7781 | 0.9282 | 0.8550 | **0.7925** | **0.8856** |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | **0.8625** | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | **0.8900** | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | **1.0000** | **0.9621** | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | **1.0000** | 0.9400 | 0.7800 | **0.9300** | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | **1.0000** | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |

### OneIG-EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | **0.4810** | **0.2450** | **0.5780** |
| Seedream 4.5 | 0.8910 | **0.9980** | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | **0.3566** | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | **0.8920** | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |

### OneIG-ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---:|---:|---:|---:|---:|---:|
| Nano Banana 2.0 | **0.8430** | 0.9830 | **0.3110** | **0.4610** | 0.2360 | **0.5670** |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | **0.2790** | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | **0.9880** | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |

### LongTextBench

| Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
|---|---:|---:|---:|
| Seedream 4.5 | **0.9890** | **0.9873** | **0.9882** |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |

## Quick Start


### Recommended Parameters
- Resolution: 
    - 1024x1024
    - 848x1264
    - 1264x848
    - 768x1376
    - 896x1200
    - 1376x768
    - 1200x896
- Guidance scale: 1.0
- Inference steps: 8


### Diffusers

Install the latest version of diffusers:
```
pip install git+https://github.com/huggingface/diffusers
```


```python
import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=True # use prompt enhancer
).images[0]

image.save("output.png")
```




### SGLang

Install the latest version of sglang:
```
git clone https://github.com/sgl-project/sglang.git
```

Start the server:

```bash
sglang serve --model-path baidu/ERNIE-Image-Turbo
```

Send a generation request:

```bash
curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png
```