olenet committed (verified) · Commit 9aac6a2 · 1 Parent(s): 60bd880

Update README.md

Files changed (1): README.md (+151 −131)
@@ -1,97 +1,112 @@
  ---
  license: apache-2.0
  ---

- # baidu/ERNIE-Image Model Cards
- ## Model Details
-
- ### Model Description
-
- **ERNIE-Image** is a text-to-image generation model developed by the ERNIE team at Baidu.
-
- In terms of image quality, ERNIE-Image is on par with current state-of-the-art models. It demonstrates significant advantages in handling complex instructions, particularly in tasks that require **accurate text rendering** and **knowledge-intensive generation**.
-
- ### Key Features
-
- * **Precise text rendering**: Especially strong in dense or complex text scenarios
- * **Excellent instruction following**: Accurately interprets and executes complex prompts
- * **High-quality portraits and stylized images**: Strong performance in both realism and artistic styles
-
- ### Model Architecture
-
- ERNIE-Image consists of the following components:
-
- * An 8B-parameter Diffusion Transformer (DiT)
- * A 3B text encoder from Ministral
- * A VAE based on flux2.dev
- * A prompt enhancer fine-tuned using Ministral 3B
-
- ### Deployment
-
- Thanks to its relatively compact model size, ERNIE-Image can be deployed on **consumer-grade GPUs (e.g., 24 GB VRAM)**, making high-quality image generation more accessible and practical.
-
- ## Evaluation
-
- ### Benchmark
-
- ### Showcase
- <style>
- .masonry {
-   column-count: 2;  /* two columns */
-   column-gap: 12px;
- }
-
- .card {
-   break-inside: avoid;  /* keep a card from being split across columns */
-   margin-bottom: 12px;
- }
-
- .card img {
-   width: 100%;   /* uniform width */
-   height: auto;  /* height scales with the image (key) */
-   display: block;
-   border-radius: 8px;
- }
- </style>
-
- <section id="shape-3160722256177163633" class="tab-panel active">
-   <div class="masonry">
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/ukAjGbYZwG4jPRRs9UgJ0.jpeg">
-     </article>
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/nD1aI60oXzAuGnhBA0P8T.jpeg">
-     </article>
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/YveUuXuSIiBWDs-U_mYqI.jpeg">
-     </article>
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/_0RGqPUeIB0r5SigsPgjz.jpeg">
-     </article>
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/RLQonbN1cRg4cHyx3GnCS.jpeg">
-     </article>
-     <article class="card">
-       <img src="https://cdn-uploads.huggingface.co/production/uploads/6358acfe0e4fef21982a929b/C8NxWvTeC0-tqZuBmVPMT.jpeg">
-     </article>
-   </div>
- </section>
-
- ## Uses
-
- ### Installation & Download
- Install the latest version of diffusers:
- ```
- pip install git+https://github.com/huggingface/diffusers
- ```
- Download the model:
- ```
- pip install -U huggingface_hub
- HF_XET_HIGH_PERFORMANCE=1 hf download baidu/ERNIE-Image
- ```
  ### Recommended Parameters
  - Resolution:
    - 1024x1024
@@ -103,53 +118,58 @@ HF_XET_HIGH_PERFORMANCE=1 hf download baidu/ERNIE-Image
    - 1200x896
  - Guidance scale: 4.0
  - Inference steps: 50
-
- ### Usage Example
- ```python
- import os
- os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
- import random
- import numpy as np
  import torch
  from diffusers import ErnieImagePipeline

- seed = 42
- print(f"seed: {seed}")
- random.seed(seed)
- np.random.seed(seed)
- torch.manual_seed(seed)
- torch.cuda.manual_seed_all(seed)
- torch.backends.cudnn.deterministic = True
- torch.use_deterministic_algorithms(True)
- torch.backends.cudnn.benchmark = False
-
- # load the pipeline
  pipe = ErnieImagePipeline.from_pretrained(
-     "baidu/ERNIE-Image",
      torch_dtype=torch.bfloat16,
- )
- pipe = pipe.to("cuda")
- pipe.transformer.eval()
- pipe.vae.eval()
- pipe.text_encoder.eval()
- pipe.pe.eval()
-
- # fix the random seed
- generator = torch.Generator(device="cuda").manual_seed(seed)
- # generate an image
- prompt = "A photo of a cat wearing sunglasses"  # example prompt
- output = pipe(
-     prompt=prompt,
      height=1024,
      width=1024,
      num_inference_steps=50,
-     guidance_scale=4.0,
-     generator=generator,
-     num_images_per_prompt=1,
-     use_pe=True
- )
-
- revised_prompt = output.revised_prompts
- images = output.images
- images[0].save("./hf_output_0.png")
- print(revised_prompt)
- ```
  ---
  license: apache-2.0
+ pipeline_tag: text-to-image
+ library_name: diffusers
+ tags:
+ - text-to-image
  ---

+ # ERNIE-Image
+
+ <p align="center">
+   <img src="mosaic.jpg" alt="ERNIE-Image Mosaic" width="60%">
+ </p>
+
+ <p align="center">
+   <a href="https://huggingface.co/Baidu/ERNIE-Image">🤗 ERNIE-Image</a> &nbsp;|&nbsp;
+   <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">🤗 ERNIE-Image-Turbo</a> &nbsp;|&nbsp;
+   <a href="TODO">💻 GitHub</a> &nbsp;|&nbsp;
+   <a href="TODO">📖 Blog</a> &nbsp;|&nbsp;
+   <a href="TODO">🖼️ Gallery</a>
+ </p>
+
+ ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) paired with a lightweight Prompt Enhancer that expands brief user inputs into richer, structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality but also for controllability in practical generation scenarios, where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content-creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetics.
+
+ **Highlights:**
+ - **Compact but strong**: Despite its compact 8B scale, ERNIE-Image remains highly competitive with substantially larger open-weight models across a range of benchmarks.
+ - **Text rendering**: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
+ - **Instruction following**: The model follows complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions with strong reliability.
+ - **Structured generation**: ERNIE-Image is especially effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
+ - **Style coverage**: In addition to clean, readable design-oriented outputs, the model supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
+ - **Practical deployment**: Thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24 GB of VRAM, lowering the barrier for research, downstream use, and model adaptation.
+
+ ## Released Versions
+
+ ### [ERNIE-Image](https://huggingface.co/Baidu/ERNIE-Image): Our **SFT model**; it delivers stronger general-purpose capability and instruction fidelity, typically in **50 inference steps**.
+
+ ### [ERNIE-Image-Turbo](https://huggingface.co/Baidu/ERNIE-Image-Turbo): Our **Turbo model**, optimized with **DMD and RL**; it achieves faster generation and higher aesthetic quality in only **8 inference steps**.
+
+ ## Benchmark
+
+ ### GenEval
+
+ | Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
+ |---|---:|---:|---:|---:|---:|---:|---:|
+ | ERNIE-Image (w/o PE) | **1.0000** | 0.9596 | 0.7781 | 0.9282 | 0.8550 | **0.7925** | **0.8856** |
+ | ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | **0.8625** | 0.7225 | 0.8728 |
+ | Qwen-Image | 0.9900 | 0.9200 | **0.8900** | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
+ | ERNIE-Image-Turbo (w/o PE) | **1.0000** | **0.9621** | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
+ | ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
+ | FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
+ | Z-Image | **1.0000** | 0.9400 | 0.7800 | **0.9300** | 0.6200 | 0.7700 | 0.8400 |
+ | Z-Image-Turbo | **1.0000** | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
+
+ ### OneIG-EN
+
+ | Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
+ |---|---:|---:|---:|---:|---:|---:|
+ | Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | **0.4810** | **0.2450** | **0.5780** |
+ | Seedream 4.5 | 0.8910 | **0.9980** | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
+ | ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | **0.3566** | 0.4309 | 0.2411 | 0.5750 |
+ | Seedream 4.0 | **0.8920** | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
+ | ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
+ | ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
+ | Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
+ | Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
+ | ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
+ | FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
+ | Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
+ | GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
+ | Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |
+
+ ### OneIG-ZH
+
+ | Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
+ |---|---:|---:|---:|---:|---:|---:|
+ | Nano Banana 2.0 | **0.8430** | 0.9830 | **0.3110** | **0.4610** | 0.2360 | **0.5670** |
+ | ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
+ | Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
+ | Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
+ | Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | **0.2790** | 0.5480 |
+ | ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
+ | Z-Image | 0.7930 | **0.9880** | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
+ | ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
+ | Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
+ | GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
+ | Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
+ | ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
+ | FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |
+
+ ### LongText-Bench
+
+ | Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
+ |---|---:|---:|---:|
+ | Seedream 4.5 | **0.9890** | **0.9873** | **0.9882** |
+ | ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
+ | GLM-Image | 0.9524 | 0.9788 | 0.9656 |
+ | ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
+ | Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
+ | ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
+ | ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
+ | Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
+ | Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
+ | Z-Image | 0.9350 | 0.9360 | 0.9355 |
+ | Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
+ | Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
+ | FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |
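As a reading aid, the Avg column of the LongText-Bench table appears to be the arithmetic mean of the EN and ZH scores, rounded to four digits; this small sketch (illustrative, using a few rows copied from the table) checks that:

```python
# Spot-check a few LongText-Bench rows: Avg ≈ mean(EN, ZH).
rows = {
    "Seedream 4.5":        (0.9890, 0.9873, 0.9882),
    "ERNIE-Image (w/ PE)": (0.9804, 0.9661, 0.9733),
    "GLM-Image":           (0.9524, 0.9788, 0.9656),
    "FLUX.2-klein-9B":     (0.8642, 0.2183, 0.5413),
}
for model, (en, zh, avg) in rows.items():
    # allow for rounding of the published Avg to 4 decimal places
    assert abs((en + zh) / 2 - avg) < 1e-4, model
```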
+
+ ## Quick Start
  ### Recommended Parameters
  - Resolution:
    - 1024x1024
    - 1200x896
  - Guidance scale: 4.0
  - Inference steps: 50
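The recommended settings above can be bundled into a small helper so pipeline calls stay consistent. This is a sketch, not part of the ERNIE-Image API: `sampling_kwargs` is a hypothetical name, and only the resolutions listed in this excerpt are included.

```python
# Recommended defaults from the model card (only the resolutions
# shown in this excerpt of the README are listed here).
RECOMMENDED = {
    "resolutions": [(1024, 1024), (1200, 896)],  # (width, height)
    "guidance_scale": 4.0,
    "num_inference_steps": 50,
}

def sampling_kwargs(width: int = 1024, height: int = 1024) -> dict:
    """Return pipeline kwargs filled with the recommended defaults."""
    if (width, height) not in RECOMMENDED["resolutions"]:
        raise ValueError(f"unsupported resolution {width}x{height}")
    return {
        "width": width,
        "height": height,
        "guidance_scale": RECOMMENDED["guidance_scale"],
        "num_inference_steps": RECOMMENDED["num_inference_steps"],
    }
```

The returned dict can be splatted into a pipeline call, e.g. `pipe(prompt=..., **sampling_kwargs())`.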
+
+ ### Diffusers
+
+ `pip install git+https://github.com/huggingface/diffusers`
+
+ ```python
  import torch
  from diffusers import ErnieImagePipeline

  pipe = ErnieImagePipeline.from_pretrained(
+     "Baidu/ERNIE-Image",
      torch_dtype=torch.bfloat16,
+ ).to("cuda")
+
+ image = pipe(
+     prompt="A cinematic movie poster of a futuristic city at night with clear neon signage.",
      height=1024,
      width=1024,
      num_inference_steps=50,
+     guidance_scale=4.0,
+     use_pe=True,  # use the prompt enhancer
+ ).images[0]
+
+ image.save("output.png")
+ ```
+
+ ### SGLang
+
+ Clone the latest version of sglang:
+ ```bash
+ git clone https://github.com/sgl-project/sglang.git
+ ```
+
+ Start the server:
+
+ ```bash
+ sglang serve --model-path baidu/ERNIE-Image
+ ```
+
+ Send a generation request:
+
+ ```bash
+ curl -X POST http://localhost:30000/generate \
+   -H "Content-Type: application/json" \
+   -d '{
+     "prompt": "A black-and-white Chinese pastoral dog",
+     "height": 1024,
+     "width": 1024,
+     "num_inference_steps": 50,
+     "guidance_scale": 4.0,
+     "use_pe": true
+   }' \
+   --output output.png
+ ```
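The same request can be issued from Python with only the standard library. This is a sketch mirroring the curl call above; it assumes the SGLang server started in the previous step is listening on `localhost:30000` and returns raw PNG bytes.

```python
import json
import urllib.request

# Payload mirrors the curl example above.
payload = {
    "prompt": "A black-and-white Chinese pastoral dog",
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": True,
}

def generate(url: str = "http://localhost:30000/generate",
             out_path: str = "output.png") -> str:
    """POST the payload and write the returned image bytes to disk."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```

Calling `generate()` with the server running should save the image to `output.png`, matching the `--output` flag of the curl command.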