ovedrive committed on
Commit 26398f5 · verified · 1 Parent(s): c50702f

Update README.md

Files changed (1)
  1. README.md +10 -108
README.md CHANGED
@@ -5,119 +5,21 @@ library_name: diffusers
5
  tags:
6
  - text-to-image
7
  - 8B
8
  model_size: 7B
9
  ---
10
 
11
  # ERNIE-Image
12
 
13
- <p align="center">
14
- <a href="https://huggingface.co/Baidu/ERNIE-Image">🤗 ERNIE-Image</a> &nbsp;|&nbsp;
15
- <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">🤗 ERNIE-Image-Turbo</a> &nbsp;|&nbsp;
16
- <a href="https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image/summary">🤖 ERNIE-Image</a> &nbsp;|&nbsp;
17
- <a href="https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image-Turbo/summary">🤖 ERNIE-Image-Turbo</a> &nbsp;
18
- <br/>
19
- <a href="https://huggingface.co/spaces/baidu/ERNIE-Image-Turbo">🖥️ Huggingface Demo1</a> &nbsp;|&nbsp;
20
- <a href="https://huggingface.co/spaces/akhaliq/ERNIE-Image-Turbo">🖥️ Huggingface Demo2(ZeroGPU)</a> &nbsp;|&nbsp;
21
- <a href="https://aistudio.baidu.com/ernieimage">🖥️ AI Studio Demo</a> &nbsp;&nbsp;
22
-
23
- <br/>
24
- <a href="https://github.com/baidu/ernie-image">Github</a> &nbsp;|&nbsp;
25
- <a href="https://yiyan.baidu.com/blog/posts/ernie-image">📖 Blog</a> &nbsp;|&nbsp;
26
- <a href="https://ernieimageprompt.com/">🖼️ Art Gallery</a>
27
- <br/>
28
- <a href="https://github.com/baidu/ERNIE-Image/blob/main/assets/contacts/WeChat_small.jpg">💬 WeChat(微信)</a> &nbsp;|&nbsp;
29
- <a href="https://discord.gg/ByUTbjfG5k">🫨 Discord</a> &nbsp;|&nbsp;
30
- <a href="https://x.com/ErnieforDevs">🏷️ X</a>
31
- </p>
32
-
33
- ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer that expands brief user inputs into richer structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality, but also for controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.
34
-
35
- <p align="center">
36
- <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/QRt1mPSU9SCkcxxFWQje2.jpeg" alt="ERNIE-Image Mosaic" width="100%">
37
- </p>
38
-
39
- **Highlights:**
40
- - **Compact but strong**: Despite its compact 8B scale, ERNIE-Image remains highly competitive with substantially larger open-weight models across a range of benchmarks.
41
- - **Text rendering**: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
42
- - **Instruction following**: The model is able to follow complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions with strong reliability.
43
- - **Structured generation**: ERNIE-Image is especially effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
44
- - **Style coverage**: In addition to clean and readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
45
- - **Practical deployment**: Thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24G VRAM, which lowers the barrier for research, downstream use, and model adaptation.
46
-
47
- ## Released Versions
48
-
49
- [ERNIE-Image](https://huggingface.co/Baidu/ERNIE-Image): Our **SFT model**, delivers stronger general-purpose capability and instruction fidelity in typically **50 inference steps**.
50
-
51
- [ERNIE-Image-Turbo](https://huggingface.co/Baidu/ERNIE-Image-Turbo): Our **Turbo model**, optimized by **DMD and RL**, achieves faster speed and higher aesthetics in only **8 inference steps**.
52
-
53
- ## Benchmark
54
-
55
- ### GENEval
56
-
57
- | Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
58
- |---|---:|---:|---:|---:|---:|---:|---:|
59
- | ERNIE-Image (w/o PE) | **1.0000** | 0.9596 | 0.7781 | 0.9282 | 0.8550 | **0.7925** | **0.8856** |
60
- | ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | **0.8625** | 0.7225 | 0.8728 |
61
- | Qwen-Image | 0.9900 | 0.9200 | **0.8900** | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
62
- | ERNIE-Image-Turbo (w/o PE) | **1.0000** | **0.9621** | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
63
- | ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
64
- | FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
65
- | Z-Image | **1.0000** | 0.9400 | 0.7800 | **0.9300** | 0.6200 | 0.7700 | 0.8400 |
66
- | Z-Image-Turbo | **1.0000** | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
67
-
68
- ### OneIG-EN
69
-
70
- | Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
71
- |---|---:|---:|---:|---:|---:|---:|
72
- | Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | **0.4810** | **0.2450** | **0.5780** |
73
- | Seedream 4.5 | 0.8910 | **0.9980** | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
74
- | ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | **0.3566** | 0.4309 | 0.2411 | 0.5750 |
75
- | Seedream 4.0 | **0.8920** | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
76
- | ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
77
- | ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
78
- | Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
79
- | Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
80
- | ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
81
- | FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
82
- | Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
83
- | GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
84
- | Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |
85
-
86
- ### OneIG-ZH
87
-
88
- | Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
89
- |---|---:|---:|---:|---:|---:|---:|
90
- | Nano Banana 2.0 | **0.8430** | 0.9830 | **0.3110** | **0.4610** | 0.2360 | **0.5670** |
91
- | ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
92
- | Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
93
- | Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
94
- | Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | **0.2790** | 0.5480 |
95
- | ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
96
- | Z-Image | 0.7930 | **0.9880** | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
97
- | ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
98
- | Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
99
- | GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
100
- | Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
101
- | ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
102
- | FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |
103
-
104
- ### LongTextBench
105
-
106
- | Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
107
- |---|---:|---:|---:|
108
- | Seedream 4.5 | **0.9890** | **0.9873** | **0.9882** |
109
- | ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
110
- | GLM-Image | 0.9524 | 0.9788 | 0.9656 |
111
- | ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
112
- | Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
113
- | ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
114
- | ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
115
- | Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
116
- | Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
117
- | Z-Image | 0.9350 | 0.9360 | 0.9355 |
118
- | Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
119
- | Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
120
- | FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |
121
 
122
  ## Quick Start
123
 
 
5
  tags:
6
  - text-to-image
7
  - 8B
8
+ - nf4
9
+ - 4bit
10
+ - quantized
11
  model_size: 7B
12
+ quantized_by: Abhishek Dujari
13
+ base_model:
14
+ - baidu/ERNIE-Image
15
+ base_model_relation: quantized
16
  ---
17
 
18
  # ERNIE-Image
19
 
20
+ Ovedrive version with mixed precision, targeting 12GB of VRAM or less. It is not widely tested, so please share your results and optimal step counts.
21
+
22
+ Thank you to https://justlab.ai for the GPUs.
23
 
24
  ## Quick Start
25