@@ -8,7 +8,6 @@ tags:
 # ERNIE-Image
 <p align="center">
   <a href="https://huggingface.co/Baidu/ERNIE-Image">🤗 ERNIE-Image</a> &nbsp;|&nbsp;
   <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">🤗 ERNIE-Image-Turbo</a> &nbsp;|&nbsp;
@@ -17,11 +16,10 @@ tags:
   <a href="TODO">🖼️ Gallery</a>
 </p>
 ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer that expands brief user inputs into richer structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality, but also for controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.
 <p align="center">
-  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/zDC-EOfPO6RAFIE6xD1SW.jpeg" alt="ERNIE-Image Mosaic" width="100%">
 </p>
 **Highlights:**
@@ -40,7 +38,7 @@ ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Ima
 ## Benchmark
-### GenEval
 | Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
 |---|---:|---:|---:|---:|---:|---:|---:|
@@ -135,9 +133,9 @@ pipe = ErnieImagePipeline.from_pretrained(
 ).to("cuda")
 image = pipe(
-    prompt="A cinematic movie poster of a futuristic city at night with clear neon signage.",
-    height=1024,
-    width=1024,
     num_inference_steps=50,
     guidance_scale=4.0,
     use_pe=True # use prompt enhancer
@@ -165,9 +163,9 @@ Send a generation request:
 curl -X POST http://localhost:30000/generate \
   -H "Content-Type: application/json" \
   -d '{
-    "prompt": "A black-and-white Chinese rural dog",
-    "height": 1024,
-    "width": 1024,
     "num_inference_steps": 50,
     "guidance_scale": 4.0,
     "use_pe": true

 # ERNIE-Image
 <p align="center">
   <a href="https://huggingface.co/Baidu/ERNIE-Image">🤗 ERNIE-Image</a> &nbsp;|&nbsp;
   <a href="https://huggingface.co/Baidu/ERNIE-Image-Turbo">🤗 ERNIE-Image-Turbo</a> &nbsp;|&nbsp;
   <a href="TODO">🖼️ Gallery</a>
 </p>
 ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer that expands brief user inputs into richer structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality, but also for controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.
 <p align="center">
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f8d780e5d083370c711f575/QRt1mPSU9SCkcxxFWQje2.jpeg" alt="ERNIE-Image Mosaic" width="100%">
 </p>
 **Highlights:**
 ## Benchmark
+### GENEval
 | Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
 |---|---:|---:|---:|---:|---:|---:|---:|
 ).to("cuda")
 image = pipe(
+    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
+    height=1264,
+    width=848,
     num_inference_steps=50,
     guidance_scale=4.0,
     use_pe=True # use prompt enhancer
 curl -X POST http://localhost:30000/generate \
   -H "Content-Type: application/json" \
   -d '{
+    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
+    "height": 1264,
+    "width": 848,
     "num_inference_steps": 50,
     "guidance_scale": 4.0,
     "use_pe": true