docs: add prompting guide + simple-vs-ideal prompt table (Mistral3 encoder; quote-text for in-image rendering)
README.md
@@ -80,6 +80,30 @@ let image = try await engine.generate(.init(
))
```
## Prompting guide

ERNIE-Image-Turbo conditions on a **Mistral3-3B** text encoder. The upstream Baidu README shows two prompt styles working well: short, dense, comma-separated phrases (`"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"`) and long photograph-style narration (`"This is a photograph depicting an urban street scene. Shot at eye level..."`). Long prompts win when the scene has multiple subjects, a structured layout, or rendered text.

**ERNIE's particular strength: text inside the image.** Put the exact text you want rendered in **quotes** in the prompt, and the model will reproduce it on a sign, poster, label, or UI mock with very high fidelity for an open-weight model.

**Common pitfall: fusion concepts.** Prompts of the form *"X as Y"* (e.g. "The Statue of Liberty as a dog") fail because the text encoder parses `Statue of Liberty` and `dog` as two separate noun phrases, and the diffusion model paints both side by side. To get a single **fused** subject, write the fused entity as **one noun** and describe the pose, attributes, and setting that pull in the second concept.
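
The rewrite described above can be sketched as a small string helper. This is only an illustration: `fusedSubjectPrompt` and its parameter names are hypothetical and are not part of the ERNIE-Image-Turbo bundle or of the Swift engine shown earlier in this README.

```swift
/// Hypothetical helper for illustration only; not an ERNIE or engine API.
/// Builds a fused-subject prompt: one noun phrase for the subject, with the
/// second concept attached through pose and attributes.
func fusedSubjectPrompt(subject: String,
                        poseOf reference: String,
                        attributes: [String] = []) -> String {
    // Keep the fused entity as a single noun phrase and attach the second
    // concept with "in the exact pose of" instead of the "X as Y" form.
    let head = "This is a photograph of \(subject) in the exact pose of \(reference)"
    // Append the remaining details as comma-separated clauses.
    return ([head] + attributes).joined(separator: ", ")
}

let prompt = fusedSubjectPrompt(
    subject: "a colossal bronze statue depicting a golden retriever",
    poseOf: "the Statue of Liberty",
    attributes: [
        "right paw holding a flaming torch raised toward the sky",
        "New York harbor in the background",
        "photorealistic",
    ]
)
print(prompt)
```

The resulting string would go wherever the prompt is supplied to `engine.generate`; that call's parameters are not shown in this excerpt, so they are left out here.
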

| Simple English (what fails) | Prompt that works on ERNIE-Image-Turbo |
|---|---|
| "The Statue of Liberty as a dog" | This is a photograph of a colossal bronze statue depicting a golden retriever in the exact pose of the Statue of Liberty: standing upright on its hind legs atop a stone pedestal, right paw holding a flaming torch raised toward the sky, left paw clutching a stone tablet engraved with the date "MCMXXVI", a seven-pointed crown on its head, weathered green-bronze patina, New York harbor in the background, golden hour lighting, photorealistic |
| "A coffee shop sign" | A photograph of a coffee shop storefront on a brick-fronted street, a vintage glowing red neon sign mounted above the door that reads "BREW & CO." with the words "Coffee · Pastries · Open Late" in smaller letters below, early morning warm light, a few people visible through the window, photorealistic, 35mm lens |
| "Movie poster of a detective" | A vintage 1940s-style movie poster, a hard-boiled detective in a trench coat and fedora silhouetted against a foggy alley, dramatic chiaroscuro lighting, large red art-deco title reading "THE LAST CASE" at the top, smaller credits text at the bottom reading "Starring · Directed by · Music by", noir aesthetic, paper texture |
| "A child's bedroom" | A wide-angle photograph of a child's bedroom in a suburban home, twin bed with a galaxy-print comforter, a wall map of the solar system, a small wooden desk with a half-finished science fair poster reading "WHY THE SKY IS BLUE" in marker, late afternoon sunlight slanting through the window, photorealistic |
| "An infographic on healthy eating" | A clean infographic illustration about healthy eating, top banner text reading "EAT THE RAINBOW", below it a circular chart divided into colored sections labeled "VEGETABLES", "FRUITS", "WHOLE GRAINS", "PROTEIN", "DAIRY", each section illustrated with simple flat-vector food icons, soft pastel palette, modern editorial design |

**Heuristics that work well on ERNIE-Image-Turbo:**

- **Quote any text you want rendered.** `the sign reads "BREW & CO."` performs much better than `a sign saying brew and co`.
- **Open with the medium.** An opener such as `"This is a photograph of..."`, `"A movie poster showing..."`, or `"An infographic about..."` anchors composition early. ERNIE is particularly good at structured layouts (posters, comics, UI mocks).
- **Short prompts work too, in a tag-style register.** `"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"` is from the upstream Baidu examples. Use this style when you want atmosphere without committing to a specific scene.
- **For fusion concepts**: pick one noun for the fused subject, then *attribute* the other concept via pose / clothing / context. "A golden retriever statue *in the pose of* the Statue of Liberty" works; "A golden retriever *and* the Statue of Liberty" doesn't.
- **The bundle ships with Mistral3.** The text encoder is instruction-tuned, so prompts written as natural-language descriptions of a scene generally outperform pure keyword salad.
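
The first two heuristics (open with the medium, quote the text to render) can be combined in a small prompt builder. The sketch below is one possible shape, not something the bundle provides: `ScenePrompt`, its fields, and `build()` are hypothetical names introduced only for this example.

```swift
/// Hypothetical value type for illustration; these names do not come from
/// the ERNIE-Image-Turbo bundle or its Swift engine.
struct ScenePrompt {
    var medium: String            // opener, e.g. "A photograph of"
    var scene: String             // main subject and setting
    var renderedText: [(surface: String, text: String)] = []
    var style: [String] = []      // trailing style cues

    /// Assembles the prompt: medium first, quoted text next, style last.
    func build() -> String {
        var parts = ["\(medium) \(scene)"]
        for item in renderedText {
            // Wrap the literal text in double quotes so it is marked as
            // text to render rather than as scene description.
            parts.append("\(item.surface) that reads \"\(item.text)\"")
        }
        parts.append(contentsOf: style)
        return parts.joined(separator: ", ")
    }
}

let sign = ScenePrompt(
    medium: "A photograph of",
    scene: "a coffee shop storefront on a brick-fronted street",
    renderedText: [
        (surface: "a vintage glowing red neon sign mounted above the door",
         text: "BREW & CO."),
    ],
    style: ["early morning warm light", "photorealistic", "35mm lens"]
)
print(sign.build())
```

Keeping the rendered text in its own field makes it harder to forget the quotes, and the medium always lands at the start of the string. Tag-style prompts do not need this structure and can stay plain string literals.
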
## Why ERNIE-Image-Turbo
If you need **text inside images** that actually renders correctly (signs, labels, captions, UI mocks), this is currently the strongest open-weight option. Mid-2025 evaluations showed ERNIE-Image meeting or beating GPT-Image-1 on text rendering despite being 1/10th the size.