jc-builds/ERNIE-Image-Turbo-iOS

@@ -80,6 +80,30 @@ let image = try await engine.generate(.init(
 ))
 ```
+## Prompting guide
+ERNIE-Image-Turbo conditions on a **Mistral3-3B** text encoder. The upstream Baidu README shows two prompt styles working well: short, dense, comma-separated phrases (`"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"`) and long photograph-style narration (`"This is a photograph depicting an urban street scene. Shot at eye level..."`). Long prompts win when the scene has multiple subjects, structured layout, or rendered text.
+**ERNIE's particular strength: text inside the image.** Put the exact text you want rendered in **quotes** in the prompt and the model will reproduce it on a sign, poster, label, or UI mock with very high fidelity for an open-weight model.
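The quoting rule can be sketched as a small Swift helper. `renderedTextClause` is a hypothetical name, not part of this package's API; it only builds a prompt string. Swapping inner double quotes for single quotes is an assumption made here so the quoted span stays unambiguous.

```swift
import Foundation

// Hypothetical helper (not part of this package): builds a prompt clause
// that asks for exact text on a surface, wrapped in double quotes.
func renderedTextClause(surface: String, text: String) -> String {
    // Assumption: an inner double quote would end the quoted span early,
    // so it is replaced with a single quote.
    let safeText = text.replacingOccurrences(of: "\"", with: "'")
    return "\(surface) that reads \"\(safeText)\""
}

let clause = renderedTextClause(
    surface: "a vintage glowing red neon sign",
    text: "BREW & CO."
)
let signPrompt = "A photograph of a coffee shop storefront, \(clause), photorealistic"
print(signPrompt)
```

The resulting string would then be passed as the prompt of whatever generation call the app already uses.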
+**Common pitfall: fusion concepts.** Prompts of the form *"X as Y"* (e.g. "The Statue of Liberty as a dog") tend to fail: the model appears to treat `Statue of Liberty` and `dog` as two separate noun phrases and paints both side by side. To get a single **fused** subject, write the fused entity as **one noun** and describe the pose, attributes, and setting that pull in the second concept.
+| Simple English (fails or underdelivers) | Prompt that works on ERNIE-Image-Turbo |
+|---|---|
+| "The Statue of Liberty as a dog" | This is a photograph of a colossal bronze statue depicting a golden retriever in the exact pose of the Statue of Liberty: standing upright on its hind legs atop a stone pedestal, right paw holding a flaming torch raised toward the sky, left paw clutching a stone tablet engraved with the date "MCMXXVI", a seven-pointed crown on its head, weathered green-bronze patina, New York harbor in the background, golden hour lighting, photorealistic |
+| "A coffee shop sign" | A photograph of a coffee shop storefront on a brick-fronted street, a vintage glowing red neon sign mounted above the door that reads "BREW & CO." with the words "Coffee · Pastries · Open Late" in smaller letters below, early morning warm light, a few people visible through the window, photorealistic, 35mm lens |
+| "Movie poster of a detective" | A vintage 1940s-style movie poster, a hard-boiled detective in a trench coat and fedora silhouetted against a foggy alley, dramatic chiaroscuro lighting, large red art-deco title reading "THE LAST CASE" at the top, smaller credits text at the bottom reading "Starring · Directed by · Music by", noir aesthetic, paper texture |
+| "A child's bedroom" | A wide-angle photograph of a child's bedroom in a suburban home, twin bed with a galaxy-print comforter, a wall map of the solar system, a small wooden desk with a half-finished science fair poster reading "WHY THE SKY IS BLUE" in marker, late afternoon sunlight slanting through the window, photorealistic |
+| "An infographic on healthy eating" | A clean infographic illustration about healthy eating, top banner text reading "EAT THE RAINBOW", below it a circular chart divided into colored sections labeled "VEGETABLES", "FRUITS", "WHOLE GRAINS", "PROTEIN", "DAIRY", each section illustrated with simple flat-vector food icons, soft pastel palette, modern editorial design |
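The fusion pattern from the first table row can be sketched as a tiny prompt builder. `FusedSubject` and its fields are hypothetical names chosen for illustration. The sketch assumes the second concept is attached only through pose and attributes, as described above, and it produces a string rather than calling the engine.

```swift
// Hypothetical type (not part of this package): one noun phrase carries the
// fused subject; the second concept enters only via pose and attributes.
struct FusedSubject {
    var medium: String        // opening phrase, e.g. "This is a photograph of"
    var subject: String       // the single fused noun phrase
    var poseSource: String    // the concept whose pose is borrowed
    var attributes: [String]  // details borrowed from the second concept

    var prompt: String {
        let head = "\(medium) \(subject) in the exact pose of \(poseSource)"
        guard !attributes.isEmpty else { return head }
        return head + ": " + attributes.joined(separator: ", ")
    }
}

let fused = FusedSubject(
    medium: "This is a photograph of",
    subject: "a colossal bronze statue depicting a golden retriever",
    poseSource: "the Statue of Liberty",
    attributes: [
        "standing upright on its hind legs atop a stone pedestal",
        "right paw holding a flaming torch raised toward the sky",
        "a seven-pointed crown on its head",
    ]
)
print(fused.prompt)
```

Keeping `subject` as the only noun phrase that names a depicted entity is the point of the design: the Statue of Liberty appears in the prompt solely as a pose reference.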
+**Heuristics that work well on ERNIE-Image-Turbo:**
+- **Quote any text you want rendered.** `the sign reads "BREW & CO."` performs much better than `a sign saying brew and co`.
+- **Open with the medium.** `"This is a photograph of..."`, `"A movie poster showing..."`, `"An infographic about..."` anchors composition early — ERNIE is particularly good at structured layouts (posters, comics, UI mocks).
+- **Short prompts work too, in a tag-style register.** `"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"` is from the upstream Baidu examples. Use this style when you want atmosphere without committing to a specific scene.
+- **For fusion concepts**: pick one noun for the fused subject, then *attribute* the other concept via pose / clothing / context. "A golden retriever statue *in the pose of* the Statue of Liberty" works; "A golden retriever *and* the Statue of Liberty" doesn't.
+- **The bundle ships with the Mistral3-3B text encoder.** It is instruction-tuned, so prompts written as natural-language descriptions of a scene generally outperform pure keyword salad.
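The choice between the two prompt registers described in this guide can be condensed into a rough decision rule. The sketch below is only an illustration of that guidance: `SceneBrief`, `PromptRegister`, and `suggestedRegister` are hypothetical names, and the rule is a heuristic, not a guarantee of output quality.

```swift
// Hypothetical types (not part of this package) that encode the guidance:
// long narrated prompts for multiple subjects, structured layouts, or
// rendered text; short tag-style prompts otherwise.
enum PromptRegister {
    case tagStyle   // "Astronaut in a jungle, cold color palette, ..."
    case narrated   // "This is a photograph depicting ..."
}

struct SceneBrief {
    var subjectCount: Int
    var hasStructuredLayout: Bool   // poster, infographic, UI mock, comic
    var renderedText: [String]      // exact strings that must appear in the image
}

func suggestedRegister(for brief: SceneBrief) -> PromptRegister {
    if brief.subjectCount > 1 || brief.hasStructuredLayout || !brief.renderedText.isEmpty {
        return .narrated
    }
    return .tagStyle
}

let poster = SceneBrief(subjectCount: 1, hasStructuredLayout: true, renderedText: ["THE LAST CASE"])
let moodShot = SceneBrief(subjectCount: 1, hasStructuredLayout: false, renderedText: [])
print(suggestedRegister(for: poster), suggestedRegister(for: moodShot))
```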
 ## Why ERNIE-Image-Turbo
 If you need **text inside images** that actually renders correctly (signs, labels, captions, UI mocks), this is currently the strongest open-weight option. Mid-2025 evaluations showed ERNIE-Image meeting or beating GPT-Image-1 on text rendering despite being 1/10th the size.