ERNIE-Image-Turbo — iOS bundle

A mobile-friendly bundle of ERNIE-Image-Turbo, Baidu's 8B single-stream DiT, distilled for fast inference, with state-of-the-art text rendering quality among open-weight models. It is bundled with the Ministral 3B text encoder and ERNIE's own 32-channel AutoencoderKLFlux2 VAE for on-device inference via Mirage.

ERNIE-Image-Turbo is particularly strong at:

Photorealism at 1024×1024
Accurate text rendering inside images — best-in-class among open models
Speed — distilled to few-step sampling

What's inside

File	Role	Size
`ernie-image-turbo-Q3_K_M.gguf`	Diffusion transformer — 8B params, Q3_K_M	3.6 GB
`Ministral-3-3B-Instruct-2512-Q4_K_M.gguf`	Text encoder (Mistral3 emits the 3072-dim conditioning tensor ERNIE expects)	2.0 GB
`ae.safetensors`	VAE — ERNIE's 32-channel `AutoencoderKLFlux2` (≠ Flux's 16-channel `ae.safetensors`; the two are not interchangeable)	168 MB
`safety_negative_prompt.txt`	Recommended default negative prompt to apply at inference time for SFW-by-default deployments	<1 KB

Total bundle: ~5.7 GB on disk. Total GPU residency: ~7 GB, which puts this in iPhone 16 Pro / 17 Pro / Mac territory.
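Because the bundle is several gigabytes split across four files, it can help to confirm that everything has been downloaded before constructing the engine. The following is a minimal sketch using only Foundation; the file names come from the table above, while `bundleFileNames`, `missingBundleFiles`, and `bundleSizeInBytes` are hypothetical helpers and not part of Mirage.

```swift
import Foundation

// File names are taken from the table above; the helpers below are hypothetical.
let bundleFileNames = [
    "ernie-image-turbo-Q3_K_M.gguf",
    "Ministral-3-3B-Instruct-2512-Q4_K_M.gguf",
    "ae.safetensors",
    "safety_negative_prompt.txt",
]

/// Returns the names of bundle files that are not present in `directory`.
func missingBundleFiles(in directory: URL) -> [String] {
    bundleFileNames.filter { name in
        let path = directory.appendingPathComponent(name).path
        return !FileManager.default.fileExists(atPath: path)
    }
}

/// Sums the on-disk size, in bytes, of the bundle files that are present.
/// Files that are missing or unreadable count as zero.
func bundleSizeInBytes(in directory: URL) -> Int64 {
    var total: Int64 = 0
    for name in bundleFileNames {
        let url = directory.appendingPathComponent(name)
        let size = (try? url.resourceValues(forKeys: [.fileSizeKey]).fileSize) ?? 0
        total += Int64(size)
    }
    return total
}
```

A caller could pass the documents directory and refuse to start generation while `missingBundleFiles(in:)` returns a non-empty list.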

Safety / SFW-by-default

This bundle is intended for shipping in consumer apps and includes a recommended default negative prompt in `safety_negative_prompt.txt`. Apps built on top of this bundle SHOULD load the file and prepend its contents to any user-supplied negative prompt by default, with an explicit user-facing opt-out for adult/artistic contexts.

The blocklist covers:

Child safety — explicit terms blocking sexualised content involving minors or apparent minors (loaded first / highest weight in SD-style negative prompts)
Adult / explicit — nsfw, nude, explicit, sexual, anatomical detail
Gore + graphic violence — gore, blood, mutilation, etc.
Hate symbols — swastika, nazi, extremist

Diffusion models steer away from negative-prompt concepts; they do not reject them outright. A sufficiently determined prompt can still produce undesirable output, so apps shipping this bundle to general audiences should pair the negative-prompt filter with output-side classification (e.g. a CSAM/NSFW classifier on the generated CGImage) before display.
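The prepend-by-default behaviour recommended in this section can be sketched in a few lines of Foundation code. This is an illustration under stated assumptions: `loadSafetyBlocklist` and `effectiveNegativePrompt` are hypothetical helpers, the blocklist file is assumed to be a comma-separated list without a trailing comma, and how the resulting string is handed to Mirage is not shown because the quick start does not document a negative-prompt parameter.

```swift
import Foundation

/// Reads `safety_negative_prompt.txt` from the directory that holds the bundle.
func loadSafetyBlocklist(from directory: URL) throws -> String {
    let url = directory.appendingPathComponent("safety_negative_prompt.txt")
    return try String(contentsOf: url, encoding: .utf8)
}

/// Combines the safety blocklist with an optional user-supplied negative prompt.
/// The blocklist goes first; `safetyEnabled` is the explicit opt-out switch.
func effectiveNegativePrompt(
    safetyBlocklist: String,
    userNegativePrompt: String?,
    safetyEnabled: Bool = true
) -> String {
    let candidates = [safetyEnabled ? safetyBlocklist : "", userNegativePrompt ?? ""]
    let parts = candidates
        .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
        .filter { !$0.isEmpty }
    return parts.joined(separator: ", ")
}
```

Keeping the opt-out as a parameter with a default of `true` means callers get the SFW behaviour unless they deliberately turn it off.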

Quick start (Mirage)

import Mirage

let docs = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]

let engine = try Engine(models: ModelFiles(
    diffusionModel: docs.appendingPathComponent("ernie-image-turbo-Q3_K_M.gguf"),
    vae:            docs.appendingPathComponent("ae.safetensors"),
    textEncoder:    docs.appendingPathComponent("Ministral-3-3B-Instruct-2512-Q4_K_M.gguf")
))

let image = try await engine.generate(.init(
    prompt: "a vintage diner sign that reads \"OPEN 24/7\" in red neon, dusk lighting, photorealistic",
    width: 1024, height: 1024,
    steps: 8,         // Turbo distillation
    cfgScale: 1.0     // CFG is baked in
))

Prompting guide

ERNIE-Image-Turbo conditions on the bundled Ministral 3B text encoder (Mistral3). The upstream Baidu README shows two prompt styles working well: short, dense, comma-separated phrases ("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k") and long photograph-style narration ("This is a photograph depicting an urban street scene. Shot at eye level..."). Long prompts win when the scene has multiple subjects, structured layout, or rendered text.

ERNIE's particular strength: text inside the image. Put the exact text you want rendered in quotes in the prompt and the model will reproduce it on a sign, poster, label, or UI mock with very high fidelity for an open-weight model.

The icon-attractor problem

When your prompt fuses two well-known concepts (Statue of Liberty + dog, American Gothic + corgis, Tony Soprano + golden retriever), the diffusion transformer's attention often collapses toward whichever concept it has seen photographed thousands of times and ignores the other. Encoder-side, Mistral3 reads your prompt correctly; the failure happens at the DiT's denoising stage, where strong "icon attractors" overwhelm the creative twist at the locked turbo CFG of 1.0.

If you write "a bronze statue of a golden retriever ... on Liberty Island ... with the New York harbor" the model usually paints just the Statue of Liberty. The dog token loses the attention competition.

Four mitigations that actually work:

Strip the icon's name from the prompt. Don't say "Statue of Liberty", "American Gothic", "Tony Soprano", "Picard". Describe only the visual properties (pose, costume, setting). The icon attractor is summoned by the proper noun more than by visual descriptors.
Lead with the underdog concept. First tokens get more attention weight. Start with "A golden retriever..." not "A statue of...".
Reinforce anatomy / species multiple times. Every mention of "floppy ears", "snout", "paw", "fur" adds weight to the underdog attractor. The icon's anatomy (face, robe, crown) only gets named once or zero times.
Use a negative prompt to subtract the icon. With CFG locked at 1.0 you can't crank prompt adherence directly, but the negative prompt still subtracts attractors. Listing "human face, human person, woman, robe, gown" pushes the model away from the Statue-of-Liberty attractor explicitly.
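The four mitigations above can be sketched as a small prompt builder. This is an illustrative structure only: `FusionPrompt` and its members are hypothetical names, not part of Mirage, and the example strings in the usage note are made up for demonstration. Rather than deleting icon names automatically (which tends to leave broken sentences), the sketch only reports names that are still present so the author can rewrite them by hand.

```swift
import Foundation

/// Hypothetical container for a dual-concept prompt.
struct FusionPrompt {
    var underdog: String        // the subject that tends to lose, e.g. "A golden retriever"
    var anatomyCues: [String]   // species cues to repeat, e.g. "floppy ears"
    var scene: String           // pose, costume, setting, with no proper nouns
    var iconTraits: [String]    // features of the icon to push away from

    /// Leads with the underdog, then the anatomy cues, then the scene.
    var prompt: String {
        var parts = [underdog]
        parts.append(contentsOf: anatomyCues.map { "with \($0)" })
        parts.append(scene)
        return parts.joined(separator: ", ")
    }

    /// Negative prompt that lists the icon's traits.
    var negativePrompt: String {
        iconTraits.joined(separator: ", ")
    }

    /// Returns the icon names that still appear in the prompt (case-insensitive).
    func iconNamesStillPresent(_ iconNames: [String]) -> [String] {
        iconNames.filter { prompt.range(of: $0, options: .caseInsensitive) != nil }
    }
}
```

For example, a value built with `underdog: "A golden retriever"`, `anatomyCues: ["floppy ears", "a long snout"]`, a scene that describes a bronze figure on a harbor island pedestal, and `iconTraits: ["human face", "woman", "robe"]` yields a prompt that opens with the dog and a negative prompt that names only the icon's features.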

Some prompts are genuinely hard and may need multiple seeds. When all else fails, image-to-image (start from a photo of the underdog subject, apply the prompt at moderate strength) is the usual workaround, though Mirage's public API does not expose it yet.

Examples — viral scroll-video set

Idea	ERNIE-Image-Turbo prompt
Loch Ness selfie	A slightly washed-out iPhone selfie photograph posted to Instagram, a long-necked plesiosaur-style aquatic reptile taking the selfie with its front flipper holding the phone, half-submerged in a Scottish loch with green hills and a stone bridge visible behind, the creature making a "duck face" expression with closed eyes, slight motion blur, front-facing-camera color palette, photorealistic. Negative: cartoon, drawing, illustration
American Gothic corgis	A photograph of two Pembroke Welsh corgis standing side-by-side in front of a white wooden farmhouse with a tall gothic-arched window, vertical portrait composition, the front corgi holding an upright steel pitchfork with the prongs facing up, the back corgi looking forward with the same stern expression, both with floppy ears and stocky bodies, overcast midwestern light, dusty rural setting, photorealistic. Negative: human face, person, man, woman, farmer, anthropomorphic
Vending machine on Everest	This is a photograph at the summit of Mount Everest in the snow, a fully-illuminated modern red Coca-Cola branded vending machine standing upright in the snow with the words "ICE COLD" lit up on its front, fluorescent interior light, glass front showing stocked sodas and snacks, three climbers in red and orange expedition gear standing in a polite line, prayer flags blowing on a string overhead, golden alpenglow on snow-covered peaks behind, photorealistic
Mona Lisa barista	A photograph of a busy modern coffee shop in the East Village at morning rush, the barista behind the polished espresso machine is a woman with the exact face and slight smile of the Mona Lisa, wearing the dark Renaissance dress of the painting under a beige work apron with a name tag that reads "LISA", holding a metal portafilter in her hands, soft natural window light from camera left, chalkboard menu and pastry case behind her, photorealistic. Negative: art gallery, museum, oil painting frame, classical art exhibition
Astronauts on the subway	A photograph from inside a busy New York City subway car at rush hour, completely packed with people in full white NASA EVA spacesuits with gold reflective visors down, all holding the silver overhead bars or seated, fluorescent ceiling lights, dirty subway-car interior aesthetic, an MTA route map above the windows with the words "NEXT STOP: 14 ST", one astronaut reading a folded New York Times, perfectly mundane commuter body language, photorealistic, ultra wide-angle
Tony Soprano dog	A cinematic still in HBO premium-cable color grading and shallow depth of field, an adult golden retriever sitting upright in a red vinyl diner booth wearing a half-buttoned black silk bowling shirt over its chest, the dog's eyes fixed on the camera with a calm watchful expression, an onion ring held halfway to its open mouth on a fork, a jukebox glowing warm yellow behind the booth, plates of food on the table, late-night Northeast diner atmosphere, film grain. Negative: human face, person, man, woman, mafia boss
Stonehenge in suburbia	An aerial drone photograph of an ordinary cul-de-sac suburban backyard with a green mowed lawn, white picket fence, kids' wooden swing set, plastic pink flamingo, and a full-scale ring of massive weathered stone trilithons in the center of the yard, late afternoon shadows from the stones falling across a trampoline, the homeowner in a t-shirt watering the lawn with a hose in the corner unbothered, photorealistic, banal composition
Picard collie	A cinematic still from a 1990s science fiction television show, a black-and-white border collie sitting in a high-backed captain's chair on the bridge of a starship, the collie wearing a custom-tailored red and black Starfleet command uniform jacket, alert posture with front paws resting on the armrests, ears upright and attentive, a tabby cat at one console and a beagle at another visible at their stations in the background, the main viewscreen showing distant stars, soft 1990s television lighting, photorealistic. Negative: human face, person, man, bald man
Last Supper at Waffle House	A photograph composed as a long horizontal frieze, thirteen figures seated along one side of a long yellow Formica counter under fluorescent lighting at a 24-hour Waffle House in the American South, a neon sign on the wall reads "WAFFLE HOUSE", the central figure in a white robe gesturing with both hands while the others react in varied emotional postures of surprise and concern, plates of hashbrowns and waffles in front of each diner, coffee carafe in the foreground, photorealistic, 3 a.m. atmosphere
Pigeon TED talk	A photograph of a TED talk presentation in progress, the speaker standing alone on a circular red carpet stage is a single common pigeon wearing a tiny black headset microphone wrapped around its head, the pigeon calmly walking across the red carpet mid-stride, audience seated in dark silhouettes listening attentively, the large screen behind the pigeon shows a clean modern infographic titled "FINDING CRUMBS: A DATA-DRIVEN APPROACH", professional event lighting, photorealistic
Eiffel wine glass	A close-up food-photography photograph on a small marble bistro table in Paris, a single tall slender glass of deep red wine, the entire shape of the glass itself sculpted to match the silhouette of the Eiffel Tower with its widening base, narrow middle, and tapered top, delicate iron-lattice patterns etched into the glass surface, a small plate of brie and a sliced baguette beside it, golden hour light from a window, shallow depth of field, romantic atmosphere
Hackathon Rembrandt	A Dutch Golden Age oil painting in the style of Rembrandt's group portraits, dramatic chiaroscuro lighting falling from a single candle and several glowing laptop screens onto five modern programmers in hoodies and graphic t-shirts hunched over a long wooden table, three empty Red Bull cans gleaming in the warm light on the table, one figure pointing at a laptop screen in revelation, the others leaning in with expressions of focused intensity, deep dark background, warm golden palette, visible thick oil brushwork
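Several cells in the table above end with a `Negative:` list. A minimal sketch of splitting such a cell into its positive and negative parts, assuming the marker appears at most once and always after the positive prompt; `splitPromptCell` is a hypothetical helper and not part of Mirage.

```swift
import Foundation

/// Splits a table cell of the form "<prompt>. Negative: <terms>" into its two parts.
/// Cells without a "Negative:" marker return `nil` for the negative prompt.
func splitPromptCell(_ cell: String) -> (prompt: String, negative: String?) {
    // Search from the end so the trailing marker is the one that is matched.
    guard let marker = cell.range(of: "Negative:", options: .backwards) else {
        return (cell.trimmingCharacters(in: .whitespacesAndNewlines), nil)
    }
    let prompt = cell[..<marker.lowerBound]
        .trimmingCharacters(in: .whitespacesAndNewlines)
    let negative = cell[marker.upperBound...]
        .trimmingCharacters(in: .whitespacesAndNewlines)
    return (prompt, negative.isEmpty ? nil : negative)
}
```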

Heuristics that work well on ERNIE-Image-Turbo

Quote any text you want rendered. `the sign reads "BREW & CO."` performs much better than `a sign saying brew and co`. ERNIE's text-in-image fidelity is a real superpower; the viral set above leans on it where it adds composition payoff (ICE COLD, NEXT STOP: 14 ST, WAFFLE HOUSE, LISA, FINDING CRUMBS).
Open with the medium. "This is a photograph of...", "A movie poster showing...", "An infographic about..." anchors composition early — ERNIE is particularly good at structured layouts.
Short prompts work too, in a tag-style register. "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" is from the upstream Baidu examples. Use this style when you want atmosphere without committing to a specific scene.
For dual-attractor fusion concepts: strip the icon's name, lead with the underdog subject, reinforce its anatomy, and use a negative prompt to subtract the icon's attractor. See the four mitigations above.
Mistral3 is instruction-tuned — prompts written as natural-language descriptions of a scene generally outperform pure keyword salad.
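Two of these heuristics, quoting rendered text and opening with the medium, are mechanical enough to sketch as helpers. The function names `quotedTextClause` and `scenePrompt` are hypothetical, and stripping embedded double quotes is an assumption made here so the quoted span stays unambiguous.

```swift
import Foundation

/// Builds a clause such as: the sign reads "BREW & CO."
/// Embedded double quotes are removed so the quoted span stays unambiguous.
func quotedTextClause(surface: String, text: String) -> String {
    let clean = text.replacingOccurrences(of: "\"", with: "")
    return "the \(surface) reads \"\(clean)\""
}

/// Opens with the medium, then the subject, then text clauses and remaining details.
func scenePrompt(
    medium: String,
    subject: String,
    textClauses: [String] = [],
    details: [String] = []
) -> String {
    var parts = ["\(medium) \(subject)"]
    parts.append(contentsOf: textClauses)
    parts.append(contentsOf: details)
    return parts.joined(separator: ", ")
}
```

Calling `scenePrompt` with `medium: "This is a photograph of"` and a clause from `quotedTextClause(surface: "sign", text: "BREW & CO.")` produces a natural-language description with the rendered text in quotes, in line with the guidance above.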

Why ERNIE-Image-Turbo

If you need text inside images that actually renders correctly (signs, labels, captions, UI mocks), this is currently the strongest open-weight option. Mid-2025 evaluations showed ERNIE-Image meeting or beating GPT-Image-1 on text rendering despite being 1/10th the size.

Provenance

Component	Upstream	License
Diffusion transformer	baidu/ERNIE-Image	Apache 2.0
GGUF conversion	unsloth/ERNIE-Image-Turbo-GGUF	Apache 2.0
Text encoder	unsloth/Ministral-3-3B-Instruct-2512-GGUF	Apache 2.0
VAE	`baidu/ERNIE-Image-Turbo` — `vae/diffusion_pytorch_model.safetensors`, repacked as `ae.safetensors`	Apache 2.0

Performance (Mirage, rough)

Device	1024² @ 8 steps
iPhone 17 Pro	~3 min
iPhone 16 Pro	~5 min
M2 / M3 Mac	~6 min

Built by

Haplo · Mirage on GitHub
