DoodleBook

DoodleBook is a multimodal creative-learning application that turns a child's single drawing into a complete six-page picture book. A child uploads a doodle, names the hero, selects one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate story, converts the doodle into a consistent storybook character, illustrates every page, narrates the story, exports a printable PDF, and can also produce a matching coloring book.

The complete deployed model stack is only 7B parameters:

Role	Model	Parameters
Story author and scene planner	`openbmb/MiniCPM5-1B`	1B
Expressive narration and zero-shot voice cloning	`openbmb/VoxCPM2`	2B
Character design, illustrations, and coloring-page redraws	`black-forest-labs/FLUX.2-klein-4B`	4B
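
As a quick check of the arithmetic, the parameter counts in the table can be summed directly. The dictionary below only restates the table; the variable names are illustrative and not taken from the application.

```python
# Parameter counts in billions, restated from the table above.
PARAMS_B = {
    "openbmb/MiniCPM5-1B": 1,
    "openbmb/VoxCPM2": 2,
    "black-forest-labs/FLUX.2-klein-4B": 4,
}

total_b = sum(PARAMS_B.values())
# Story author plus narrator, without the image model.
core_b = PARAMS_B["openbmb/MiniCPM5-1B"] + PARAMS_B["openbmb/VoxCPM2"]

print(f"full stack: {total_b}B, story-and-voice core: {core_b}B")
# prints: full stack: 7B, story-and-voice core: 3B
```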

The central product idea is that the child's drawing is not merely an input image. It is the identity anchor for the whole experience. The same hero appears in the narrative, the six illustrations, the narration, the story PDF, and the optional coloring book.

1. Objective

Children rarely see their own imperfect drawings treated as finished creative work. Generative applications can produce polished images, but they often replace the child's idea rather than extend it. DoodleBook has a different objective:

Preserve the personality of a child's drawing and make that drawing the hero of a meaningful story the child can read, hear, keep, print, and color.

That objective creates five technical requirements:

The story must be joyful, understandable, and emotionally meaningful for a young child.
The selected theme must influence the actual lesson and plot, not just appear as a label.
The hero must remain visually recognizable across all six illustrations.
The selected narrator or uploaded family voice must match the final audio.
The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints.

2. The User Experience

The current Space deliberately keeps the input small:

Upload or photograph a drawing.
Enter the hero's name.
Select one of ten story themes.
Select a narrator, including the optional My Voice mode.
Choose whether to generate a matching coloring book.
Select Make my book.

The available themes are:

brave adventure
making a new friend
overcoming a fear
helping someone
lost and found
learning something new
kindness to animals
the magic of imagination
celebrating who you are
a rainy day adventure

The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice. For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice cloning from that reference.

The result is revealed in a deliberate order:

Narration audio becomes available first.
The illustrated story PDF download appears next.
The full illustrated pages are revealed after the PDF.
If requested, the coloring-book preview and PDF are also produced.

This order gives the child something to hear as soon as possible while the heavier document and page-rendering work completes.

3. End-to-End Architecture

Child's doodle + hero name + theme + narrator
                         |
                         v
            MiniCPM5-1B story generation
          title + character + text + scenes
                    /              \
                   /                \
                  v                  v
       FLUX canonical hero       VoxCPM2 narration
                  |             preset or cloned voice
                  v
        FLUX six page renders
                  |
          +-------+--------+
          |                |
          v                v
   Illustrated PDF   FLUX line-art redraw
                           |
                           v
                    Coloring-book PDF

The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces. The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy stages are isolated behind `@spaces.GPU` functions.
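
The module-scope pattern can be sketched roughly as follows. This is a minimal illustration rather than the Space's actual code: `load_story_model`, `STORY_MODEL`, and `write_story` are hypothetical names, the "model" is a stub, and the no-op fallback decorator exists only so the sketch also runs on a machine without the `spaces` package.

```python
# Minimal sketch of the ZeroGPU module-scope pattern (hypothetical names throughout).
try:
    import spaces  # available on Hugging Face ZeroGPU Spaces

    gpu = spaces.GPU
except ImportError:
    # Assumption: outside a Space, fall back to a no-op decorator.
    def gpu(fn):
        return fn


def load_story_model():
    """Stand-in for loading a real model; returns a trivial callable."""
    return lambda prompt: f"story for: {prompt}"


# Loaded once at import time, outside any GPU-decorated function.
STORY_MODEL = load_story_model()


@gpu
def write_story(prompt: str) -> str:
    """GPU-heavy work sits behind the decorator, so a GPU is requested only for this call."""
    return STORY_MODEL(prompt)
```

Keeping each heavy stage in its own decorated function means a GPU is held only while that stage runs, which is the isolation the paragraph above describes.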

4. MiniCPM5-1B: Story Author and Visual Planner

MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing:

a short book title
a reusable visual description of the hero
six page texts
six corresponding illustration scene descriptions

Each scene description is consumed directly by the image pipeline, so the language model acts as both author and visual director.
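
One way to picture the plan is as a small typed structure. The sketch below assumes field names (`title`, `character`, `pages`, `text`, `scene`) that mirror the list above; the real schema used by the application may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    text: str   # what is read aloud on this page
    scene: str  # illustration description handed to the image pipeline


@dataclass
class StoryPlan:
    title: str
    character: str  # reusable visual description of the hero
    pages: list


def parse_plan(raw: str) -> StoryPlan:
    """Parse a JSON plan and enforce the six-page contract."""
    data = json.loads(raw)
    pages = [Page(text=p["text"], scene=p["scene"]) for p in data["pages"]]
    if len(pages) != 6:
        raise ValueError(f"expected 6 pages, got {len(pages)}")
    return StoryPlan(title=data["title"], character=data["character"], pages=pages)
```

Because every page carries both `text` and `scene`, the same object can feed narration and illustration without a second model call.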

Theme-conditioned meaning

Every selectable theme has its own narrative guidance. For example:

Overcoming a fear treats fear respectfully and shows courage as a supported small step.
Making a new friend rewards listening, sharing, or a sincere hello rather than popularity.
Learning something new includes an imperfect attempt, guidance, practice, and improvement.
Celebrating who you are focuses on self-acceptance without making the hero superior.
Kindness to animals emphasizes gentle, age-appropriate care and trusted adult support.

This prevents ten buttons from producing the same generic adventure with different titles.
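
A lookup table is one simple way to attach guidance to a theme. The sketch below is hypothetical: the guidance strings paraphrase the examples above, only five of the ten themes are filled in, and `DEFAULT_GUIDANCE`, `theme_guidance`, and `build_story_prompt` are illustrative names, not the application's actual prompt code.

```python
# Hypothetical guidance table; the wording paraphrases the examples above.
THEME_GUIDANCE = {
    "overcoming a fear": "Treat the fear respectfully and show courage as a supported small step.",
    "making a new friend": "Reward listening, sharing, or a sincere hello rather than popularity.",
    "learning something new": "Include an imperfect attempt, guidance, practice, and improvement.",
    "celebrating who you are": "Focus on self-acceptance without making the hero superior.",
    "kindness to animals": "Emphasize gentle, age-appropriate care and trusted adult support.",
}

# Assumption: themes without an entry in this sketch get a generic instruction.
DEFAULT_GUIDANCE = "Let the lesson follow from what the hero actually chooses and does."


def theme_guidance(theme: str) -> str:
    """Look up guidance case-insensitively, ignoring surrounding whitespace."""
    return THEME_GUIDANCE.get(theme.strip().lower(), DEFAULT_GUIDANCE)


def build_story_prompt(hero_name: str, theme: str) -> str:
    """Assemble a prompt in which the theme shapes the plot, not just the label."""
    return (
        f"Write a six-page picture book about {hero_name}.\n"
        f"Theme: {theme}\n"
        f"Theme guidance: {theme_guidance(theme)}\n"
        "Return valid JSON only, with exactly six page objects."
    )
```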

Prompt design for a small model

The prompt is detailed but organized into a small number of explicit responsibilities:

Story quality: one coherent arc, natural read-aloud language, dialogue, humor, and purposeful sound effects such as BOOM!, WHOOSH!, or SPLASH!.
Emotional meaning: the lesson must follow from what the hero actually chooses and does.
Page progression: introduction, first attempt, complication, realization, action, and emotional payoff.
Visual continuity: the hero remains active on every page and retains the same memorable physical traits.
Illustration planning: each scene describes one drawable moment with visible characters, action, setting, props, light, color mood, and emotion.
Strict output: valid JSON only, with exactly six page objects.

A complete six-page exemplar establishes the quality target. The example demonstrates warmth, continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through action rather than attached as an unrelated final sentence.

Reliability

The output parser extracts and repairs JSON when possible. If language-model generation fails, the application has local theme-based story arcs so the user still receives a book rather than a broken session. Errors and fallback information are exposed in the trace instead of being hidden.
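
The extract, repair, and fall-back behavior might look roughly like this. It is a sketch under stated assumptions: the only repair shown is trailing-comma removal, `local_arc` is a hypothetical stand-in for the theme-based arcs, and the trace is a plain dictionary.

```python
import json
import re


def extract_json(raw: str):
    """Pull the outermost {...} span out of model output and try to parse it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]
    # Repair step: drop trailing commas before } or ].
    # Caveat: this naive regex would also touch such sequences inside strings.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def local_arc(theme: str) -> dict:
    """Hypothetical local fallback: six placeholder pages following the page progression."""
    beats = ["introduction", "first attempt", "complication",
             "realization", "action", "emotional payoff"]
    return {
        "title": f"A story about {theme}",
        "pages": [{"text": f"({beat})", "scene": beat} for beat in beats],
    }


def story_or_fallback(raw: str, theme: str, trace: dict) -> dict:
    """Return the parsed plan, or a local arc with the fallback recorded in the trace."""
    plan = extract_json(raw)
    if plan is not None and len(plan.get("pages", [])) == 6:
        return plan
    trace["fallback"] = "local theme arc"
    return local_arc(theme)
```

Recording the fallback in the trace, rather than silently substituting the local arc, matches the point above that errors are exposed instead of hidden.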

5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages

The hardest visual problem is cross-page character consistency. Six independent text-to-image requests tend to produce six different heroes. Color, clothing, face shape, body proportions, and accessories can drift enough to break the feeling that this is one book.

DoodleBook addresses this without training a new model for every child.

Stage 1: canonical-character generation

The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to:

preserve the creature or person type
preserve face, body shape, colors, markings, clothing, and accessories
clarify unclear lines without replacing the child's idea
produce one full-body hero on a neutral background
avoid extra characters, scenery, text, or duplicate views

This output becomes a canonical model sheet for the book.

Stage 2: reference-conditioned page generation

Each story scene is rendered using the same canonical hero image. The page prompt explicitly locks:

face and species
body proportions
colors and markings
clothing and accessories
child-safe crayon storybook style

The scene still changes from page to page, but the identity reference remains constant. Deterministic seeds provide reproducibility while page-specific seed offsets allow variation.
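
The seed scheme can be illustrated with a few lines of Python. How the application derives its base seed is not stated here, so hashing the hero name and theme is purely an assumption of this sketch, as are the function names.

```python
import hashlib


def base_seed(hero_name: str, theme: str) -> int:
    """Assumption: derive a stable 32-bit seed from the inputs, so reruns match."""
    digest = hashlib.sha256(f"{hero_name}|{theme}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def page_seeds(seed: int, pages: int = 6) -> list[int]:
    """One seed per page: the same base every run, a different offset per page."""
    return [seed + page_index for page_index in range(pages)]


print(page_seeds(1000))
# prints: [1000, 1001, 1002, 1003, 1004, 1005]
```

The same inputs always yield the same six seeds, while no two pages share a seed.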

If the user does not upload a doodle, the structured character description from MiniCPM5-1B becomes the text-based identity anchor.
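
The choice between an image anchor and a text anchor could be expressed as below. `PageRequest`, `build_page_request`, and the prompt wording are hypothetical and do not call any real image model.

```python
from dataclasses import dataclass
from typing import Any, Optional

STYLE = "child-safe crayon storybook style"


@dataclass
class PageRequest:
    prompt: str
    reference_image: Optional[Any]  # canonical hero image, or None for text-only anchoring


def build_page_request(scene: str, character_description: str,
                       canonical_hero: Optional[Any] = None) -> PageRequest:
    """Anchor identity on the canonical image when present, else on the description."""
    if canonical_hero is not None:
        prompt = f"{scene}. Keep the hero identical to the reference image. {STYLE}."
    else:
        prompt = f"{scene}. The hero is {character_description}. {STYLE}."
    return PageRequest(prompt=prompt, reference_image=canonical_hero)
```

Either way, every page request carries the same anchor, so only the scene text varies between pages.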

Why this approach matters

Per-user LoRA training would add a training job before every book. That is too slow and operationally expensive for an interactive children's application. Canonical image conditioning provides personalization at inference time, using the child's actual drawing as the reference.

6. VoxCPM2: Expressive Narration and Custom Family Voices

VoxCPM2 converts the title and all six page texts into one narrated story. The application splits the text into sentences, generates speech for each sentence, and inserts short pauses between them so the result sounds like a book being read rather than one continuous block.
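
The split-and-pause step can be sketched without any audio library. Here `synthesize` is a hypothetical callable standing in for the speech model, audio is a plain list of float samples, and the sample rate and pause length are assumed values.

```python
import re

SAMPLE_RATE = 16_000  # assumption: samples per second
PAUSE_MS = 350        # assumption: silence between sentences, in milliseconds


def split_sentences(text: str) -> list[str]:
    """Split after ., !, or ? followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def narrate(text: str, synthesize) -> list[float]:
    """Synthesize each sentence and join the clips with short silences."""
    silence = [0.0] * (SAMPLE_RATE * PAUSE_MS // 1000)
    sentences = split_sentences(text)
    audio: list[float] = []
    for index, sentence in enumerate(sentences):
        audio.extend(synthesize(sentence))
        if index < len(sentences) - 1:  # no trailing pause after the last sentence
            audio.extend(silence)
    return audio


print(split_sentences("BOOM! Pip jumped. Did it work?"))
# prints: ['BOOM!', 'Pip jumped.', 'Did it work?']
```

Splitting on sentence punctuation also keeps sound effects such as BOOM! as their own clips, which gives the narrator prompt room to perform them.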

Narrator presets

Each narrator option uses a separate voice-design instruction:

Little Kid: bright, playful, and full of wonder
Big Kid: confident, youthful, and energetic
Playful: animated delivery and light comic timing
Storyteller: soft, soothing bedtime pacing
Grandpa: warm, patient, and reassuring
My Voice: preserves the uploaded reference speaker while adding clear storytelling delivery

The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens BOOM!; a playful voice gives it energy without becoming harsh.

My Voice: zero-shot cloning, not fine-tuning

The deployed custom voice feature does not train a new audio model. VoxCPM2 is loaded with its denoiser enabled and receives the uploaded recording through `reference_wav_path`. It performs zero-shot voice cloning at inference time.

This distinction is important:

no user-specific checkpoint is created
no training job is required
the user can hear a familiar family voice from one short reference recording
custom-voice narration is scheduled sequentially after images to avoid concurrent ZeroGPU contention

Preset narration can run in parallel with illustration generation. Custom cloning is slower, so the number of cloned sentences is capped to fit the public GPU budget.
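
The two scheduling modes can be sketched with a thread pool. The cap value, `generate_book`, and the injected `render_pages` and `narrate` callables are all assumptions for illustration; the actual cap is not stated in this article.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CLONED_SENTENCES = 12  # assumption: the real cap is not given in the article


def generate_book(sentences, narrator, render_pages, narrate):
    """Return (pages, audio); custom voices run after images, presets run alongside."""
    if narrator == "My Voice":
        # Sequential: finish the illustrations first, then clone a capped number of sentences.
        pages = render_pages()
        audio = narrate(sentences[:MAX_CLONED_SENTENCES])
        return pages, audio
    # Preset voices: narration and illustration are submitted together.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(narrate, sentences)
        pages_future = pool.submit(render_pages)
        return pages_future.result(), audio_future.result()
```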

7. Fine-Tuning and Adaptation: What We Built

DoodleBook uses several forms of model adaptation, but they should not be confused.

Deployed in the current Space

Technique	Model	Status
Theme-specific structured prompting and few-shot story guidance	MiniCPM5-1B	Deployed
Canonical-image conditioning from the child's doodle	FLUX.2-klein-4B	Deployed
Deterministic seed and character-description identity anchors	FLUX.2-klein-4B	Deployed
Voice-design prompting for six narrator styles	VoxCPM2	Deployed
Zero-shot voice cloning from uploaded reference audio	VoxCPM2	Deployed
Img2img semantic redraw for matching coloring pages	FLUX.2-klein-4B	Deployed

Fine-tuning work in the repository

The repository also contains an experimental Kannada narration path referencing `sush0401/IndicF5-Kannada-Bedtime-v2`, a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored language-specific expressive narration and reference-voice support.

The current competition Space was simplified to one English storybook flow powered by VoxCPM2, so the Kannada checkpoint is not loaded by the deployed `app.py` path. It is supporting experimental work, not a dependency of the live demo.

The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style. No published LoRA is required by the current Space. The live visual consistency comes from canonical image conditioning, prompt constraints, and deterministic seeds.

This transparent separation makes the deployed result reproducible and avoids overstating fine-tuning that is not active in the public demo.

8. Coloring Book: Redraw, Do Not Trace

A naive coloring-book implementation applies edge detection to the finished crayon pages. That preserves every crayon grain, shadow, and background texture, producing noisy outlines that are unpleasant to color.

DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks for the same characters, action, emotion, and composition as clean line art. The model performs a semantic redraw rather than a literal texture trace.

A lightweight local cleanup then:

converts the result to grayscale
increases contrast
thresholds it toward pure black and white
removes small speckles

If the FLUX line-art pass fails, the application can fall back to local outline extraction.
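
The four cleanup steps listed above can be sketched in pure Python on a small grid of RGB tuples. A real implementation would more likely use an imaging library; the cutoff, minimum blob size, and function names here are assumptions.

```python
def to_grayscale(rgb):
    """Integer luma approximation per pixel."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in rgb]


def stretch_contrast(gray):
    """Rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        return [row[:] for row in gray]
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in gray]


def threshold(gray, cutoff=128):
    """Push every pixel to pure black (0) or pure white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in gray]


def remove_speckles(binary, min_size=3):
    """Whiten 4-connected black blobs smaller than min_size pixels."""
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if out[y][x] != 0 or seen[y][x]:
                continue
            stack, blob = [(y, x)], []
            seen[y][x] = True
            while stack:  # flood fill to collect one blob
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < height and 0 <= nx < width \
                            and not seen[ny][nx] and out[ny][nx] == 0:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 255
    return out


def clean_line_art(rgb):
    """Grayscale, stretch contrast, threshold, then remove speckles."""
    return remove_speckles(threshold(stretch_contrast(to_grayscale(rgb))))
```

Because the speckle pass works on connected blobs rather than single pixels, it keeps thin outlines intact while dropping isolated dots.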

9. PDFs and a Product Children Can Keep

The generated book is not limited to a temporary browser result. DoodleBook creates:

a browser-readable illustrated book
a downloadable illustrated story PDF
narration audio
an optional browser-readable coloring book
an optional printable coloring-book PDF

The PDF cover follows the same warm paper-and-crayon visual language as the application. This matters to the objective: the output becomes something a child can keep, share with family, read at bedtime, or color away from a screen.

10. ZeroGPU Engineering

The public application runs on Hugging Face Spaces using one ZeroGPU Space.

Important implementation decisions include:

loading the three primary models at module scope in the ZeroGPU-compatible pattern
separating story, image, coloring, and TTS work into GPU-decorated functions
using six-step FLUX generation at guidance scale 1.0 for interactive latency
running preset narration alongside illustration work
running custom voice cloning sequentially to avoid GPU contention
streaming heartbeat status updates during long generation stages
loading a pre-generated sample book immediately without requiring GPU allocation
preserving partial usability through story, image, and coloring fallback paths
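
The heartbeat idea from the list above can be sketched with a worker thread and a generator. `run_with_heartbeat` and its event tuples are hypothetical; Gradio generator handlers can stream yielded values to the UI, but how the Space wires this up is not shown here.

```python
import threading
import time


def run_with_heartbeat(work, label, interval=0.05):
    """Run work() in a thread, yielding ("status", message) until ("result", value)."""
    outcome = {}

    def target():
        try:
            outcome["value"] = work()
        except Exception as exc:  # keep the failure so it can be surfaced, not hidden
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    started = time.monotonic()
    worker.start()
    while True:
        worker.join(timeout=interval)
        if not worker.is_alive():
            break
        yield ("status", f"{label}... {time.monotonic() - started:.1f}s")
    if "error" in outcome:
        raise outcome["error"]
    yield ("result", outcome["value"])
```

A caller would forward each `"status"` event to a progress label and treat the single `"result"` event as the finished stage output.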

The UI is a custom Gradio Blocks interface rather than a default component stack. It uses paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining usable on desktop and mobile.

11. Open Trace and Reproducibility

Every generated book exposes a trace containing:

selected hero name
selected theme
selected narrator
coloring-book selection
model backend
character description
deterministic seed
story, image, narration, PDF, and coloring timings
rendering engine
surfaced load or generation errors
fallback details

This serves two audiences. Developers can understand performance and failure modes, while judges and users can see that the result came from a reproducible pipeline rather than a hidden manual process.
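
A trace of this kind can be modeled as a small dataclass with a timing context manager. The field names below follow the list above only loosely, and `Trace.stage` is a hypothetical helper, not the application's actual trace code.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    hero_name: str
    theme: str
    narrator: str
    seed: int
    timings: dict = field(default_factory=dict)    # stage name -> seconds
    errors: list = field(default_factory=list)     # surfaced load or generation errors
    fallbacks: list = field(default_factory=list)  # which fallback path was taken

    @contextmanager
    def stage(self, name, fallback=None):
        """Time a stage; record errors, and suppress them only if a fallback is named."""
        started = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors.append(f"{name}: {exc}")
            if fallback is None:
                raise
            self.fallbacks.append(f"{name} -> {fallback}")
        finally:
            self.timings[name] = time.perf_counter() - started
```

Usage might look like `with trace.stage("story", fallback="local theme arc"): ...`, after which the timing, the error text, and the fallback taken are all available for display.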

12. Why the Model Stack Is Small but Effective

DoodleBook's full stack is 7B parameters, and its authoring plus performance core is only 3B:

MiniCPM5-1B decides what happens and how each page should look.
VoxCPM2 performs the authored story and optionally adapts to a family voice.
FLUX.2-klein-4B renders the visual plan.

This division of responsibility is the project's Tiny Titan argument. A large monolithic model is not necessary when small specialist models communicate through a precise intermediate representation.

The JSON story object is that representation. It connects narrative text, character identity, page scenes, narration, browser rendering, and PDF export.

13. Hackathon Contributions

Tiny Titan

The complete application uses 7B parameters, with a 3B story-and-voice core.

OpenBMB

MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and provides the zero-shot custom voice feature.

Black Forest Labs

FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the optional coloring pages.

Hugging Face

Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and the Hub stores the code, article, demo assets, model metadata, and reproducible commit history.

Codex

Codex was used as the coding and documentation agent across the project. It helped implement and debug the app, refine prompts, update the story/image/audio generation instructions, prepare the technical article, maintain the README, create deployment commits, and publish updates to the Hugging Face Space.

Open Trace

Generation settings, timings, model state, and fallback information are visible in the app.

Off-Brand

The product has a custom scrapbook and crayon identity rather than a default Gradio appearance.

Field Notes

The repository includes a dedicated technical write-up on cross-page identity: Cross-Page Character Consistency Without Per-User Training.

14. Limitations

Character consistency is strong but not mathematically guaranteed; complex doodles can drift.
Six FLUX illustrations plus optional coloring redraws remain the longest stage.
Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording.
The current public UI is English-only.
The story validator can verify structure more easily than literary quality.
The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA.

These are concrete engineering boundaries, not hidden behavior.

15. Roadmap

The next technically justified improvements are:

Add automated story-quality scoring and one controlled rewrite pass.
Add stronger visual identity evaluation between the canonical hero and each page.
Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline.
Reintroduce multilingual narration only after the language-specific path meets the same reliability and latency requirements as the English Space.
Add explicit consent and retention controls around uploaded family voice references.
Add a short shareable video export combining page turns, narration, and captions.

16. Demo Video

The published demo video is available here:

Watch the MP4 demo

Open the Supademo walkthrough

View the X/Twitter post

It walks through the complete process:

Show the original child's doodle.
Select the hero name, a theme, and a narrator.
Start generation and show the streamed progress.
Play a short narration excerpt.
Reveal the same hero across several illustrated pages.
Download the story PDF.
Show the matching coloring-book page.
End on the 7B model stack and live Space URL.

The video is linked from both this article and the Space README.

Conclusion

DoodleBook is not simply text generation followed by image generation. It is one coordinated multimodal artifact built around a child's idea.

MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience publicly accessible.

The result is small enough to fit the hackathon's technical constraints, but complete enough to feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own.