DoodleBook: Turning a Child's Drawing into a Narrated, Illustrated Storybook
Build Small Hackathon 2026 - Adventure in Thousand Token Wood
Live demo: build-small-hackathon/DoodleBook
Demo video: MP4 demo and Supademo walkthrough
Social post: X/Twitter announcement
Source: github.com/Sushruths04/Doodle-book
License: Apache-2.0
Abstract
DoodleBook is a multimodal creative-learning application that turns one child's drawing into a complete six-page picture book. A child uploads a doodle, names the hero, selects one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate story, converts the doodle into a consistent storybook character, illustrates every page, narrates the story, exports a printable PDF, and can also produce a matching coloring book.
The complete deployed model stack is only 7B parameters:
| Role | Model | Parameters |
|---|---|---|
| Story author and scene planner | openbmb/MiniCPM5-1B | 1B |
| Expressive narration and zero-shot voice cloning | openbmb/VoxCPM2 | 2B |
| Character design, illustrations, and coloring-page redraws | black-forest-labs/FLUX.2-klein-4B | 4B |
The central product idea is that the child's drawing is not merely an input image. It is the identity anchor for the whole experience. The same hero appears in the narrative, the six illustrations, the narration, the story PDF, and the optional coloring book.
1. Objective
Children rarely see their own imperfect drawings treated as finished creative work. Generative applications can produce polished images, but they often replace the child's idea rather than extend it. DoodleBook has a different objective:
Preserve the personality of a child's drawing and make that drawing the hero of a meaningful story the child can read, hear, keep, print, and color.
That objective creates five technical requirements:
- The story must be joyful, understandable, and emotionally meaningful for a young child.
- The selected theme must influence the actual lesson and plot, not just appear as a label.
- The hero must remain visually recognizable across all six illustrations.
- The final audio must match the selected narrator or the uploaded family voice.
- The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints.
2. The User Experience
The current Space deliberately keeps the input small:
- Upload or photograph a drawing.
- Enter the hero's name.
- Select one of ten story themes.
- Select a narrator, including the optional My Voice mode.
- Choose whether to generate a matching coloring book.
- Select Make my book.
The available themes are:
- brave adventure
- making a new friend
- overcoming a fear
- helping someone
- lost and found
- learning something new
- kindness to animals
- the magic of imagination
- celebrating who you are
- a rainy day adventure
The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice. For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice cloning from that reference.
The result is revealed in a deliberate order:
- Narration audio becomes available first.
- The illustrated story PDF download appears next.
- The full illustrated pages are revealed last.
- If requested, the coloring-book preview and PDF are also produced.
This order gives the child something to hear as soon as possible while the heavier document and page-rendering work completes.
3. End-to-End Architecture
    Child's doodle + hero name + theme + narrator
                        |
                        v
           MiniCPM5-1B story generation
         title + character + text + scenes
                /                \
               /                  \
              v                    v
    FLUX canonical hero      VoxCPM2 narration
              |              preset or cloned voice
              v
    FLUX six page renders
              |
      +-------+--------+
      |                |
      v                v
    Illustrated PDF   FLUX line-art redraw
                       |
                       v
                  Coloring-book PDF
The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces.
The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy
stages are isolated behind @spaces.GPU functions.
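The module-scope pattern can be sketched roughly as follows. This is a minimal illustration, not the project's actual app.py: `spaces.GPU` is the real ZeroGPU decorator, but `load_story_model`, `STORY_MODEL`, and `generate_story` are hypothetical names, and the loader returns a placeholder instead of a real model. The fallback decorator is only there so the sketch also runs outside a Space.

```python
# Sketch of the ZeroGPU module-scope pattern. All function names are illustrative.
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    def gpu(fn):
        """No-op stand-in so the sketch also runs outside a Space."""
        return fn


def load_story_model():
    """Hypothetical loader; a real app would construct its model here."""
    return {"name": "openbmb/MiniCPM5-1B"}


# Loaded once at import time (module scope), not inside the GPU function.
STORY_MODEL = load_story_model()


@gpu
def generate_story(hero: str, theme: str) -> str:
    """GPU-heavy work is isolated behind the decorated function."""
    return f"[{STORY_MODEL['name']}] {hero}: {theme}"
```

Keeping the load at module scope means the decorated function only performs inference, which is the part that needs a GPU allocation.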
4. MiniCPM5-1B: Story Author and Visual Planner
MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing:
- a short book title
- a reusable visual description of the hero
- six page texts
- six corresponding illustration scene descriptions
Each scene description is consumed directly by the image pipeline, so the language model acts as both author and visual director.
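As an illustration of what such a plan might look like once parsed, here is a small sketch. The field names (`title`, `character`, `pages`, `text`, `scene`) are assumptions chosen to mirror the list above; the actual keys used by the application are not stated in this article.

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    text: str   # what is read aloud on this page
    scene: str  # illustration description handed to the image pipeline


@dataclass
class StoryPlan:
    title: str
    character: str  # reusable visual description of the hero
    pages: list


def parse_plan(raw: str) -> StoryPlan:
    """Parse a JSON plan and enforce the six-page contract."""
    data = json.loads(raw)
    pages = [Page(text=p["text"], scene=p["scene"]) for p in data["pages"]]
    if len(pages) != 6:
        raise ValueError(f"expected 6 pages, got {len(pages)}")
    return StoryPlan(title=data["title"], character=data["character"], pages=pages)
```

Because every page carries both `text` and `scene`, the same object can feed the narrator and the illustrator without a second planning step.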
Theme-conditioned meaning
Every selectable theme has its own narrative guidance. For example:
- Overcoming a fear treats fear respectfully and shows courage as a supported small step.
- Making a new friend rewards listening, sharing, or a sincere hello rather than popularity.
- Learning something new includes an imperfect attempt, guidance, practice, and improvement.
- Celebrating who you are focuses on self-acceptance without making the hero superior.
- Kindness to animals emphasizes gentle, age-appropriate care and trusted adult support.
This prevents ten buttons from producing the same generic adventure with different titles.
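One way to wire theme guidance into a prompt is a simple lookup table, sketched below. The guidance strings paraphrase the examples above; the dictionary, the default text, and `build_prompt` are hypothetical and do not reproduce the project's real prompt.

```python
# Hypothetical guidance table; only a few of the ten themes are shown.
THEME_GUIDANCE = {
    "overcoming a fear": "Treat the fear respectfully and show courage as a supported small step.",
    "making a new friend": "Reward listening, sharing, or a sincere hello rather than popularity.",
    "learning something new": "Include an imperfect attempt, guidance, practice, and improvement.",
}

# Assumed default for themes that have no entry in this sketch.
DEFAULT_GUIDANCE = "Tell one coherent, joyful story whose lesson is shown through action."


def build_prompt(hero: str, theme: str) -> str:
    """Combine the hero, the theme, and its guidance into one instruction."""
    guidance = THEME_GUIDANCE.get(theme, DEFAULT_GUIDANCE)
    return (
        f"Write a six-page picture book about {hero}.\n"
        f"Theme: {theme}.\n"
        f"Guidance: {guidance}\n"
        "Return valid JSON only, with exactly six page objects."
    )
```

The point of the table is that the theme changes the instruction text itself, not just a label in the output.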
Prompt design for a small model
The prompt is detailed but organized into a small number of explicit responsibilities:
- Story quality: one coherent arc, natural read-aloud language, dialogue, humor, and purposeful sound effects such as BOOM!, WHOOSH!, or SPLASH!.
- Emotional meaning: the lesson must follow from what the hero actually chooses and does.
- Page progression: introduction, first attempt, complication, realization, action, and emotional payoff.
- Visual continuity: the hero remains active on every page and retains the same memorable physical traits.
- Illustration planning: each scene describes one drawable moment with visible characters, action, setting, props, light, color mood, and emotion.
- Strict output: valid JSON only, with exactly six page objects.
A complete six-page exemplar establishes the quality target. The example demonstrates warmth, continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through action rather than attached as an unrelated final sentence.
Reliability
The output parser extracts and repairs JSON when possible. If language-model generation fails, the application has local theme-based story arcs so the user still receives a book rather than a broken session. Errors and fallback information are exposed in the trace instead of being hidden.
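The extract, repair, and fall back behavior can be sketched like this. It is a simplified illustration under stated assumptions: the repair step only handles surrounding chatter and trailing commas, and `story_or_fallback`, `fallback_arcs`, and the trace keys are hypothetical names, not the application's real interface.

```python
import json
import re


def extract_json(raw: str):
    """Pull the outermost {...} block out of model output and try to parse it."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]
    # Light repair: drop trailing commas before a closing brace or bracket.
    # (Naive: it would also touch such a pattern inside a string value.)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def story_or_fallback(raw: str, theme: str, fallback_arcs: dict, trace: dict):
    """Use the model's plan when valid; otherwise use a local arc and record why."""
    plan = extract_json(raw)
    if plan is not None and len(plan.get("pages", [])) == 6:
        trace["story_source"] = "model"
        return plan
    trace["story_source"] = "fallback"
    trace["story_error"] = "model output was not a valid six-page plan"
    return fallback_arcs[theme]
```

Writing the outcome into the trace keeps the fallback visible rather than silent.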
5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages
The hardest visual problem is cross-page character consistency. Six independent text-to-image requests tend to produce six different heroes. Color, clothing, face shape, body proportions, and accessories can drift enough to break the feeling that this is one book.
DoodleBook solves this without training a new model for every child.
Stage 1: canonical-character generation
The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to:
- preserve the creature or person type
- preserve face, body shape, colors, markings, clothing, and accessories
- clarify unclear lines without replacing the child's idea
- produce one full-body hero on a neutral background
- avoid extra characters, scenery, text, or duplicate views
This output becomes a canonical model sheet for the book.
Stage 2: reference-conditioned page generation
Each story scene is rendered using the same canonical hero image. The page prompt explicitly locks:
- face and species
- body proportions
- colors and markings
- clothing and accessories
- child-safe crayon storybook style
The scene still changes from page to page, but the identity reference remains constant. Deterministic seeds provide reproducibility while page-specific seed offsets allow variation.
If the user does not upload a doodle, the structured character description from MiniCPM5-1B becomes the text-based identity anchor.
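The seed scheme described above might look like the following sketch. How the application actually derives its base seed is not stated here, so hashing the hero name and theme is an assumption, as are the function names and the stride value.

```python
import hashlib


def base_seed(hero_name: str, theme: str) -> int:
    """Derive a repeatable 32-bit seed from the inputs (assumed scheme)."""
    digest = hashlib.sha256(f"{hero_name}|{theme}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def page_seeds(seed: int, pages: int = 6, stride: int = 1000) -> list:
    """Offset the base seed per page so pages differ but stay reproducible."""
    return [(seed + i * stride) % 2**32 for i in range(pages)]
```

The same inputs always give the same six seeds, while each page still gets its own value so compositions do not repeat.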
Why this approach matters
Per-user LoRA training would add a training job before every book. That is too slow and operationally expensive for an interactive children's application. Canonical image conditioning provides personalization at inference time, using the child's actual drawing as the reference.
6. VoxCPM2: Expressive Narration and Custom Family Voices
VoxCPM2 converts the title and all six page texts into one narrated story. The application splits the text into sentences, generates speech for each sentence, and inserts short pauses between them so the result sounds like a book being read rather than one continuous block.
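The split-and-pause step can be illustrated without any audio library. In this sketch, clips are plain lists of float samples; the sample rate, the pause length, and both function names are assumptions, and the real application would pass each sentence to VoxCPM2 rather than use pre-made clips.

```python
import re

SAMPLE_RATE = 16000  # assumed for illustration only


def split_sentences(text: str) -> list:
    """Split on whitespace that follows ., !, or ? so sound effects stay separate."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def join_with_pauses(clips, pause_seconds=0.35, sample_rate=SAMPLE_RATE) -> list:
    """Concatenate per-sentence clips with a short silence between them."""
    silence = [0.0] * int(pause_seconds * sample_rate)
    out = []
    for i, clip in enumerate(clips):
        if i:  # no pause before the first sentence
            out.extend(silence)
        out.extend(clip)
    return out
```

Splitting after punctuation keeps an exclamation such as BOOM! as its own unit, so it can be performed separately from the sentence that follows.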
Narrator presets
Each narrator option uses a separate voice-design instruction:
- Little Kid: bright, playful, and full of wonder
- Big Kid: confident, youthful, and energetic
- Playful: animated delivery and light comic timing
- Storyteller: soft, soothing bedtime pacing
- Grandpa: warm, patient, and reassuring
- My Voice: preserves the uploaded reference speaker while adding clear storytelling delivery
The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens BOOM!; a playful voice gives it energy without becoming harsh.
My Voice: zero-shot cloning, not fine-tuning
The deployed custom voice feature does not train a new audio model. VoxCPM2 is loaded with its denoiser enabled and receives the uploaded recording through reference_wav_path. It performs zero-shot voice cloning at inference time.
This distinction is important:
- no user-specific checkpoint is created
- no training job is required
- the user can hear a familiar family voice from one short reference recording
- custom-voice narration is scheduled sequentially after images to avoid concurrent ZeroGPU contention
Preset narration can run in parallel with illustration generation. Custom cloning is slower, so the number of cloned sentences is capped to fit the public GPU budget.
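The scheduling difference between preset and custom narration can be sketched as below. The `render_pages` and `narrate` callables stand in for the real image and speech stages, and the cap of 12 sentences is an illustrative assumption, since the actual limit is not given in this article.

```python
from concurrent.futures import ThreadPoolExecutor


def run_book(sentences, narrator, render_pages, narrate, max_cloned=12):
    """Run narration in parallel for presets, sequentially for a cloned voice."""
    if narrator == "My Voice":
        # Custom cloning: images first, then a capped narration pass.
        pages = render_pages()
        audio = narrate(sentences[:max_cloned])
        return pages, audio
    # Preset voice: narration runs alongside illustration work.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(narrate, sentences)
        pages = render_pages()
        return pages, audio_future.result()
```

Sequencing the cloned voice after the images trades some latency for predictable GPU use.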
7. Fine-Tuning and Adaptation: What We Built
DoodleBook uses several forms of model adaptation, but they should not be confused.
Deployed in the current Space
| Technique | Model | Status |
|---|---|---|
| Theme-specific structured prompting and few-shot story guidance | MiniCPM5-1B | Deployed |
| Canonical-image conditioning from the child's doodle | FLUX.2-klein-4B | Deployed |
| Deterministic seed and character-description identity anchors | FLUX.2-klein-4B | Deployed |
| Voice-design prompting for six narrator styles | VoxCPM2 | Deployed |
| Zero-shot voice cloning from uploaded reference audio | VoxCPM2 | Deployed |
| Img2img semantic redraw for matching coloring pages | FLUX.2-klein-4B | Deployed |
Fine-tuning work in the repository
The repository also contains an experimental Kannada narration path referencing
sush0401/IndicF5-Kannada-Bedtime-v2,
a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored
language-specific expressive narration and reference-voice support.
The current competition Space was simplified to one English storybook flow powered by VoxCPM2,
so the Kannada checkpoint is not loaded by the deployed app.py path. It is supporting
experimental work, not a dependency of the live demo.
The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style. No published LoRA is required by the current Space. The live visual consistency comes from canonical image conditioning, prompt constraints, and deterministic seeds.
This transparent separation makes the deployed result reproducible and avoids overstating fine-tuning that is not active in the public demo.
8. Coloring Book: Redraw, Do Not Trace
A naive coloring-book implementation applies edge detection to the finished crayon pages. That preserves every crayon grain, shadow, and background texture, producing noisy outlines that are unpleasant to color.
DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks for the same characters, action, emotion, and composition as clean line art. The model performs a semantic redraw rather than a literal texture trace.
A lightweight local cleanup then:
- converts the result to grayscale
- increases contrast
- thresholds it toward pure black and white
- removes small speckles
If the FLUX line-art pass fails, the application can fall back to local outline extraction.
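The four cleanup steps can be shown on a tiny image held as nested lists. This is a dependency-free illustration only: the application presumably uses an imaging library, and the contrast factor, cutoff, and minimum blob size here are assumed values.

```python
def to_gray(rgb_rows):
    """Integer luma approximation for rows of (r, g, b) tuples."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row] for row in rgb_rows]


def stretch_contrast(gray, factor=1.5):
    """Push values away from mid-gray, clamped to 0..255."""
    return [[max(0, min(255, int(128 + (v - 128) * factor))) for v in row] for row in gray]


def threshold(gray, cutoff=128):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in gray]


def remove_speckles(bw, min_size=3):
    """Whiten 4-connected black blobs smaller than min_size pixels."""
    h = len(bw)
    w = len(bw[0]) if h else 0
    out = [row[:] for row in bw]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if out[y][x] != 0 or seen[y][x]:
                continue
            stack, blob = [(y, x)], []
            seen[y][x] = True
            while stack:  # flood fill one black blob
                cy, cx = stack.pop()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and out[ny][nx] == 0:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 255
    return out
```

Applied in order, these steps turn a soft redraw into crisp outlines and drop isolated dots that would otherwise print as noise.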
9. PDFs and a Product Children Can Keep
The generated book is not limited to a temporary browser result. DoodleBook creates:
- a browser-readable illustrated book
- a downloadable illustrated story PDF
- narration audio
- an optional browser-readable coloring book
- an optional printable coloring-book PDF
The PDF cover follows the same warm paper-and-crayon visual language as the application. This matters to the objective: the output becomes something a child can keep, share with family, read at bedtime, or color away from a screen.
10. ZeroGPU Engineering
The public application runs on Hugging Face Spaces using one ZeroGPU Space.
Important implementation decisions include:
- loading the three primary models at module scope in the ZeroGPU-compatible pattern
- separating story, image, coloring, and TTS work into GPU-decorated functions
- using six-step FLUX generation at guidance scale 1.0 for interactive latency
- running preset narration alongside illustration work
- running custom voice cloning sequentially to avoid GPU contention
- streaming heartbeat status updates during long generation stages
- loading a pre-generated sample book immediately without requiring GPU allocation
- preserving partial usability through story, image, and coloring fallback paths
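Heartbeat updates during a long stage could be produced by a generator like the one below, which suits Gradio because event handlers may be generators that yield intermediate outputs. The helper name, the message format, and the interval are hypothetical; this is a sketch of the idea rather than the application's code.

```python
import threading
import time


def with_heartbeat(work, label, interval=0.5):
    """Run work() in a thread and yield (status, result) pairs until it finishes."""
    result = {}

    def runner():
        try:
            result["value"] = work()
        except Exception as exc:  # surfaced to the caller below
            result["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    started = time.monotonic()
    thread.start()
    while thread.is_alive():
        thread.join(timeout=interval)
        if thread.is_alive():
            # Still running: emit a status line with no result yet.
            yield f"{label}... {time.monotonic() - started:.0f}s", None
    if "error" in result:
        raise result["error"]
    yield f"{label} done", result["value"]
```

A handler would iterate over this generator and forward each status string to the UI, showing the final value once it arrives.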
The UI is a custom Gradio Blocks interface rather than a default component stack. It uses paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining usable on desktop and mobile.
11. Open Trace and Reproducibility
Every generated book exposes a trace containing:
- selected hero name
- selected theme
- selected narrator
- coloring-book selection
- model backend
- character description
- deterministic seed
- story, image, narration, PDF, and coloring timings
- rendering engine
- surfaced load or generation errors
- fallback details
This serves two audiences. Developers can understand performance and failure modes, while judges and users can see that the result came from a reproducible pipeline rather than a hidden manual process.
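A trace of this kind can be assembled with a small timing helper, as in the sketch below. The key names and the `timed` context manager are assumptions made for illustration; the real trace format may differ.

```python
import time
from contextlib import contextmanager


def new_trace(hero, theme, narrator, coloring, seed):
    """Start a trace with the user's selections and empty result fields."""
    return {
        "hero": hero,
        "theme": theme,
        "narrator": narrator,
        "coloring_book": coloring,
        "seed": seed,
        "timings": {},
        "errors": [],
        "fallbacks": [],
    }


@contextmanager
def timed(trace, stage):
    """Record how long a stage took, and note any error it raised."""
    start = time.perf_counter()
    try:
        yield
    except Exception as exc:
        trace["errors"].append(f"{stage}: {exc}")
        raise
    finally:
        trace["timings"][stage] = round(time.perf_counter() - start, 3)
```

Wrapping each stage as `with timed(trace, "story"): ...` records a timing even when the stage fails, so a failed stage still appears in the trace.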
12. Why the Model Stack Is Small but Effective
DoodleBook's full stack is 7B parameters, and its authoring plus performance core (MiniCPM5-1B and VoxCPM2) is only 3B:
- MiniCPM5-1B decides what happens and how each page should look.
- VoxCPM2 performs the authored story and optionally adapts to a family voice.
- FLUX.2-klein-4B renders the visual plan.
This division of responsibility is the project's Tiny Titan argument. A large monolithic model is not necessary when small specialist models communicate through a precise intermediate representation.
The JSON story object is that representation. It connects narrative text, character identity, page scenes, narration, browser rendering, and PDF export.
13. Hackathon Contributions
Tiny Titan
The complete application uses 7B parameters, with a 3B story-and-voice core.
OpenBMB
MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and provides the zero-shot custom voice feature.
Black Forest Labs
FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the optional coloring pages.
Hugging Face
Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and the Hub stores the code, article, demo assets, model metadata, and reproducible commit history.
Codex
Codex was used as the coding and documentation agent across the project. It helped implement and debug the app, refine prompts, update the story/image/audio generation instructions, prepare the technical article, maintain the README, create deployment commits, and publish updates to the Hugging Face Space.
Open Trace
Generation settings, timings, model state, and fallback information are visible in the app.
Off-Brand
The product has a custom scrapbook and crayon identity rather than a default Gradio appearance.
Field Notes
The repository includes a dedicated technical write-up on cross-page identity: Cross-Page Character Consistency Without Per-User Training.
14. Limitations
- Character consistency is strong but not mathematically guaranteed; complex doodles can drift.
- Six FLUX illustrations plus optional coloring redraws remain the longest stage.
- Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording.
- The current public UI is English-only.
- The story validator can verify structure more easily than literary quality.
- The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA.
These are concrete engineering boundaries, not hidden behavior.
15. Roadmap
The next technically justified improvements are:
- Add automated story-quality scoring and one controlled rewrite pass.
- Add stronger visual identity evaluation between the canonical hero and each page.
- Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline.
- Reintroduce multilingual narration only after the language-specific path meets the same reliability and latency requirements as the English Space.
- Add explicit consent and retention controls around uploaded family voice references.
- Add a short shareable video export combining page turns, narration, and captions.
16. Demo Video
The published demo video, linked at the top of this article, walks through the complete process:
- Show the original child's doodle.
- Select the hero name, a theme, and a narrator.
- Start generation and show the streamed progress.
- Play a short narration excerpt.
- Reveal the same hero across several illustrated pages.
- Download the story PDF.
- Show the matching coloring-book page.
- End on the 7B model stack and live Space URL.
The video is linked from both this article and the Space README.
Conclusion
DoodleBook is not simply text generation followed by image generation. It is one coordinated multimodal artifact built around a child's idea.
MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience publicly accessible.
The result is small enough to fit the hackathon's technical constraints, but complete enough to feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own.