| # DoodleBook: Turning a Child's Drawing into a Narrated, Illustrated Storybook | |
| **Build Small Hackathon 2026 - Adventure in Thousand Token Wood** | |
| **Live demo:** [build-small-hackathon/DoodleBook](https://huggingface.co/spaces/build-small-hackathon/DoodleBook) | |
| **Demo video:** [MP4 demo](demo-doodlebook.mp4) and [Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link) | |
| **Social post:** [X/Twitter announcement](https://x.com/sushruthsgowda/status/2066639063168225452?s=46) | |
| **Source:** [github.com/Sushruths04/Doodle-book](https://github.com/Sushruths04/Doodle-book) | |
| **License:** Apache-2.0 | |
| --- | |
| ## Abstract | |
| DoodleBook is a multimodal creative-learning application that turns one child's drawing | |
| into a complete six-page picture book. A child uploads a doodle, names the hero, selects | |
| one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate | |
| story, converts the doodle into a consistent storybook character, illustrates every page, | |
| narrates the story, exports a printable PDF, and can also produce a matching coloring book. | |
| The complete deployed model stack totals only **7B parameters** (1B + 2B + 4B): | |
| | Role | Model | Parameters | | |
| |---|---|---:| | |
| | Story author and scene planner | [`openbmb/MiniCPM5-1B`](https://huggingface.co/openbmb/MiniCPM5-1B) | 1B | | |
| | Expressive narration and zero-shot voice cloning | [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) | 2B | | |
| | Character design, illustrations, and coloring-page redraws | [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) | 4B | | |
| The central product idea is that the child's drawing is not merely an input image. It is | |
| the identity anchor for the whole experience. The same hero appears in the narrative, the | |
| six illustrations, the narration, the story PDF, and the optional coloring book. | |
| --- | |
| ## 1. Objective | |
| Children rarely see their own imperfect drawings treated as finished creative work. | |
| Generative applications can produce polished images, but they often replace the child's | |
| idea rather than extend it. DoodleBook has a different objective: | |
| > Preserve the personality of a child's drawing and make that drawing the hero of a | |
| > meaningful story the child can read, hear, keep, print, and color. | |
| That objective creates five technical requirements: | |
| 1. The story must be joyful, understandable, and emotionally meaningful for a young child. | |
| 2. The selected theme must influence the actual lesson and plot, not just appear as a label. | |
| 3. The hero must remain visually recognizable across all six illustrations. | |
| 4. The final audio must match the selected narrator or the uploaded family voice. | |
| 5. The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints. | |
| --- | |
| ## 2. The User Experience | |
| The current Space deliberately keeps the input small: | |
| 1. Upload or photograph a drawing. | |
| 2. Enter the hero's name. | |
| 3. Select one of ten story themes. | |
| 4. Select a narrator, including the optional **My Voice** mode. | |
| 5. Choose whether to generate a matching coloring book. | |
| 6. Select **Make my book**. | |
| The available themes are: | |
| - brave adventure | |
| - making a new friend | |
| - overcoming a fear | |
| - helping someone | |
| - lost and found | |
| - learning something new | |
| - kindness to animals | |
| - the magic of imagination | |
| - celebrating who you are | |
| - a rainy day adventure | |
| The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice. | |
| For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice | |
| cloning from that reference. | |
| The result is revealed in a deliberate order: | |
| 1. Narration audio becomes available first. | |
| 2. The illustrated story PDF download appears next. | |
| 3. The full illustrated pages are revealed last. | |
| 4. If requested, the coloring-book preview and PDF are also produced. | |
| This order gives the child something to hear as soon as possible while the heavier document | |
| and page-rendering work completes. | |
| --- | |
| ## 3. End-to-End Architecture | |
| ```text | |
| Child's doodle + hero name + theme + narrator | |
| | | |
| v | |
| MiniCPM5-1B story generation | |
| title + character + text + scenes | |
| / \ | |
| / \ | |
| v v | |
| FLUX canonical hero VoxCPM2 narration | |
| | preset or cloned voice | |
| v | |
| FLUX six page renders | |
| | | |
| +-------+--------+ | |
| | | | |
| v v | |
| Illustrated PDF FLUX line-art redraw | |
| | | |
| v | |
| Coloring-book PDF | |
| ``` | |
| The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces. | |
| The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy | |
| stages are isolated behind `@spaces.GPU` functions. | |
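The loading pattern described above can be sketched as follows. This is a minimal illustration, not the application's actual code: `load_story_model`, `STORY_MODEL`, and `generate_story` are hypothetical names, the loader returns a placeholder instead of real weights, and a no-op decorator stands in when the `spaces` package is not installed.

```python
# Sketch of module-scope loading plus a GPU-decorated stage (illustrative names).
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local stand-in so the sketch also runs without the `spaces` package.
    def gpu(fn):
        return fn


def load_story_model():
    """Hypothetical loader; a real app would load model weights here."""
    return {"name": "story-model-placeholder"}


# Loaded once at import time, outside any GPU-decorated function.
STORY_MODEL = load_story_model()


@gpu
def generate_story(hero_name: str, theme: str) -> str:
    """GPU-heavy work stays behind the decorator."""
    return f"{hero_name} begins a story about {theme} ({STORY_MODEL['name']})."


print(generate_story("Momo", "brave adventure"))
```

Keeping the load at module scope means the decorated function only performs inference, which is the part that needs a GPU allocation.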
| --- | |
| ## 4. MiniCPM5-1B: Story Author and Visual Planner | |
| MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing: | |
| - a short book title | |
| - a reusable visual description of the hero | |
| - six page texts | |
| - six corresponding illustration scene descriptions | |
| Each scene description is consumed directly by the image pipeline, so the language model | |
| acts as both **author** and **visual director**. | |
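One way to picture the plan is as a small typed structure. The sketch below is an assumption about its shape: the key names `title`, `character`, `pages`, `text`, and `scene` are hypothetical and chosen only to mirror the four items listed above.

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    text: str   # what is read aloud on this page
    scene: str  # what the illustrator should draw


@dataclass
class StoryPlan:
    title: str
    character: str  # reusable visual description of the hero
    pages: list[Page]


def parse_plan(raw: str) -> StoryPlan:
    """Parse a JSON plan and insist on exactly six pages."""
    data = json.loads(raw)
    pages = [Page(text=p["text"], scene=p["scene"]) for p in data["pages"]]
    if len(pages) != 6:
        raise ValueError(f"expected 6 pages, got {len(pages)}")
    return StoryPlan(title=data["title"], character=data["character"], pages=pages)


# Hypothetical example input, built in code so the sketch is self-contained.
example = json.dumps({
    "title": "Momo and the Puddle",
    "character": "a small green dragon with a red scarf",
    "pages": [{"text": f"Page {i} text.", "scene": f"Scene {i}."} for i in range(1, 7)],
})
plan = parse_plan(example)
print(plan.title, len(plan.pages))
```

Because each `scene` travels with its `text`, the image stage and the narration stage can both read from the same object.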
| ### Theme-conditioned meaning | |
| Every selectable theme has its own narrative guidance. For example: | |
| - **Overcoming a fear** treats fear respectfully and shows courage as a supported small step. | |
| - **Making a new friend** rewards listening, sharing, or a sincere hello rather than popularity. | |
| - **Learning something new** includes an imperfect attempt, guidance, practice, and improvement. | |
| - **Celebrating who you are** focuses on self-acceptance without making the hero superior. | |
| - **Kindness to animals** emphasizes gentle, age-appropriate care and trusted adult support. | |
| This prevents ten buttons from producing the same generic adventure with different titles. | |
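A simple way to wire per-theme guidance into a prompt is a lookup table. The sketch below is illustrative only: the guidance strings paraphrase the bullets above, the table covers just three of the ten themes, and `THEME_GUIDANCE`, `DEFAULT_GUIDANCE`, and `build_theme_block` are hypothetical names.

```python
# Guidance text paraphrased from the article; only three themes shown.
THEME_GUIDANCE = {
    "overcoming a fear": (
        "Treat the fear respectfully and show courage as a supported small step."
    ),
    "making a new friend": (
        "Reward listening, sharing, or a sincere hello rather than popularity."
    ),
    "learning something new": (
        "Include an imperfect attempt, guidance, practice, and improvement."
    ),
}

# Assumed default for themes missing from the table.
DEFAULT_GUIDANCE = "Let the lesson follow from what the hero chooses and does."


def build_theme_block(theme: str) -> str:
    """Return the theme section of a story prompt."""
    guidance = THEME_GUIDANCE.get(theme.strip().lower(), DEFAULT_GUIDANCE)
    return f"Theme: {theme}\nGuidance: {guidance}"


print(build_theme_block("Overcoming a fear"))
```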
| ### Prompt design for a small model | |
| The prompt is detailed but organized into a small number of explicit responsibilities: | |
| 1. **Story quality:** one coherent arc, natural read-aloud language, dialogue, humor, and | |
| purposeful sound effects such as `BOOM!`, `WHOOSH!`, or `SPLASH!`. | |
| 2. **Emotional meaning:** the lesson must follow from what the hero actually chooses and does. | |
| 3. **Page progression:** introduction, first attempt, complication, realization, action, | |
| and emotional payoff. | |
| 4. **Visual continuity:** the hero remains active on every page and retains the same memorable | |
| physical traits. | |
| 5. **Illustration planning:** each scene describes one drawable moment with visible characters, | |
| action, setting, props, light, color mood, and emotion. | |
| 6. **Strict output:** valid JSON only, with exactly six page objects. | |
| A complete six-page exemplar establishes the quality target. The example demonstrates warmth, | |
| continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through | |
| action rather than attached as an unrelated final sentence. | |
| ### Reliability | |
| The output parser extracts and repairs JSON when possible. If language-model generation fails, | |
| the application has local theme-based story arcs so the user still receives a book rather than | |
| a broken session. Errors and fallback information are exposed in the trace instead of being | |
| hidden. | |
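The extract, repair, and fall back flow might look like the sketch below. It is an assumption about the approach rather than the application's parser: it handles a single repair (trailing commas), and `extract_json`, `story_or_fallback`, and the `fallback_arcs` mapping are hypothetical.

```python
import json
import re


def extract_json(raw: str):
    """Return the JSON object embedded in raw model output, or None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start : end + 1]
    # Single illustrative repair: drop trailing commas before } or ].
    # (This could also alter a string value that contains ",}" or ",]".)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def story_or_fallback(raw: str, theme: str, fallback_arcs: dict):
    """Use the model's plan when usable, else a local arc; record what happened."""
    trace = {"fallback": False, "error": None}
    plan = extract_json(raw)
    pages = plan.get("pages") if plan is not None else None
    if not isinstance(pages, list) or len(pages) != 6:
        trace.update(fallback=True, error="unusable model output")
        plan = fallback_arcs[theme]
    return plan, trace


noisy = 'Sure! Here is the book:\n{"title": "T", "pages": [1, 2, 3, 4, 5, 6,],}\nEnjoy!'
print(extract_json(noisy))
```

Returning the trace alongside the plan keeps the fallback visible to the caller instead of hiding it.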
| --- | |
| ## 5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages | |
| The hardest visual problem is cross-page character consistency. Six independent text-to-image | |
| requests tend to produce six different heroes. Color, clothing, face shape, body proportions, | |
| and accessories can drift enough to break the feeling that this is one book. | |
| DoodleBook solves this without training a new model for every child. | |
| ### Stage 1: canonical-character generation | |
| The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to: | |
| - preserve the creature or person type | |
| - preserve face, body shape, colors, markings, clothing, and accessories | |
| - clarify unclear lines without replacing the child's idea | |
| - produce one full-body hero on a neutral background | |
| - avoid extra characters, scenery, text, or duplicate views | |
| This output becomes a canonical model sheet for the book. | |
| ### Stage 2: reference-conditioned page generation | |
| Each story scene is rendered using the same canonical hero image. The page prompt explicitly | |
| locks: | |
| - face and species | |
| - body proportions | |
| - colors and markings | |
| - clothing and accessories | |
| - child-safe crayon storybook style | |
| The scene still changes from page to page, but the identity reference remains constant. | |
| Deterministic seeds provide reproducibility while page-specific seed offsets allow variation. | |
| If the user does not upload a doodle, the structured character description from MiniCPM5-1B | |
| becomes the text-based identity anchor. | |
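The page-planning step can be sketched without any image model. Everything named here is an assumption for illustration: the `base_seed + index` offset scheme, the `PageRequest` structure, and the wording of `IDENTITY_LOCK` (which only paraphrases the locked traits listed above).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRequest:
    prompt: str
    seed: int
    reference_image: Optional[str]  # path to the canonical hero image, if any


IDENTITY_LOCK = (
    "Keep the same face and species, body proportions, colors and markings, "
    "clothing and accessories. Child-safe crayon storybook style."
)


def plan_pages(scenes, base_seed, canonical_hero=None, character_description=""):
    """Build one request per scene with a shared reference and per-page seed."""
    requests = []
    for index, scene in enumerate(scenes):
        prompt = f"{scene} {IDENTITY_LOCK}"
        if canonical_hero is None:
            # No doodle uploaded: the text description becomes the anchor.
            prompt = f"Hero: {character_description}. {prompt}"
        requests.append(PageRequest(prompt, base_seed + index, canonical_hero))
    return requests


pages = plan_pages(["Momo finds a puddle.", "Momo jumps in."], 1234, "hero.png")
print([p.seed for p in pages])
```

The same inputs always produce the same seeds, so a book can be regenerated, while the offset keeps pages from sharing one noise pattern.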
| ### Why this approach matters | |
| Per-user LoRA training would add a training job before every book. That is too slow and | |
| operationally expensive for an interactive children's application. Canonical image conditioning | |
| provides personalization at inference time, using the child's actual drawing as the reference. | |
| --- | |
| ## 6. VoxCPM2: Expressive Narration and Custom Family Voices | |
| VoxCPM2 converts the title and all six page texts into one narrated story. The application | |
| splits the text into sentences, generates speech for each sentence, and inserts short pauses | |
| between them so the result sounds like a book being read rather than one continuous block. | |
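The split-and-pause assembly can be sketched with a stand-in synthesizer. This is not the VoxCPM2 API: `fake_tts` is a placeholder that returns dummy samples, and the sample rate and pause length are assumed values.

```python
import re

SAMPLE_RATE = 16_000  # assumed sample rate
PAUSE_MS = 350        # assumed pause between sentences


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_tts(sentence: str) -> list[float]:
    """Placeholder for the speech model: 100 dummy samples per character."""
    return [0.1] * (len(sentence) * 100)


def narrate(title: str, pages: list[str], synthesize=fake_tts) -> list[float]:
    """Synthesize sentence by sentence and insert silence after each one."""
    pause = [0.0] * (SAMPLE_RATE * PAUSE_MS // 1000)
    audio: list[float] = []
    for block in [title, *pages]:
        for sentence in split_sentences(block):
            audio.extend(synthesize(sentence))
            audio.extend(pause)
    return audio


print(split_sentences("Momo peeked out. BOOM! Was it thunder?"))
```

A regex split like this treats `BOOM!` as its own sentence, which is convenient for sound effects but would also split dialogue tags such as `"Wow!" said Momo.`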
| ### Narrator presets | |
| Each narrator option uses a separate voice-design instruction: | |
| - **Little Kid:** bright, playful, and full of wonder | |
| - **Big Kid:** confident, youthful, and energetic | |
| - **Playful:** animated delivery and light comic timing | |
| - **Storyteller:** soft, soothing bedtime pacing | |
| - **Grandpa:** warm, patient, and reassuring | |
| - **My Voice:** preserves the uploaded reference speaker while adding clear storytelling delivery | |
| The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens | |
| `BOOM!`; a playful voice gives it energy without becoming harsh. | |
| ### My Voice: zero-shot cloning, not fine-tuning | |
| The deployed custom voice feature does **not** train a new audio model. VoxCPM2 is loaded with its | |
| denoiser enabled and receives the uploaded recording through `reference_wav_path`. It performs | |
| zero-shot voice cloning at inference time. | |
| This distinction is important: | |
| - no user-specific checkpoint is created | |
| - no training job is required | |
| - the user can hear a familiar family voice from one short reference recording | |
| - custom-voice narration is scheduled sequentially after images to avoid concurrent ZeroGPU | |
| contention | |
| Preset narration can run in parallel with illustration generation. Custom cloning is slower, so | |
| the number of cloned sentences is capped to fit the public GPU budget. | |
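The scheduling difference between preset and cloned narration can be sketched with placeholder stages. The cap value, the function names, and the use of a thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CLONED_SENTENCES = 12  # assumed cap; the real limit is not stated


def render_pages(scenes):
    """Placeholder for the illustration stage."""
    return [f"image:{scene}" for scene in scenes]


def narrate(sentences, voice):
    """Placeholder for the narration stage."""
    return [f"{voice}:{sentence}" for sentence in sentences]


def make_book(scenes, sentences, voice):
    if voice == "My Voice":
        # Cloning runs after the images, on a capped number of sentences.
        images = render_pages(scenes)
        audio = narrate(sentences[:MAX_CLONED_SENTENCES], voice)
        return images, audio
    # Preset voices can run alongside illustration work.
    with ThreadPoolExecutor(max_workers=2) as pool:
        image_job = pool.submit(render_pages, scenes)
        audio_job = pool.submit(narrate, sentences, voice)
        return image_job.result(), audio_job.result()


images, audio = make_book(["s1", "s2"], [f"line {i}" for i in range(20)], "My Voice")
print(len(images), len(audio))
```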
| --- | |
| ## 7. Fine-Tuning and Adaptation: What We Built | |
| DoodleBook uses several forms of model adaptation, but they should not be confused. | |
| ### Deployed in the current Space | |
| | Technique | Model | Status | | |
| |---|---|---| | |
| | Theme-specific structured prompting and few-shot story guidance | MiniCPM5-1B | Deployed | | |
| | Canonical-image conditioning from the child's doodle | FLUX.2-klein-4B | Deployed | | |
| | Deterministic seed and character-description identity anchors | FLUX.2-klein-4B | Deployed | | |
| | Voice-design prompting for six narrator styles | VoxCPM2 | Deployed | | |
| | Zero-shot voice cloning from uploaded reference audio | VoxCPM2 | Deployed | | |
| | Img2img semantic redraw for matching coloring pages | FLUX.2-klein-4B | Deployed | | |
| ### Fine-tuning work in the repository | |
| The repository also contains an experimental Kannada narration path referencing | |
| [`sush0401/IndicF5-Kannada-Bedtime-v2`](https://huggingface.co/sush0401/IndicF5-Kannada-Bedtime-v2), | |
| a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored | |
| language-specific expressive narration and reference-voice support. | |
| The current competition Space was simplified to one English storybook flow powered by VoxCPM2, | |
| so the Kannada checkpoint is **not loaded by the deployed `app.py` path**. It is supporting | |
| experimental work, not a dependency of the live demo. | |
| The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style. | |
| No published LoRA is required by the current Space. The live visual consistency comes from | |
| canonical image conditioning, prompt constraints, and deterministic seeds. | |
| This transparent separation makes the deployed result reproducible and avoids overstating | |
| fine-tuning that is not active in the public demo. | |
| --- | |
| ## 8. Coloring Book: Redraw, Do Not Trace | |
| A naive coloring-book implementation applies edge detection to the finished crayon pages. | |
| That preserves every crayon grain, shadow, and background texture, producing noisy outlines | |
| that are unpleasant to color. | |
| DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks | |
| for the same characters, action, emotion, and composition as clean line art. The model performs | |
| a semantic redraw rather than a literal texture trace. | |
| A lightweight local cleanup then: | |
| - converts the result to grayscale | |
| - increases contrast | |
| - thresholds it toward pure black and white | |
| - removes small speckles | |
| If the FLUX line-art pass fails, the application can fall back to local outline extraction. | |
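The four cleanup steps can be illustrated on a tiny image held as nested lists. This dependency-free sketch is not the application's implementation, which would more plausibly use an imaging library; the contrast factor, cutoff, and minimum component size are assumed values.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in rgb_rows]


def stretch_contrast(gray, factor=1.5):
    """Push values away from mid-gray, clamped to 0-255."""
    return [[max(0, min(255, int(128 + (v - 128) * factor))) for v in row] for row in gray]


def threshold(gray, cutoff=128):
    """Return a mask where True means black ink."""
    return [[v < cutoff for v in row] for row in gray]


def remove_speckles(ink, min_size=3):
    """Erase 4-connected ink components smaller than min_size pixels."""
    height, width = len(ink), len(ink[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [row[:] for row in ink]
    for y in range(height):
        for x in range(width):
            if not ink[y][x] or seen[y][x]:
                continue
            stack, component = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < height and 0 <= nx < width and ink[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = False
    return cleaned


def clean_line_art(rgb_rows):
    return remove_speckles(threshold(stretch_contrast(to_gray(rgb_rows))))


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
image = [[WHITE] * 5 for _ in range(5)]
for col in range(3):
    image[0][col] = BLACK  # a short outline stroke
image[4][4] = BLACK        # an isolated speckle
result = clean_line_art(image)
print(result[0], result[4])
```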
| --- | |
| ## 9. PDFs and a Product Children Can Keep | |
| The generated book is not limited to a temporary browser result. DoodleBook creates: | |
| - a browser-readable illustrated book | |
| - a downloadable illustrated story PDF | |
| - narration audio | |
| - an optional browser-readable coloring book | |
| - an optional printable coloring-book PDF | |
| The PDF cover follows the same warm paper-and-crayon visual language as the application. | |
| This matters to the objective: the output becomes something a child can keep, share with family, | |
| read at bedtime, or color away from a screen. | |
| --- | |
| ## 10. ZeroGPU Engineering | |
| The public application runs on Hugging Face Spaces using one ZeroGPU Space. | |
| Important implementation decisions include: | |
| - loading the three primary models at module scope in the ZeroGPU-compatible pattern | |
| - separating story, image, coloring, and TTS work into GPU-decorated functions | |
| - using six-step FLUX generation at guidance scale 1.0 for interactive latency | |
| - running preset narration alongside illustration work | |
| - running custom voice cloning sequentially to avoid GPU contention | |
| - streaming heartbeat status updates during long generation stages | |
| - loading a pre-generated sample book immediately without requiring GPU allocation | |
| - preserving partial usability through story, image, and coloring fallback paths | |
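The heartbeat idea can be sketched as a generator that keeps yielding status text while a slow stage runs in a worker thread. Gradio streams intermediate values from generator functions, so the same shape can drive a status component; `run_with_heartbeat`, `slow_stage`, and the polling interval are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_stage():
    """Placeholder for a long generation stage."""
    time.sleep(0.3)
    return "six pages rendered"


def run_with_heartbeat(stage, label, interval=0.1):
    """Yield status strings while `stage` runs, then yield its result last."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        job = pool.submit(stage)
        started = time.monotonic()
        while not job.done():
            yield f"{label}... {time.monotonic() - started:.1f}s"
            time.sleep(interval)
        yield job.result()


for update in run_with_heartbeat(slow_stage, "Drawing pages"):
    print(update)
```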
| The UI is a custom Gradio Blocks interface rather than a default component stack. It uses | |
| paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining | |
| usable on desktop and mobile. | |
| --- | |
| ## 11. Open Trace and Reproducibility | |
| Every generated book exposes a trace containing: | |
| - selected hero name | |
| - selected theme | |
| - selected narrator | |
| - coloring-book selection | |
| - model backend | |
| - character description | |
| - deterministic seed | |
| - story, image, narration, PDF, and coloring timings | |
| - rendering engine | |
| - surfaced load or generation errors | |
| - fallback details | |
| This serves two audiences. Developers can understand performance and failure modes, while judges | |
| and users can see that the result came from a reproducible pipeline rather than a hidden manual | |
| process. | |
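A trace of this kind can be collected with a small record and a timing context manager. The sketch below is an assumption about structure, not the application's trace format: `BookTrace`, its field names, and the `stage` helper are hypothetical and cover only part of the list above.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class BookTrace:
    hero_name: str
    theme: str
    narrator: str
    coloring_book: bool
    seed: int
    timings: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)
    fallbacks: list = field(default_factory=list)

    @contextmanager
    def stage(self, name):
        """Time a stage and record any error before re-raising it."""
        started = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors.append(f"{name}: {exc}")
            raise
        finally:
            self.timings[name] = round(time.perf_counter() - started, 3)


trace = BookTrace("Momo", "lost and found", "Grandpa", True, 1234)
with trace.stage("story"):
    pass  # the story stage would run here
try:
    with trace.stage("coloring"):
        raise RuntimeError("line-art pass failed")
except RuntimeError:
    trace.fallbacks.append("coloring: local outline extraction")

print(json.dumps(asdict(trace), indent=2))
```

Recording the error before re-raising lets the caller choose the fallback while the trace still shows what went wrong.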
| --- | |
| ## 12. Why the Model Stack Is Small but Effective | |
| DoodleBook's full stack is **7B parameters**, and its authoring plus performance core is only | |
| **3B**: | |
| - MiniCPM5-1B decides what happens and how each page should look. | |
| - VoxCPM2 performs the authored story and optionally adapts to a family voice. | |
| - FLUX.2-klein-4B renders the visual plan. | |
| This division of responsibility is the project's **Tiny Titan** argument. A large monolithic | |
| model is not necessary when small specialist models communicate through a precise intermediate | |
| representation. | |
| The JSON story object is that representation. It connects narrative text, character identity, | |
| page scenes, narration, browser rendering, and PDF export. | |
| --- | |
| ## 13. Hackathon Contributions | |
| ### Tiny Titan | |
| The complete application uses 7B parameters, with a 3B story-and-voice core. | |
| ### OpenBMB | |
| MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and | |
| provides the zero-shot custom voice feature. | |
| ### Black Forest Labs | |
| FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the | |
| optional coloring pages. | |
| ### Hugging Face | |
| Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and | |
| the Hub stores the code, article, demo assets, model metadata, and reproducible commit history. | |
| ### Codex | |
| Codex was used as the coding and documentation agent across the project. It helped implement | |
| and debug the app, refine prompts, update the story/image/audio generation instructions, prepare | |
| the technical article, maintain the README, create deployment commits, and publish updates to | |
| the Hugging Face Space. | |
| ### Open Trace | |
| Generation settings, timings, model state, and fallback information are visible in the app. | |
| ### Off-Brand | |
| The product has a custom scrapbook and crayon identity rather than a default Gradio appearance. | |
| ### Field Notes | |
| The repository includes a dedicated technical write-up on cross-page identity: | |
| [Cross-Page Character Consistency Without Per-User Training](blog.md). | |
| --- | |
| ## 14. Limitations | |
| - Character consistency is strong but not mathematically guaranteed; complex doodles can drift. | |
| - Six FLUX illustrations plus optional coloring redraws remain the longest stage. | |
| - Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording. | |
| - The current public UI is English-only. | |
| - The story validator can verify structure more easily than literary quality. | |
| - The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA. | |
| These are concrete engineering boundaries, not hidden behavior. | |
| --- | |
| ## 15. Roadmap | |
| The next technically justified improvements are: | |
| 1. Add automated story-quality scoring and one controlled rewrite pass. | |
| 2. Add stronger visual identity evaluation between the canonical hero and each page. | |
| 3. Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline. | |
| 4. Reintroduce multilingual narration only after the language-specific path meets the same | |
| reliability and latency requirements as the English Space. | |
| 5. Add explicit consent and retention controls around uploaded family voice references. | |
| 6. Add a short shareable video export combining page turns, narration, and captions. | |
| --- | |
| ## 16. Demo Video | |
| The published demo video is available here: | |
| **[Watch the MP4 demo](demo-doodlebook.mp4)** | |
| **[Open the Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)** | |
| **[View the X/Twitter post](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)** | |
| It walks through the complete process: | |
| 1. Show the original child's doodle. | |
| 2. Select the hero name, a theme, and a narrator. | |
| 3. Start generation and show the streamed progress. | |
| 4. Play a short narration excerpt. | |
| 5. Reveal the same hero across several illustrated pages. | |
| 6. Download the story PDF. | |
| 7. Show the matching coloring-book page. | |
| 8. End on the 7B model stack and live Space URL. | |
| The video is linked from both this article and the Space README. | |
| --- | |
| ## Conclusion | |
| DoodleBook is not simply text generation followed by image generation. It is one coordinated | |
| multimodal artifact built around a child's idea. | |
| MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual | |
| identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar | |
| family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience | |
| publicly accessible. | |
| The result is small enough to fit the hackathon's technical constraints, but complete enough to | |
| feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own. | |