# DoodleBook: Turning a Child's Drawing into a Narrated, Illustrated Storybook

**Build Small Hackathon 2026 - Adventure in Thousand Token Wood**

**Live demo:** [build-small-hackathon/DoodleBook](https://huggingface.co/spaces/build-small-hackathon/DoodleBook)

**Demo video:** [MP4 demo](demo-doodlebook.mp4) and [Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)

**Social post:** [X/Twitter announcement](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)

**Source:** [github.com/Sushruths04/Doodle-book](https://github.com/Sushruths04/Doodle-book)

**License:** Apache-2.0

---

## Abstract

DoodleBook is a multimodal creative-learning application that turns one child's drawing into a complete six-page picture book. A child uploads a doodle, names the hero, selects one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate story, converts the doodle into a consistent storybook character, illustrates every page, narrates the story, exports a printable PDF, and can also produce a matching coloring book.

The complete deployed model stack is only **7B parameters**:

| Role | Model | Parameters |
|---|---|---:|
| Story author and scene planner | [`openbmb/MiniCPM5-1B`](https://huggingface.co/openbmb/MiniCPM5-1B) | 1B |
| Expressive narration and zero-shot voice cloning | [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) | 2B |
| Character design, illustrations, and coloring-page redraws | [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) | 4B |

The central product idea is that the child's drawing is not merely an input image. It is the identity anchor for the whole experience. The same hero appears in the narrative, the six illustrations, the narration, the story PDF, and the optional coloring book.

---

## 1. Objective

Children rarely see their own imperfect drawings treated as finished creative work.
Generative applications can produce polished images, but they often replace the child's idea rather than extend it. DoodleBook has a different objective:

> Preserve the personality of a child's drawing and make that drawing the hero of a
> meaningful story the child can read, hear, keep, print, and color.

That objective creates five technical requirements:

1. The story must be joyful, understandable, and emotionally meaningful for a young child.
2. The selected theme must influence the actual lesson and plot, not just appear as a label.
3. The hero must remain visually recognizable across all six illustrations.
4. The selected narrator or uploaded family voice must match the final audio.
5. The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints.

---

## 2. The User Experience

The current Space deliberately keeps the input small:

1. Upload or photograph a drawing.
2. Enter the hero's name.
3. Select one of ten story themes.
4. Select a narrator, including the optional **My Voice** mode.
5. Choose whether to generate a matching coloring book.
6. Select **Make my book**.

The available themes are:

- brave adventure
- making a new friend
- overcoming a fear
- helping someone
- lost and found
- learning something new
- kindness to animals
- the magic of imagination
- celebrating who you are
- a rainy day adventure

The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice. For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice cloning from that reference.

The result is revealed in a deliberate order:

1. Narration audio becomes available first.
2. The illustrated story PDF download appears next.
3. The full illustrated pages are revealed last.
4. If requested, the coloring-book preview and PDF are also produced.

This order gives the child something to hear as soon as possible while the heavier document and page-rendering work completes.

---

## 3. End-to-End Architecture

```text
Child's doodle + hero name + theme + narrator
                     |
                     v
        MiniCPM5-1B story generation
      title + character + text + scenes
              /              \
             /                \
            v                  v
  FLUX canonical hero     VoxCPM2 narration
            |             preset or cloned voice
            v
  FLUX six page renders
            |
    +-------+--------+
    |                |
    v                v
Illustrated PDF   FLUX line-art redraw
                     |
                     v
              Coloring-book PDF
```

The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces. The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy stages are isolated behind `@spaces.GPU` functions.

---

## 4. MiniCPM5-1B: Story Author and Visual Planner

MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing:

- a short book title
- a reusable visual description of the hero
- six page texts
- six corresponding illustration scene descriptions

Each scene description is consumed directly by the image pipeline, so the language model acts as both **author** and **visual director**.

### Theme-conditioned meaning

Every selectable theme has its own narrative guidance. For example:

- **Overcoming a fear** treats fear respectfully and shows courage as a supported small step.
- **Making a new friend** rewards listening, sharing, or a sincere hello rather than popularity.
- **Learning something new** includes an imperfect attempt, guidance, practice, and improvement.
- **Celebrating who you are** focuses on self-acceptance without making the hero superior.
- **Kindness to animals** emphasizes gentle, age-appropriate care and trusted adult support.

This prevents ten buttons from producing the same generic adventure with different titles.

### Prompt design for a small model

The prompt is detailed but organized into a small number of explicit responsibilities:

1. **Story quality:** one coherent arc, natural read-aloud language, dialogue, humor, and purposeful sound effects such as `BOOM!`, `WHOOSH!`, or `SPLASH!`.
2. **Emotional meaning:** the lesson must follow from what the hero actually chooses and does.
3. **Page progression:** introduction, first attempt, complication, realization, action, and emotional payoff.
4. **Visual continuity:** the hero remains active on every page and retains the same memorable physical traits.
5. **Illustration planning:** each scene describes one drawable moment with visible characters, action, setting, props, light, color mood, and emotion.
6. **Strict output:** valid JSON only, with exactly six page objects.

A complete six-page exemplar establishes the quality target. The example demonstrates warmth, continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through action rather than attached as an unrelated final sentence.

### Reliability

The output parser extracts and repairs JSON when possible. If language-model generation fails, the application has local theme-based story arcs so the user still receives a book rather than a broken session. Errors and fallback information are exposed in the trace instead of being hidden.

---

## 5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages

The hardest visual problem is cross-page character consistency. Six independent text-to-image requests tend to produce six different heroes. Color, clothing, face shape, body proportions, and accessories can drift enough to break the feeling that this is one book.

DoodleBook solves this without training a new model for every child.

### Stage 1: canonical-character generation

The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to:

- preserve the creature or person type
- preserve face, body shape, colors, markings, clothing, and accessories
- clarify unclear lines without replacing the child's idea
- produce one full-body hero on a neutral background
- avoid extra characters, scenery, text, or duplicate views

This output becomes a canonical model sheet for the book.
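
One way to keep the Stage 1 constraints auditable is to hold them as data and assemble the prompt from them. The following is a minimal sketch under that assumption, not the deployed prompt: the names `PRESERVE`, `AVOID`, and `build_canonical_prompt`, and all of the prompt wording, are hypothetical and only paraphrase the bullets above.

```python
"""Sketch: assembling a canonical-hero img2img prompt from explicit constraints."""

# Hypothetical constraint lists paraphrased from the Stage 1 bullets.
# The wording is illustrative; the deployed prompt text is not reproduced here.
PRESERVE = (
    "the creature or person type",
    "face and body shape",
    "colors and markings",
    "clothing and accessories",
)
AVOID = ("extra characters", "scenery", "text", "duplicate views")


def build_canonical_prompt(preserve=PRESERVE, avoid=AVOID) -> str:
    """Return one prompt string stating what to keep and what to leave out."""
    keep_clause = "; ".join(f"preserve {item}" for item in preserve)
    avoid_clause = ", ".join(avoid)
    return (
        "Redraw this child's doodle as one full-body storybook hero "
        "on a neutral background. "
        f"Keep the child's idea intact: {keep_clause}. "
        "Clarify unclear lines without replacing the design. "
        f"Do not add {avoid_clause}."
    )


if __name__ == "__main__":
    # The resulting string would be passed to the img2img call along with the doodle.
    print(build_canonical_prompt())
```

Keeping the constraints in tuples rather than in one long string makes it easier to review which identity traits are locked and to reuse the same list when page prompts are built later.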
### Stage 2: reference-conditioned page generation

Each story scene is rendered using the same canonical hero image. The page prompt explicitly locks:

- face and species
- body proportions
- colors and markings
- clothing and accessories
- child-safe crayon storybook style

The scene still changes from page to page, but the identity reference remains constant. Deterministic seeds provide reproducibility while page-specific seed offsets allow variation.

If the user does not upload a doodle, the structured character description from MiniCPM5-1B becomes the text-based identity anchor.

### Why this approach matters

Per-user LoRA training would add a training job before every book. That is too slow and operationally expensive for an interactive children's application. Canonical image conditioning provides personalization at inference time, using the child's actual drawing as the reference.

---

## 6. VoxCPM2: Expressive Narration and Custom Family Voices

VoxCPM2 converts the title and all six page texts into one narrated story. The application splits the text into sentences, generates speech for each sentence, and inserts short pauses between them so the result sounds like a book being read rather than one continuous block.

### Narrator presets

Each narrator option uses a separate voice-design instruction:

- **Little Kid:** bright, playful, and full of wonder
- **Big Kid:** confident, youthful, and energetic
- **Playful:** animated delivery and light comic timing
- **Storyteller:** soft, soothing bedtime pacing
- **Grandpa:** warm, patient, and reassuring
- **My Voice:** preserves the uploaded reference speaker while adding clear storytelling delivery

The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens `BOOM!`; a playful voice gives it energy without becoming harsh.

### My Voice: zero-shot cloning, not fine-tuning

The deployed custom voice feature does **not** train a new audio model.
VoxCPM2 is loaded with its denoiser enabled and receives the uploaded recording through `reference_wav_path`. It performs zero-shot voice cloning at inference time.

This distinction is important:

- no user-specific checkpoint is created
- no training job is required
- the user can hear a familiar family voice from one short reference recording
- custom-voice narration is scheduled sequentially after images to avoid concurrent ZeroGPU contention

Preset narration can run in parallel with illustration generation. Custom cloning is slower, so the number of cloned sentences is capped to fit the public GPU budget.

---

## 7. Fine-Tuning and Adaptation: What We Built

DoodleBook uses several forms of model adaptation, but they should not be confused.

### Deployed in the current Space

| Technique | Model | Status |
|---|---|---|
| Theme-specific structured prompting and few-shot story guidance | MiniCPM5-1B | Deployed |
| Canonical-image conditioning from the child's doodle | FLUX.2-klein-4B | Deployed |
| Deterministic seed and character-description identity anchors | FLUX.2-klein-4B | Deployed |
| Voice-design prompting for six narrator styles | VoxCPM2 | Deployed |
| Zero-shot voice cloning from uploaded reference audio | VoxCPM2 | Deployed |
| Img2img semantic redraw for matching coloring pages | FLUX.2-klein-4B | Deployed |

### Fine-tuning work in the repository

The repository also contains an experimental Kannada narration path referencing [`sush0401/IndicF5-Kannada-Bedtime-v2`](https://huggingface.co/sush0401/IndicF5-Kannada-Bedtime-v2), a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored language-specific expressive narration and reference-voice support.

The current competition Space was simplified to one English storybook flow powered by VoxCPM2, so the Kannada checkpoint is **not loaded by the deployed `app.py` path**. It is supporting experimental work, not a dependency of the live demo.

The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style. No published LoRA is required by the current Space. The live visual consistency comes from canonical image conditioning, prompt constraints, and deterministic seeds.

This transparent separation makes the deployed result reproducible and avoids overstating fine-tuning that is not active in the public demo.

---

## 8. Coloring Book: Redraw, Do Not Trace

A naive coloring-book implementation applies edge detection to the finished crayon pages. That preserves every crayon grain, shadow, and background texture, producing noisy outlines that are unpleasant to color.

DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks for the same characters, action, emotion, and composition as clean line art. The model performs a semantic redraw rather than a literal texture trace.

A lightweight local cleanup then:

- converts the result to grayscale
- increases contrast
- thresholds it toward pure black and white
- removes small speckles

If the FLUX line-art pass fails, the application can fall back to local outline extraction.

---

## 9. PDFs and a Product Children Can Keep

The generated book is not limited to a temporary browser result. DoodleBook creates:

- a browser-readable illustrated book
- a downloadable illustrated story PDF
- narration audio
- an optional browser-readable coloring book
- an optional printable coloring-book PDF

The PDF cover follows the same warm paper-and-crayon visual language as the application. This matters to the objective: the output becomes something a child can keep, share with family, read at bedtime, or color away from a screen.

---

## 10. ZeroGPU Engineering

The public application runs on Hugging Face Spaces using one ZeroGPU Space.
Important implementation decisions include:

- loading the three primary models at module scope in the ZeroGPU-compatible pattern
- separating story, image, coloring, and TTS work into GPU-decorated functions
- using six-step FLUX generation at guidance scale 1.0 for interactive latency
- running preset narration alongside illustration work
- running custom voice cloning sequentially to avoid GPU contention
- streaming heartbeat status updates during long generation stages
- loading a pre-generated sample book immediately without requiring GPU allocation
- preserving partial usability through story, image, and coloring fallback paths

The UI is a custom Gradio Blocks interface rather than a default component stack. It uses paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining usable on desktop and mobile.

---

## 11. Open Trace and Reproducibility

Every generated book exposes a trace containing:

- selected hero name
- selected theme
- selected narrator
- coloring-book selection
- model backend
- character description
- deterministic seed
- story, image, narration, PDF, and coloring timings
- rendering engine
- surfaced load or generation errors
- fallback details

This serves two audiences. Developers can understand performance and failure modes, while judges and users can see that the result came from a reproducible pipeline rather than a hidden manual process.

---

## 12. Why the Model Stack Is Small but Effective

DoodleBook's full stack is **7B parameters**, and its authoring plus performance core is only **3B**:

- MiniCPM5-1B decides what happens and how each page should look.
- VoxCPM2 performs the authored story and optionally adapts to a family voice.
- FLUX.2-klein-4B renders the visual plan.

This division of responsibility is the project's **Tiny Titan** argument. A large monolithic model is not necessary when small specialist models communicate through a precise intermediate representation.
The JSON story object is that representation. It connects narrative text, character identity, page scenes, narration, browser rendering, and PDF export.

---

## 13. Hackathon Contributions

### Tiny Titan

The complete application uses 7B parameters, with a 3B story-and-voice core.

### OpenBMB

MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and provides the zero-shot custom voice feature.

### Black Forest Labs

FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the optional coloring pages.

### Hugging Face

Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and the Hub stores the code, article, demo assets, model metadata, and reproducible commit history.

### Codex

Codex was used as the coding and documentation agent across the project. It helped implement and debug the app, refine prompts, update the story/image/audio generation instructions, prepare the technical article, maintain the README, create deployment commits, and publish updates to the Hugging Face Space.

### Open Trace

Generation settings, timings, model state, and fallback information are visible in the app.

### Off-Brand

The product has a custom scrapbook and crayon identity rather than a default Gradio appearance.

### Field Notes

The repository includes a dedicated technical write-up on cross-page identity: [Cross-Page Character Consistency Without Per-User Training](blog.md).

---

## 14. Limitations

- Character consistency is strong but not mathematically guaranteed; complex doodles can drift.
- Six FLUX illustrations plus optional coloring redraws remain the longest stage.
- Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording.
- The current public UI is English-only.
- The story validator can verify structure more easily than literary quality.
- The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA.

These are concrete engineering boundaries, not hidden behavior.

---

## 15. Roadmap

The next technically justified improvements are:

1. Add automated story-quality scoring and one controlled rewrite pass.
2. Add stronger visual identity evaluation between the canonical hero and each page.
3. Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline.
4. Reintroduce multilingual narration only after the language-specific path meets the same reliability and latency requirements as the English Space.
5. Add explicit consent and retention controls around uploaded family voice references.
6. Add a short shareable video export combining page turns, narration, and captions.

---

## 16. Demo Video

The published demo video is available here:

**[Watch the MP4 demo](demo-doodlebook.mp4)**

**[Open the Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)**

**[View the X/Twitter post](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)**

It walks through the complete process:

1. Show the original child's doodle.
2. Select the hero name, a theme, and a narrator.
3. Start generation and show the streamed progress.
4. Play a short narration excerpt.
5. Reveal the same hero across several illustrated pages.
6. Download the story PDF.
7. Show the matching coloring-book page.
8. End on the 7B model stack and live Space URL.

The video is linked from both this article and the Space README.

---

## Conclusion

DoodleBook is not simply text generation followed by image generation. It is one coordinated multimodal artifact built around a child's idea. MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience publicly accessible.

The result is small enough to fit the hackathon's technical constraints, but complete enough to feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own.