	# DoodleBook: Turning a Child's Drawing into a Narrated, Illustrated Storybook

	Build Small Hackathon 2026 - Adventure in Thousand Token Wood

	Live demo: [build-small-hackathon/DoodleBook](https://huggingface.co/spaces/build-small-hackathon/DoodleBook)

	Demo video: [MP4 demo](demo-doodlebook.mp4) and [Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)

	Social post: [X/Twitter announcement](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)

	Source: [github.com/Sushruths04/Doodle-book](https://github.com/Sushruths04/Doodle-book)

	License: Apache-2.0

	---

	## Abstract

	DoodleBook is a multimodal creative-learning application that turns one child's drawing
	into a complete six-page picture book. A child uploads a doodle, names the hero, selects
	one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate
	story, converts the doodle into a consistent storybook character, illustrates every page,
	narrates the story, exports a printable PDF, and can also produce a matching coloring book.

	The complete deployed model stack totals 7B parameters (1B + 2B + 4B):

	| Role | Model | Parameters |
	|---|---|---:|
	| Story author and scene planner | [`openbmb/MiniCPM5-1B`](https://huggingface.co/openbmb/MiniCPM5-1B) | 1B |
	| Expressive narration and zero-shot voice cloning | [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) | 2B |
	| Character design, illustrations, and coloring-page redraws | [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) | 4B |

	The central product idea is that the child's drawing is not merely an input image. It is
	the identity anchor for the whole experience. The same hero appears in the narrative, the
	six illustrations, the narration, the story PDF, and the optional coloring book.

	---

	## 1. Objective

	Children rarely see their own imperfect drawings treated as finished creative work.
	Generative applications can produce polished images, but they often replace the child's
	idea rather than extend it. DoodleBook has a different objective:

	> Preserve the personality of a child's drawing and make that drawing the hero of a
	> meaningful story the child can read, hear, keep, print, and color.

	That objective creates five technical requirements:

	1. The story must be joyful, understandable, and emotionally meaningful for a young child.
	2. The selected theme must influence the actual lesson and plot, not just appear as a label.
	3. The hero must remain visually recognizable across all six illustrations.
	4. The final audio must match the selected narrator or the uploaded family voice.
	5. The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints.

	---

	## 2. The User Experience

	The current Space deliberately keeps the input small:

	1. Upload or photograph a drawing.
	2. Enter the hero's name.
	3. Select one of ten story themes.
	4. Select a narrator, including the optional My Voice mode.
	5. Choose whether to generate a matching coloring book.
	6. Select "Make my book".

	The available themes are:

	- brave adventure
	- making a new friend
	- overcoming a fear
	- helping someone
	- lost and found
	- learning something new
	- kindness to animals
	- the magic of imagination
	- celebrating who you are
	- a rainy day adventure

	The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice.
	For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice
	cloning from that reference.

	The result is revealed in a deliberate order:

	1. Narration audio becomes available first.
	2. The illustrated story PDF download appears next.
	3. The full illustrated pages are revealed last.
	4. If requested, the coloring-book preview and PDF are also produced.

	This order gives the child something to hear as soon as possible while the heavier document
	and page-rendering work completes.
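
The reveal order above maps naturally onto a generator, because Gradio event handlers may be generator functions that yield intermediate outputs. The following is a minimal sketch under that assumption, not the Space's actual handler: `reveal_in_order` and its four callables are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Iterator, Optional


def reveal_in_order(
    make_audio: Callable[[], str],
    make_pdf: Callable[[], str],
    make_pages: Callable[[], list],
    make_coloring: Optional[Callable[[], str]] = None,
) -> Iterator[Dict[str, object]]:
    """Yield a growing result so the UI can show each artifact as soon as it exists."""
    result: Dict[str, object] = {}

    result["audio"] = make_audio()          # 1. narration first
    yield dict(result)

    result["story_pdf"] = make_pdf()        # 2. then the story PDF
    yield dict(result)

    result["pages"] = make_pages()          # 3. then the illustrated pages
    yield dict(result)

    if make_coloring is not None:           # 4. coloring book only if requested
        result["coloring_pdf"] = make_coloring()
        yield dict(result)
```

Each yielded dictionary is a snapshot, so a consumer can update only the components whose keys are present.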

	---

	## 3. End-to-End Architecture

	```text
	    Child's doodle + hero name + theme + narrator
	                          |
	                          v
	            MiniCPM5-1B story generation
	          title + character + text + scenes
	                    /            \
	                 /                  \
	               v                      v
	      FLUX canonical hero     VoxCPM2 narration
	               |           preset or cloned voice
	               v
	      FLUX six page renders
	               |
	       +-------+--------+
	       |                |
	       v                v
	Illustrated PDF   FLUX line-art redraw
	                        |
	                        v
	                Coloring-book PDF
	```

	The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces.
	The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy
	stages are isolated behind `@spaces.GPU` functions.
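
As an illustration of that pattern, the sketch below loads placeholder models at import time and wraps one GPU-heavy function with the `spaces.GPU` decorator. It is a simplified sketch, not the deployed `app.py`: `load_model` and `render_page` are hypothetical stand-ins, and the `ImportError` fallback exists only so the example runs where the `spaces` package is not installed.

```python
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local fallback (assumption for this sketch): a no-op decorator.
    def gpu(fn):
        return fn


def load_model(name: str) -> dict:
    """Hypothetical loader standing in for the real model-loading calls."""
    return {"name": name}


# Module scope: models are loaded once at import time, not per request.
STORY_MODEL = load_model("openbmb/MiniCPM5-1B")
IMAGE_MODEL = load_model("black-forest-labs/FLUX.2-klein-4B")


@gpu
def render_page(prompt: str, seed: int) -> dict:
    """GPU-heavy stage: on ZeroGPU, a GPU is attached only while this runs."""
    return {"model": IMAGE_MODEL["name"], "prompt": prompt, "seed": seed}
```

Keeping each heavy stage in its own decorated function means a GPU is requested only for the duration of that stage.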

	---

	## 4. MiniCPM5-1B: Story Author and Visual Planner

	MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing:

	- a short book title
	- a reusable visual description of the hero
	- six page texts
	- six corresponding illustration scene descriptions

	Each scene description is consumed directly by the image pipeline, so the language model
	acts as both author and visual director.
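
One way to picture the plan is as a small typed structure with a strict six-page check. The field names below (`title`, `character`, `pages`, `text`, `scene`) are assumptions chosen to mirror the four items listed above; the real JSON keys may differ.

```python
from dataclasses import dataclass
from typing import List

PAGE_COUNT = 6


@dataclass
class Page:
    text: str   # what is read aloud on the page
    scene: str  # what the image model is asked to draw


@dataclass
class StoryPlan:
    title: str
    character: str  # reusable visual description of the hero
    pages: List[Page]


def plan_from_dict(data: dict) -> StoryPlan:
    """Build a StoryPlan and reject anything that is not exactly six full pages."""
    pages = [
        Page(text=str(p["text"]).strip(), scene=str(p["scene"]).strip())
        for p in data["pages"]
    ]
    if len(pages) != PAGE_COUNT:
        raise ValueError(f"expected {PAGE_COUNT} pages, got {len(pages)}")
    if any(not p.text or not p.scene for p in pages):
        raise ValueError("every page needs both text and a scene description")
    return StoryPlan(
        title=str(data["title"]).strip(),
        character=str(data["character"]).strip(),
        pages=pages,
    )
```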

	### Theme-conditioned meaning

	Every selectable theme has its own narrative guidance. For example:

	- Overcoming a fear treats fear respectfully and shows courage as a supported small step.
	- Making a new friend rewards listening, sharing, or a sincere hello rather than popularity.
	- Learning something new includes an imperfect attempt, guidance, practice, and improvement.
	- Celebrating who you are focuses on self-acceptance without making the hero superior.
	- Kindness to animals emphasizes gentle, age-appropriate care and trusted adult support.

	This prevents ten buttons from producing the same generic adventure with different titles.
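
A simple way to implement theme conditioning is a lookup table whose entry is injected into the prompt. The sketch below assumes that design; the guidance strings paraphrase the examples above, and `THEME_GUIDANCE`, `DEFAULT_GUIDANCE`, and `build_story_prompt` are illustrative names rather than the application's own.

```python
THEME_GUIDANCE = {
    "overcoming a fear": "Treat the fear with respect and show courage as one small, supported step.",
    "making a new friend": "Reward listening, sharing, or a sincere hello rather than popularity.",
    "learning something new": "Include an imperfect attempt, guidance, practice, and improvement.",
}

# Used when a theme has no dedicated entry in this shortened table.
DEFAULT_GUIDANCE = "Let the lesson follow from what the hero chooses and does."


def build_story_prompt(hero_name: str, theme: str) -> str:
    """Combine the hero, the theme, and its guidance into one instruction block."""
    guidance = THEME_GUIDANCE.get(theme, DEFAULT_GUIDANCE)
    return "\n".join(
        [
            f"Write a six-page picture book about a hero named {hero_name}.",
            f"Theme: {theme}.",
            f"Theme guidance: {guidance}",
            "Return valid JSON only, with exactly six page objects.",
        ]
    )
```

Because the guidance text changes with the theme, two themes produce different instructions for the plot and lesson, not just a different label.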

	### Prompt design for a small model

	The prompt is detailed but organized into a small number of explicit responsibilities:

	1. Story quality: one coherent arc, natural read-aloud language, dialogue, humor, and
	purposeful sound effects such as `BOOM!`, `WHOOSH!`, or `SPLASH!`.
	2. Emotional meaning: the lesson must follow from what the hero actually chooses and does.
	3. Page progression: introduction, first attempt, complication, realization, action,
	and emotional payoff.
	4. Visual continuity: the hero remains active on every page and retains the same memorable
	physical traits.
	5. Illustration planning: each scene describes one drawable moment with visible characters,
	action, setting, props, light, color mood, and emotion.
	6. Strict output: valid JSON only, with exactly six page objects.

	A complete six-page exemplar establishes the quality target. The example demonstrates warmth,
	continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through
	action rather than attached as an unrelated final sentence.

	### Reliability

	The output parser extracts and repairs JSON when possible. If language-model generation fails,
	the application has local theme-based story arcs so the user still receives a book rather than
	a broken session. Errors and fallback information are exposed in the trace instead of being
	hidden.
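
The extract, repair, and fall-back sequence could look roughly like the sketch below. It assumes a `pages` key and handles only one common defect (trailing commas); the real parser may repair more cases, and the function names and trace keys are hypothetical.

```python
import json
import re
from typing import Optional


def extract_json(raw: str) -> Optional[dict]:
    """Pull the outermost {...} span out of model output and try to parse it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start : end + 1]
    # Light repair: drop trailing commas before } or ].
    # Caveat: this would also alter such a sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def story_or_fallback(raw: str, fallback: dict, trace: dict) -> dict:
    """Return the parsed story, or the local fallback with the reason recorded."""
    story = extract_json(raw)
    pages = story.get("pages") if story is not None else None
    if not isinstance(pages, list) or len(pages) != 6:
        trace["story_fallback"] = True
        trace["story_error"] = "model output was not a valid six-page JSON plan"
        return fallback
    trace["story_fallback"] = False
    return story
```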

	---

	## 5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages

	The hardest visual problem is cross-page character consistency. Six independent text-to-image
	requests tend to produce six different heroes. Color, clothing, face shape, body proportions,
	and accessories can drift enough to break the feeling that this is one book.

	DoodleBook addresses this without training a new model for every child.

	### Stage 1: canonical-character generation

	The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to:

	- preserve the creature or person type
	- preserve face, body shape, colors, markings, clothing, and accessories
	- clarify unclear lines without replacing the child's idea
	- produce one full-body hero on a neutral background
	- avoid extra characters, scenery, text, or duplicate views

	This output becomes a canonical model sheet for the book.

	### Stage 2: reference-conditioned page generation

	Each story scene is rendered using the same canonical hero image. The page prompt explicitly
	locks:

	- face and species
	- body proportions
	- colors and markings
	- clothing and accessories
	- child-safe crayon storybook style

	The scene still changes from page to page, but the identity reference remains constant.
	Deterministic seeds provide reproducibility while page-specific seed offsets allow variation.
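
The seed scheme and identity lock can be sketched as follows. Deriving the base seed from a hash of the hero name and theme is an assumption made for this example; the source states only that seeds are deterministic and offset per page. The prompt wording and the names `base_seed` and `page_requests` are likewise illustrative.

```python
import hashlib
from typing import Dict, List

IDENTITY_LOCKS = (
    "face and species",
    "body proportions",
    "colors and markings",
    "clothing and accessories",
)


def base_seed(hero_name: str, theme: str) -> int:
    """Stable 32-bit seed; hashlib is used because built-in hash() varies per process."""
    digest = hashlib.sha256(f"{hero_name}|{theme}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def page_requests(character: str, scenes: List[str], seed: int) -> List[Dict[str, object]]:
    """One request per scene: same identity lock, a different seed offset per page."""
    lock = "Keep the hero's " + ", ".join(IDENTITY_LOCKS) + " identical to the reference image."
    return [
        {
            "prompt": f"{scene} Hero: {character}. {lock} Child-safe crayon storybook style.",
            "seed": seed + page_index,
        }
        for page_index, scene in enumerate(scenes)
    ]
```

Re-running with the same inputs reproduces the same six seeds, while the offset keeps pages from sharing one noise pattern.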

	If the user does not upload a doodle, the structured character description from MiniCPM5-1B
	becomes the text-based identity anchor.

	### Why this approach matters

	Per-user LoRA training would add a training job before every book. That is too slow and
	operationally expensive for an interactive children's application. Canonical image conditioning
	provides personalization at inference time, using the child's actual drawing as the reference.

	---

	## 6. VoxCPM2: Expressive Narration and Custom Family Voices

	VoxCPM2 converts the title and all six page texts into one narrated story. The application
	splits the text into sentences, generates speech for each sentence, and inserts short pauses
	between them so the result sounds like a book being read rather than one continuous block.
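
The split-and-pause step can be sketched with the standard library alone. The sample rate, pause length, and the `synthesize` callable are assumptions for illustration; in the application the per-sentence audio would come from VoxCPM2.

```python
import re
from typing import Callable, List

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch
PAUSE_MS = 350        # assumed pause between sentences


def split_sentences(text: str) -> List[str]:
    """Split after ., !, or ? followed by whitespace, so 'BOOM!' stays its own sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def narrate(
    title: str,
    pages: List[str],
    synthesize: Callable[[str], List[float]],
) -> List[float]:
    """Synthesize sentence by sentence and join the clips with short silences."""
    pause = [0.0] * (SAMPLE_RATE * PAUSE_MS // 1000)
    audio: List[float] = []
    for block in [title] + pages:
        for sentence in split_sentences(block):
            if audio:
                audio.extend(pause)
            audio.extend(synthesize(sentence))
    return audio
```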

	### Narrator presets

	Each narrator option uses a separate voice-design instruction:

	- Little Kid: bright, playful, and full of wonder
	- Big Kid: confident, youthful, and energetic
	- Playful: animated delivery and light comic timing
	- Storyteller: soft, soothing bedtime pacing
	- Grandpa: warm, patient, and reassuring
	- My Voice: preserves the uploaded reference speaker while adding clear storytelling delivery

	The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens
	`BOOM!`; a playful voice gives it energy without becoming harsh.

	### My Voice: zero-shot cloning, not fine-tuning

	The deployed custom voice feature does not train a new audio model. VoxCPM2 is loaded with its
	denoiser enabled and receives the uploaded recording through `reference_wav_path`. It performs
	zero-shot voice cloning at inference time.

	This distinction is important:

	- no user-specific checkpoint is created
	- no training job is required
	- the user can hear a familiar family voice from one short reference recording
	- custom-voice narration is scheduled sequentially after image generation so the two stages
	do not contend for the same ZeroGPU allocation

	Preset narration can run in parallel with illustration generation. Custom cloning is slower, so
	the number of cloned sentences is capped to fit the public GPU budget.
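
The scheduling rule can be sketched as below: preset voices run alongside page rendering, while My Voice runs afterwards on a capped sentence list. The cap value and the names `run_book`, `render_pages`, and `narrate` are assumptions; the source does not state the actual limit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

MAX_CLONED_SENTENCES = 12  # assumed cap for this sketch


def run_book(
    sentences: List[str],
    narrator: str,
    render_pages: Callable[[], list],
    narrate: Callable[[List[str]], object],
) -> Tuple[list, object]:
    """Return (pages, audio) using the schedule that fits the narrator type."""
    if narrator == "My Voice":
        # Sequential: images first, then cloning on a capped sentence list.
        pages = render_pages()
        audio = narrate(sentences[:MAX_CLONED_SENTENCES])
        return pages, audio

    # Preset voice: narration and illustration overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(narrate, sentences)
        pages_future = pool.submit(render_pages)
        return pages_future.result(), audio_future.result()
```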

	---

	## 7. Fine-Tuning and Adaptation: What We Built

	DoodleBook uses several forms of model adaptation, but they should not be confused.

	### Deployed in the current Space

	| Technique | Model | Status |
	|---|---|---|
	| Theme-specific structured prompting and few-shot story guidance | MiniCPM5-1B | Deployed |
	| Canonical-image conditioning from the child's doodle | FLUX.2-klein-4B | Deployed |
	| Deterministic seed and character-description identity anchors | FLUX.2-klein-4B | Deployed |
	| Voice-design prompting for six narrator styles | VoxCPM2 | Deployed |
	| Zero-shot voice cloning from uploaded reference audio | VoxCPM2 | Deployed |
	| Img2img semantic redraw for matching coloring pages | FLUX.2-klein-4B | Deployed |

	### Fine-tuning work in the repository

	The repository also contains an experimental Kannada narration path referencing
	[`sush0401/IndicF5-Kannada-Bedtime-v2`](https://huggingface.co/sush0401/IndicF5-Kannada-Bedtime-v2),
	a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored
	language-specific expressive narration and reference-voice support.

	The current competition Space was simplified to one English storybook flow powered by VoxCPM2,
	so the Kannada checkpoint is not loaded by the deployed `app.py` path. It is supporting
	experimental work, not a dependency of the live demo.

	The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style.
	No published LoRA is required by the current Space. The live visual consistency comes from
	canonical image conditioning, prompt constraints, and deterministic seeds.

	This transparent separation makes the deployed result reproducible and avoids overstating
	fine-tuning that is not active in the public demo.

	---

	## 8. Coloring Book: Redraw, Do Not Trace

	A naive coloring-book implementation applies edge detection to the finished crayon pages.
	That preserves every crayon grain, shadow, and background texture, producing noisy outlines
	that are unpleasant to color.

	DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks
	for the same characters, action, emotion, and composition as clean line art. The model performs
	a semantic redraw rather than a literal texture trace.

	A lightweight local cleanup then:

	- converts the result to grayscale
	- increases contrast
	- thresholds it toward pure black and white
	- removes small speckles

	If the FLUX line-art pass fails, the application can fall back to local outline extraction.
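
The four cleanup steps can be illustrated on a plain nested list of RGB pixels. This is a dependency-free sketch of the idea, not the application's code, which would more plausibly use an image library; the threshold and neighbour rule are assumed values.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Gray = List[List[int]]  # rows of 0-255 values


def to_grayscale(img: List[List[RGB]]) -> Gray:
    """Integer luma approximation (0.299 R + 0.587 G + 0.114 B)."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row] for row in img]


def stretch_contrast(img: Gray) -> Gray:
    """Rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def threshold(img: Gray, cutoff: int = 128) -> Gray:
    """Push every pixel to pure black or pure white."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]


def remove_speckles(img: Gray) -> Gray:
    """Whiten black pixels that have no black pixel among their 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            has_black_neighbour = any(
                img[y + dy][x + dx] == 0
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w
            )
            if not has_black_neighbour:
                out[y][x] = 255
    return out


def clean_line_art(img: List[List[RGB]]) -> Gray:
    return remove_speckles(threshold(stretch_contrast(to_grayscale(img))))
```

Connected strokes survive because each stroke pixel has black neighbours, while isolated dots are dropped.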

	---

	## 9. PDFs and a Product Children Can Keep

	The generated book is not limited to a temporary browser result. DoodleBook creates:

	- a browser-readable illustrated book
	- a downloadable illustrated story PDF
	- narration audio
	- an optional browser-readable coloring book
	- an optional printable coloring-book PDF

	The PDF cover follows the same warm paper-and-crayon visual language as the application.
	This matters to the objective: the output becomes something a child can keep, share with family,
	read at bedtime, or color away from a screen.

	---

	## 10. ZeroGPU Engineering

	The public application runs on Hugging Face Spaces using one ZeroGPU Space.

	Important implementation decisions include:

	- loading the three primary models at module scope in the ZeroGPU-compatible pattern
	- separating story, image, coloring, and TTS work into GPU-decorated functions
	- using six-step FLUX generation at guidance scale 1.0 for interactive latency
	- running preset narration alongside illustration work
	- running custom voice cloning sequentially to avoid GPU contention
	- streaming heartbeat status updates during long generation stages
	- loading a pre-generated sample book immediately without requiring GPU allocation
	- preserving partial usability through story, image, and coloring fallback paths
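
The heartbeat item can be sketched as a generator that runs the slow stage in a worker thread and yields a status line at a fixed interval until the work finishes. `with_heartbeat` and its message format are hypothetical; only the general technique is taken from the list above.

```python
import threading
import time
from typing import Callable, Iterator, Tuple


def with_heartbeat(
    work: Callable[[], object],
    label: str,
    interval: float = 2.0,
) -> Iterator[Tuple[str, object]]:
    """Yield (status, None) while work runs, then (status, result) once at the end."""
    box: dict = {}

    def runner() -> None:
        try:
            box["value"] = work()
        except Exception as exc:  # re-raised in the caller's thread below
            box["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    started = time.monotonic()
    thread.start()

    while True:
        thread.join(timeout=interval)
        if not thread.is_alive():
            break
        yield (f"{label}... {time.monotonic() - started:.0f}s", None)

    if "error" in box:
        raise box["error"]
    yield (f"{label} done", box["value"])
```

A handler can forward each status string to the UI and keep the final tuple's second element as the stage result.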

	The UI is a custom Gradio Blocks interface rather than a default component stack. It uses
	paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining
	usable on desktop and mobile.

	---

	## 11. Open Trace and Reproducibility

	Every generated book exposes a trace containing:

	- selected hero name
	- selected theme
	- selected narrator
	- coloring-book selection
	- model backend
	- character description
	- deterministic seed
	- story, image, narration, PDF, and coloring timings
	- rendering engine
	- surfaced load or generation errors
	- fallback details

	This serves two audiences. Developers can understand performance and failure modes, while judges
	and users can see that the result came from a reproducible pipeline rather than a hidden manual
	process.
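
A trace of this kind can be modeled as a small record with a timing helper. The sketch below covers a subset of the fields listed above; `BookTrace`, `stage`, and the field names are illustrative assumptions, not the application's schema.

```python
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterator, List


@dataclass
class BookTrace:
    hero_name: str
    theme: str
    narrator: str
    coloring_book: bool
    seed: int
    timings: Dict[str, float] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)
    fallbacks: List[str] = field(default_factory=list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time a stage; record any error, then re-raise so the caller can fall back."""
        started = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors.append(f"{name}: {exc}")
            raise
        finally:
            self.timings[name] = round(time.perf_counter() - started, 3)
```

A typical use wraps each stage in `with trace.stage("story"):`, appends to `trace.fallbacks` when a fallback path is taken, and finally exposes `asdict(trace)` to the UI.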

	---

	## 12. Why the Model Stack Is Small but Effective

	DoodleBook's full stack is 7B parameters, and its authoring plus performance core is only
	3B:

	- MiniCPM5-1B decides what happens and how each page should look.
	- VoxCPM2 performs the authored story and optionally adapts to a family voice.
	- FLUX.2-klein-4B renders the visual plan.

	This division of responsibility is the project's Tiny Titan argument. A large monolithic
	model is not necessary when small specialist models communicate through a precise intermediate
	representation.

	The JSON story object is that representation. It connects narrative text, character identity,
	page scenes, narration, browser rendering, and PDF export.

	---

	## 13. Hackathon Contributions

	### Tiny Titan

	The complete application uses 7B parameters, with a 3B story-and-voice core.

	### OpenBMB

	MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and
	provides the zero-shot custom voice feature.

	### Black Forest Labs

	FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the
	optional coloring pages.

	### Hugging Face

	Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and
	the Hub stores the code, article, demo assets, model metadata, and reproducible commit history.

	### Codex

	Codex was used as the coding and documentation agent across the project. It helped implement
	and debug the app, refine prompts, update the story/image/audio generation instructions, prepare
	the technical article, maintain the README, create deployment commits, and publish updates to
	the Hugging Face Space.

	### Open Trace

	Generation settings, timings, model state, and fallback information are visible in the app.

	### Off-Brand

	The product has a custom scrapbook and crayon identity rather than a default Gradio appearance.

	### Field Notes

	The repository includes a dedicated technical write-up on cross-page identity:
	[Cross-Page Character Consistency Without Per-User Training](blog.md).

	---

	## 14. Limitations

	- Character consistency is strong but not mathematically guaranteed; complex doodles can drift.
	- Six FLUX illustrations plus optional coloring redraws remain the longest stage.
	- Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording.
	- The current public UI is English-only.
	- The story validator can verify structure more easily than literary quality.
	- The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA.

	These are concrete engineering boundaries, not hidden behavior.

	---

	## 15. Roadmap

	The next technically justified improvements are:

	1. Add automated story-quality scoring and one controlled rewrite pass.
	2. Add stronger visual identity evaluation between the canonical hero and each page.
	3. Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline.
	4. Reintroduce multilingual narration only after the language-specific path meets the same
	reliability and latency requirements as the English Space.
	5. Add explicit consent and retention controls around uploaded family voice references.
	6. Add a short shareable video export combining page turns, narration, and captions.

	---

	## 16. Demo Video

	The published demo video is available here:

	[Watch the MP4 demo](demo-doodlebook.mp4)

	[Open the Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)

	[View the X/Twitter post](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)

	It walks through the complete process:

	1. Show the original child's doodle.
	2. Select the hero name, a theme, and a narrator.
	3. Start generation and show the streamed progress.
	4. Play a short narration excerpt.
	5. Reveal the same hero across several illustrated pages.
	6. Download the story PDF.
	7. Show the matching coloring-book page.
	8. End on the 7B model stack and live Space URL.

	The video is linked from both this article and the Space README.

	---

	## Conclusion

	DoodleBook is not simply text generation followed by image generation. It is one coordinated
	multimodal artifact built around a child's idea.

	MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual
	identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar
	family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience
	publicly accessible.

	The result is small enough to fit the hackathon's technical constraints, but complete enough to
	feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own.