# DoodleBook: Turning a Child's Drawing into a Narrated, Illustrated Storybook

**Build Small Hackathon 2026 - Adventure in Thousand Token Wood**

**Live demo:** [build-small-hackathon/DoodleBook](https://huggingface.co/spaces/build-small-hackathon/DoodleBook)

**Demo video:** [MP4 demo](demo-doodlebook.mp4) and [Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)

**Social post:** [X/Twitter announcement](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)

**Source:** [github.com/Sushruths04/Doodle-book](https://github.com/Sushruths04/Doodle-book)

**License:** Apache-2.0

---

## Abstract

DoodleBook is a multimodal creative-learning application that turns one child's drawing
into a complete six-page picture book. A child uploads a doodle, names the hero, selects
one of ten meaningful themes, and chooses a narrator. The system writes an age-appropriate
story, converts the doodle into a consistent storybook character, illustrates every page,
narrates the story, exports a printable PDF, and can also produce a matching coloring book.

The complete deployed model stack is only **7B parameters**:

| Role | Model | Parameters |
|---|---|---:|
| Story author and scene planner | [`openbmb/MiniCPM5-1B`](https://huggingface.co/openbmb/MiniCPM5-1B) | 1B |
| Expressive narration and zero-shot voice cloning | [`openbmb/VoxCPM2`](https://huggingface.co/openbmb/VoxCPM2) | 2B |
| Character design, illustrations, and coloring-page redraws | [`black-forest-labs/FLUX.2-klein-4B`](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B) | 4B |

The central product idea is that the child's drawing is not merely an input image. It is
the identity anchor for the whole experience. The same hero appears in the narrative, the
six illustrations, the narration, the story PDF, and the optional coloring book.

---

## 1. Objective

Children rarely see their own imperfect drawings treated as finished creative work.
Generative applications can produce polished images, but they often replace the child's
idea rather than extend it. DoodleBook has a different objective:

> Preserve the personality of a child's drawing and make that drawing the hero of a
> meaningful story the child can read, hear, keep, print, and color.

That objective creates five technical requirements:

1. The story must be joyful, understandable, and emotionally meaningful for a young child.
2. The selected theme must influence the actual lesson and plot, not just appear as a label.
3. The hero must remain visually recognizable across all six illustrations.
4. The final audio must match the selected narrator or the uploaded family voice.
5. The complete pipeline must run as a public Hugging Face Space within ZeroGPU constraints.

---

## 2. The User Experience

The current Space deliberately keeps the input small:

1. Upload or photograph a drawing.
2. Enter the hero's name.
3. Select one of ten story themes.
4. Select a narrator, including the optional **My Voice** mode.
5. Choose whether to generate a matching coloring book.
6. Select **Make my book**.

The available themes are:

- brave adventure
- making a new friend
- overcoming a fear
- helping someone
- lost and found
- learning something new
- kindness to animals
- the magic of imagination
- celebrating who you are
- a rainy day adventure

The narrator choices are Little Kid, Big Kid, Playful, Storyteller, Grandpa, and My Voice.
For My Voice, the user uploads a short clear recording and VoxCPM2 performs zero-shot voice
cloning from that reference.

The result is revealed in a deliberate order:

1. Narration audio becomes available first.
2. The illustrated story PDF download appears next.
3. The full illustrated pages are revealed last.
4. If requested, the coloring-book preview and PDF are also produced.

This order gives the child something to hear as soon as possible while the heavier document
and page-rendering work completes.

---

## 3. End-to-End Architecture

```text
Child's doodle + hero name + theme + narrator
                         |
                         v
            MiniCPM5-1B story generation
          title + character + text + scenes
                    /              \
                   /                \
                  v                  v
       FLUX canonical hero       VoxCPM2 narration
                  |             preset or cloned voice
                  v
        FLUX six page renders
                  |
          +-------+--------+
          |                |
          v                v
   Illustrated PDF   FLUX line-art redraw
                           |
                           v
                    Coloring-book PDF
```

The pipeline is orchestrated in a Gradio 6 application hosted on Hugging Face Spaces.
The models are loaded using the ZeroGPU-compatible module-scope pattern, and GPU-heavy
stages are isolated behind `@spaces.GPU` functions.
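
The module-scope pattern can be sketched roughly as follows. This is a minimal illustration, not the application's code: it assumes the `spaces` package that ZeroGPU Spaces provide, while `load_story_model`, `generate_story`, and the `duration` value are hypothetical. The fallback decorator exists only so the sketch also runs outside a Space.

```python
try:
    import spaces  # available inside Hugging Face ZeroGPU Spaces
    gpu = spaces.GPU
except ImportError:
    # Local fallback (assumption): a no-op decorator so the sketch runs anywhere.
    def gpu(func=None, **_kwargs):
        if callable(func):
            return func
        return lambda f: f


def load_story_model():
    """Hypothetical loader; the real app would load MiniCPM5-1B here."""
    return {"name": "story-model-placeholder"}


# Loaded once at import time (module scope), not inside a request handler.
STORY_MODEL = load_story_model()


@gpu(duration=60)  # hypothetical time budget in seconds
def generate_story(hero_name: str, theme: str) -> str:
    """GPU-heavy stage isolated behind the decorator."""
    return f"{hero_name} and the {theme} ({STORY_MODEL['name']})"
```

Keeping each heavy stage in its own decorated function means a GPU is requested only while that stage runs.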

---

## 4. MiniCPM5-1B: Story Author and Visual Planner

MiniCPM5-1B does more than write prose. It produces a structured JSON plan containing:

- a short book title
- a reusable visual description of the hero
- six page texts
- six corresponding illustration scene descriptions

Each scene description is consumed directly by the image pipeline, so the language model
acts as both **author** and **visual director**.
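
One way to picture the structured plan is as a small typed object with a six-page contract. The field names below (`title`, `character`, `pages`, `text`, `scene`) are assumptions chosen to mirror the list above; the application's actual JSON keys may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Page:
    text: str   # read-aloud text for the page
    scene: str  # illustration description consumed by the image stage


@dataclass
class StoryPlan:
    title: str
    character: str  # reusable visual description of the hero
    pages: list[Page]


def parse_plan(raw: str) -> StoryPlan:
    """Parse model output and enforce the six-page contract."""
    data = json.loads(raw)
    pages = [Page(p["text"], p["scene"]) for p in data["pages"]]
    if len(pages) != 6:
        raise ValueError(f"expected 6 pages, got {len(pages)}")
    return StoryPlan(data["title"], data["character"], pages)
```

Because each `scene` string travels with its `text`, the image stage never has to re-derive what a page is about.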

### Theme-conditioned meaning

Every selectable theme has its own narrative guidance. For example:

- **Overcoming a fear** treats fear respectfully and shows courage as a supported small step.
- **Making a new friend** rewards listening, sharing, or a sincere hello rather than popularity.
- **Learning something new** includes an imperfect attempt, guidance, practice, and improvement.
- **Celebrating who you are** focuses on self-acceptance without making the hero superior.
- **Kindness to animals** emphasizes gentle, age-appropriate care and trusted adult support.

This prevents ten buttons from producing the same generic adventure with different titles.
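
A theme-to-guidance lookup is one simple way to implement this. The sketch below is illustrative only: the dictionary wording paraphrases the examples above, and `THEME_GUIDANCE` and `build_theme_block` are hypothetical names.

```python
# Hypothetical guidance table; only two of the ten themes are shown.
THEME_GUIDANCE = {
    "overcoming a fear": (
        "Treat fear respectfully and show courage as a supported small step."
    ),
    "making a new friend": (
        "Reward listening, sharing, or a sincere hello rather than popularity."
    ),
}


def build_theme_block(theme: str) -> str:
    """Return the prompt fragment that carries the theme's narrative guidance."""
    if theme not in THEME_GUIDANCE:
        raise KeyError(f"no guidance defined for theme: {theme}")
    return f"Theme: {theme}\nGuidance: {THEME_GUIDANCE[theme]}"
```

Failing loudly on an unknown theme keeps a missing entry from silently producing a generic story.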

### Prompt design for a small model

The prompt is detailed but organized into a small number of explicit responsibilities:

1. **Story quality:** one coherent arc, natural read-aloud language, dialogue, humor, and
   purposeful sound effects such as `BOOM!`, `WHOOSH!`, or `SPLASH!`.
2. **Emotional meaning:** the lesson must follow from what the hero actually chooses and does.
3. **Page progression:** introduction, first attempt, complication, realization, action,
   and emotional payoff.
4. **Visual continuity:** the hero remains active on every page and retains the same memorable
   physical traits.
5. **Illustration planning:** each scene describes one drawable moment with visible characters,
   action, setting, props, light, color mood, and emotion.
6. **Strict output:** valid JSON only, with exactly six page objects.

A complete six-page exemplar establishes the quality target. The example demonstrates warmth,
continuity, useful dialogue, sound effects, a safe treatment of fear, and a lesson shown through
action rather than attached as an unrelated final sentence.

### Reliability

The output parser extracts and repairs JSON when possible. If language-model generation fails,
the application has local theme-based story arcs so the user still receives a book rather than
a broken session. Errors and fallback information are exposed in the trace instead of being
hidden.
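
The extract, repair, and fallback flow might look like the sketch below. It is an assumption about the general shape, not the application's parser: the function names, the single trailing-comma repair, and the trace keys are all hypothetical.

```python
import json
import re


def extract_json(raw: str):
    """Return the outermost JSON object found in raw model output, or None."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]
    # Light repair: drop trailing commas before a closing brace or bracket.
    # Caveat: this naive pattern would also touch such text inside strings.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


def story_or_fallback(raw: str, theme: str, fallbacks: dict) -> tuple[dict, dict]:
    """Return (story, trace); the trace records whether a fallback was used."""
    story = extract_json(raw)
    if story is not None and len(story.get("pages", [])) == 6:
        return story, {"fallback": False}
    return fallbacks[theme], {"fallback": True, "reason": "invalid model output"}
```

Returning the trace alongside the story keeps the fallback visible instead of hiding it from the user.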

---

## 5. FLUX.2-klein-4B: One Drawing, One Hero, Six Pages

The hardest visual problem is cross-page character consistency. Six independent text-to-image
requests tend to produce six different heroes. Color, clothing, face shape, body proportions,
and accessories can drift enough to break the feeling that this is one book.

DoodleBook addresses this without training a new model for every child.

### Stage 1: canonical-character generation

The uploaded doodle is passed through FLUX img2img once. The prompt instructs FLUX to:

- preserve the creature or person type
- preserve face, body shape, colors, markings, clothing, and accessories
- clarify unclear lines without replacing the child's idea
- produce one full-body hero on a neutral background
- avoid extra characters, scenery, text, or duplicate views

This output becomes a canonical model sheet for the book.

### Stage 2: reference-conditioned page generation

Each story scene is rendered using the same canonical hero image. The page prompt explicitly
locks:

- face and species
- body proportions
- colors and markings
- clothing and accessories
- child-safe crayon storybook style

The scene still changes from page to page, but the identity reference remains constant.
Deterministic seeds provide reproducibility while page-specific seed offsets allow variation.

If the user does not upload a doodle, the structured character description from MiniCPM5-1B
becomes the text-based identity anchor.
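
The seed scheme can be illustrated with a small helper. This is a sketch under assumptions: `page_seed` and the `stride` value are hypothetical, and the real offsets may be computed differently.

```python
def page_seed(base_seed: int, page_index: int, stride: int = 1000) -> int:
    """Derive a reproducible per-page seed from one book-level seed."""
    # The modulo keeps the value inside the unsigned 32-bit range.
    return (base_seed + page_index * stride) % (2**32)


# One seed per page: the same book seed always yields the same six values.
seeds = [page_seed(42, i) for i in range(6)]
print(seeds)  # [42, 1042, 2042, 3042, 4042, 5042]
```

In a PyTorch-based pipeline, each value would typically seed a `torch.Generator` via `manual_seed` before that page is rendered.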

### Why this approach matters

Per-user LoRA training would add a training job before every book. That is too slow and
operationally expensive for an interactive children's application. Canonical image conditioning
provides personalization at inference time, using the child's actual drawing as the reference.

---

## 6. VoxCPM2: Expressive Narration and Custom Family Voices

VoxCPM2 converts the title and all six page texts into one narrated story. The application
splits the text into sentences, generates speech for each sentence, and inserts short pauses
between them so the result sounds like a book being read rather than one continuous block.
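
A minimal sketch of the split-and-pause step follows. The `synthesize` callable stands in for the VoxCPM2 call and is hypothetical, as are the sample rate and pause length; audio is represented as a plain list of samples to keep the example dependency-free.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after ., !, or ? followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def narrate(text, synthesize, sample_rate=16000, pause_s=0.3):
    """Concatenate per-sentence audio with a short silence between sentences."""
    silence = [0.0] * round(sample_rate * pause_s)
    audio = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            audio.extend(silence)  # pause before every sentence except the first
        audio.extend(synthesize(sentence))
    return audio
```

Splitting on `!` and `?` as well as `.` means a sound effect such as `BOOM!` becomes its own short utterance.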

### Narrator presets

Each narrator option uses a separate voice-design instruction:

- **Little Kid:** bright, playful, and full of wonder
- **Big Kid:** confident, youthful, and energetic
- **Playful:** animated delivery and light comic timing
- **Storyteller:** soft, soothing bedtime pacing
- **Grandpa:** warm, patient, and reassuring
- **My Voice:** preserves the uploaded reference speaker while adding clear storytelling delivery

The voice prompts also explain how to perform dialogue and sound effects. A bedtime voice softens
`BOOM!`; a playful voice gives it energy without becoming harsh.

### My Voice: zero-shot cloning, not fine-tuning

The deployed custom voice feature does **not** train a new audio model. VoxCPM2 is loaded with its
denoiser enabled and receives the uploaded recording through `reference_wav_path`. It performs
zero-shot voice cloning at inference time.

This distinction is important:

- no user-specific checkpoint is created
- no training job is required
- the user can hear a familiar family voice from one short reference recording
- custom-voice narration is scheduled sequentially after images to avoid concurrent ZeroGPU
  contention

Preset narration can run in parallel with illustration generation. Custom cloning is slower, so
the number of cloned sentences is capped to fit the public GPU budget.
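
The scheduling rule above can be sketched as follows. The cap value, the function names, and the use of a thread pool are assumptions made for illustration; the deployed app may coordinate its GPU stages differently.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CLONED_SENTENCES = 12  # hypothetical cap for custom-voice narration


def run_book(sentences, narrator, render_pages, narrate):
    """Return (pages, audio), choosing sequential or parallel scheduling."""
    if narrator == "My Voice":
        # Custom cloning: images first, then a capped narration pass.
        pages = render_pages()
        audio = narrate(sentences[:MAX_CLONED_SENTENCES])
        return pages, audio
    # Preset voices: narration runs alongside illustration.
    with ThreadPoolExecutor(max_workers=2) as pool:
        pages_future = pool.submit(render_pages)
        audio_future = pool.submit(narrate, sentences)
        return pages_future.result(), audio_future.result()
```

Passing `render_pages` and `narrate` in as callables keeps the scheduling decision separate from the model code.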

---

## 7. Fine-Tuning and Adaptation: What We Built

DoodleBook uses several forms of model adaptation, but they should not be confused.

### Deployed in the current Space

| Technique | Model | Status |
|---|---|---|
| Theme-specific structured prompting and few-shot story guidance | MiniCPM5-1B | Deployed |
| Canonical-image conditioning from the child's doodle | FLUX.2-klein-4B | Deployed |
| Deterministic seed and character-description identity anchors | FLUX.2-klein-4B | Deployed |
| Voice-design prompting for six narrator styles | VoxCPM2 | Deployed |
| Zero-shot voice cloning from uploaded reference audio | VoxCPM2 | Deployed |
| Img2img semantic redraw for matching coloring pages | FLUX.2-klein-4B | Deployed |

### Fine-tuning work in the repository

The repository also contains an experimental Kannada narration path referencing
[`sush0401/IndicF5-Kannada-Bedtime-v2`](https://huggingface.co/sush0401/IndicF5-Kannada-Bedtime-v2),
a bedtime-oriented IndicF5 checkpoint, with MMS-TTS and gTTS fallback tiers. This work explored
language-specific expressive narration and reference-voice support.

The current competition Space was simplified to one English storybook flow powered by VoxCPM2,
so the Kannada checkpoint is **not loaded by the deployed `app.py` path**. It is supporting
experimental work, not a dependency of the live demo.

The repository also includes a FLUX LoRA training scaffold for a more uniform crayon style.
No published LoRA is required by the current Space. The live visual consistency comes from
canonical image conditioning, prompt constraints, and deterministic seeds.

This transparent separation makes the deployed result reproducible and avoids overstating
fine-tuning that is not active in the public demo.

---

## 8. Coloring Book: Redraw, Do Not Trace

A naive coloring-book implementation applies edge detection to the finished crayon pages.
That preserves every crayon grain, shadow, and background texture, producing noisy outlines
that are unpleasant to color.

DoodleBook instead sends each finished color page back to FLUX with an img2img prompt that asks
for the same characters, action, emotion, and composition as clean line art. The model performs
a semantic redraw rather than a literal texture trace.

A lightweight local cleanup then:

- converts the result to grayscale
- increases contrast
- thresholds it toward pure black and white
- removes small speckles

If the FLUX line-art pass fails, the application can fall back to local outline extraction.
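
The cleanup steps can be sketched on a plain grid of grayscale values. This is a dependency-free illustration, assuming the image has already been converted to grayscale; a real implementation would more likely use an imaging library, and `clean_line_art` with its parameters is hypothetical.

```python
def clean_line_art(gray, threshold=128, min_neighbors=1):
    """gray: 2D list of 0..255 values. Returns a grid of 0 (ink) or 255 (paper)."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    span = max(hi - lo, 1)
    # Stretch contrast to the full range, then threshold to pure black and white.
    binary = [
        [0 if (v - lo) * 255 // span < threshold else 255 for v in row]
        for row in gray
    ]
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 0:
                continue
            # Count ink pixels among the up-to-eight surrounding cells.
            neighbors = sum(
                binary[ny][nx] == 0
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                out[y][x] = 255  # isolated ink pixel: treat it as a speckle
    return out
```

Connected strokes survive because each ink pixel on a line has at least one ink neighbor, while a lone dot does not.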

---

## 9. PDFs and a Product Children Can Keep

The generated book is not limited to a temporary browser result. DoodleBook creates:

- a browser-readable illustrated book
- a downloadable illustrated story PDF
- narration audio
- an optional browser-readable coloring book
- an optional printable coloring-book PDF

The PDF cover follows the same warm paper-and-crayon visual language as the application.
This matters to the objective: the output becomes something a child can keep, share with family,
read at bedtime, or color away from a screen.

---

## 10. ZeroGPU Engineering

The public application runs on Hugging Face Spaces using one ZeroGPU Space.

Important implementation decisions include:

- loading the three primary models at module scope in the ZeroGPU-compatible pattern
- separating story, image, coloring, and TTS work into GPU-decorated functions
- using six-step FLUX generation at guidance scale 1.0 for interactive latency
- running preset narration alongside illustration work
- running custom voice cloning sequentially to avoid GPU contention
- streaming heartbeat status updates during long generation stages
- loading a pre-generated sample book immediately without requiring GPU allocation
- preserving partial usability through story, image, and coloring fallback paths
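
The heartbeat idea can be sketched with a generator that reports progress while a worker thread runs. This is an illustration only: `with_heartbeat`, its status format, and the final `("done", result)` tuple are hypothetical. Gradio event handlers may be written as generators, which is what makes this kind of streaming possible.

```python
import threading
import time


def with_heartbeat(work, label, interval_s=1.0):
    """Run `work` in a thread, yielding status strings until it finishes.

    The last item yielded is a ("done", result) tuple.
    """
    box = {}

    def target():
        try:
            box["value"] = work()
        except Exception as exc:  # surfaced to the caller below
            box["error"] = exc

    worker = threading.Thread(target=target)
    start = time.monotonic()
    worker.start()
    while worker.is_alive():
        worker.join(timeout=interval_s)
        if worker.is_alive():
            yield f"{label}... {time.monotonic() - start:.0f}s"
    if "error" in box:
        raise box["error"]
    yield ("done", box["value"])
```

A caller would forward each string to a status component and treat the final tuple as the stage result.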

The UI is a custom Gradio Blocks interface rather than a default component stack. It uses
paper textures, hand-drawn borders, crayon colors, and child-friendly typography while remaining
usable on desktop and mobile.

---

## 11. Open Trace and Reproducibility

Every generated book exposes a trace containing:

- selected hero name
- selected theme
- selected narrator
- coloring-book selection
- model backend
- character description
- deterministic seed
- story, image, narration, PDF, and coloring timings
- rendering engine
- surfaced load or generation errors
- fallback details

This serves two audiences. Developers can understand performance and failure modes, while judges
and users can see that the result came from a reproducible pipeline rather than a hidden manual
process.
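
A trace of this kind could be collected with a small record and a timing context manager. The sketch below is an assumption about the shape, not the application's trace schema: `BookTrace`, its field names, and `stage` are hypothetical and cover only a subset of the fields listed above.

```python
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class BookTrace:
    hero_name: str
    theme: str
    narrator: str
    seed: int
    timings_s: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)
    fallbacks: list = field(default_factory=list)

    @contextmanager
    def stage(self, name: str):
        """Time one pipeline stage and record any error it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception as exc:
            self.errors.append(f"{name}: {exc}")
            raise
        finally:
            self.timings_s[name] = round(time.perf_counter() - start, 3)


trace = BookTrace("Pip", "lost and found", "Grandpa", seed=42)
with trace.stage("story"):
    pass  # the story stage would run here
trace_dict = asdict(trace)  # plain dict, ready to display or serialize
```

Recording the timing in `finally` means a failed stage still reports how long it ran before the error.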

---

## 12. Why the Model Stack Is Small but Effective

DoodleBook's full stack is **7B parameters**, and its authoring plus performance core is only
**3B**:

- MiniCPM5-1B decides what happens and how each page should look.
- VoxCPM2 performs the authored story and optionally adapts to a family voice.
- FLUX.2-klein-4B renders the visual plan.

This division of responsibility is the project's **Tiny Titan** argument. A large monolithic
model is not necessary when small specialist models communicate through a precise intermediate
representation.

The JSON story object is that representation. It connects narrative text, character identity,
page scenes, narration, browser rendering, and PDF export.

---

## 13. Hackathon Contributions

### Tiny Titan

The complete application uses 7B parameters, with a 3B story-and-voice core.

### OpenBMB

MiniCPM5-1B authors every live story and scene plan. VoxCPM2 performs every live narration and
provides the zero-shot custom voice feature.

### Black Forest Labs

FLUX.2-klein-4B creates the canonical hero, renders six story pages, and semantically redraws the
optional coloring pages.

### Hugging Face

Hugging Face Spaces hosts the public demo, ZeroGPU supplies the GPU execution environment, and
the Hub stores the code, article, demo assets, model metadata, and reproducible commit history.

### Codex

Codex was used as the coding and documentation agent across the project. It helped implement
and debug the app, refine prompts, update the story/image/audio generation instructions, prepare
the technical article, maintain the README, create deployment commits, and publish updates to
the Hugging Face Space.

### Open Trace

Generation settings, timings, model state, and fallback information are visible in the app.

### Off-Brand

The product has a custom scrapbook and crayon identity rather than a default Gradio appearance.

### Field Notes

The repository includes a dedicated technical write-up on cross-page identity:
[Cross-Page Character Consistency Without Per-User Training](blog.md).

---

## 14. Limitations

- Character consistency is strong but not mathematically guaranteed; complex doodles can drift.
- Six FLUX illustrations plus optional coloring redraws remain the longest stage.
- Zero-shot voice cloning quality depends on the clarity and duration of the uploaded recording.
- The current public UI is English-only.
- The story validator can verify structure more easily than literary quality.
- The current Space uses prompt and reference-image adaptation, not a deployed custom FLUX LoRA.

These are concrete engineering boundaries, not hidden behavior.

---

## 15. Roadmap

The next technically justified improvements are:

1. Add automated story-quality scoring and one controlled rewrite pass.
2. Add stronger visual identity evaluation between the canonical hero and each page.
3. Complete and evaluate the crayon-style FLUX LoRA against the current prompt-only baseline.
4. Reintroduce multilingual narration only after the language-specific path meets the same
   reliability and latency requirements as the English Space.
5. Add explicit consent and retention controls around uploaded family voice references.
6. Add a short shareable video export combining page turns, narration, and captions.

---

## 16. Demo Video

The published demo video is available here:

**[Watch the MP4 demo](demo-doodlebook.mp4)**

**[Open the Supademo walkthrough](https://app.supademo.com/demo/cmqfkwlro4f4wqmgj218kxnqp?utm_source=link)**

**[View the X/Twitter post](https://x.com/sushruthsgowda/status/2066639063168225452?s=46)**

It walks through the complete process:

1. Show the original child's doodle.
2. Select the hero name, a theme, and a narrator.
3. Start generation and show the streamed progress.
4. Play a short narration excerpt.
5. Reveal the same hero across several illustrated pages.
6. Download the story PDF.
7. Show the matching coloring-book page.
8. End on the 7B model stack and live Space URL.

The video is linked from both this article and the Space README.

---

## Conclusion

DoodleBook is not simply text generation followed by image generation. It is one coordinated
multimodal artifact built around a child's idea.

MiniCPM5-1B gives that idea narrative structure and meaning. FLUX.2-klein-4B preserves its visual
identity across a complete book. VoxCPM2 gives it a voice, including the option of a familiar
family voice without a training job. Hugging Face Spaces and ZeroGPU make the entire experience
publicly accessible.

The result is small enough to fit the hackathon's technical constraints, but complete enough to
feel magical: one drawing becomes a story a child can see, hear, print, keep, and make their own.