---
title: Rupkotha
emoji: 🌙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
  - backyard-ai
  - openbmb
  - modal
  - children
  - storytelling
  - bengali
  - tts
  - vision-language-model
  - track:backyard
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
---

# রূপকথা · Rupkotha

A bedtime-story app for kids. A child shows their drawings or toys, asks for a
story in their own voice (English or Bengali), and hears it read back in a warm
motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).

Inference runs on **Modal** (cloud GPUs) — the Gradio Space is a thin client with
zero model weights.

## 🏆 Build Small Hackathon — submission

- **Track:** Practical · **Backyard AI** (a custom story generator — the track's own example use case).
- **Relevant prizes:** OpenBMB Best MiniCPM Build · Modal Best Use.
- **🎬 Demo video:** https://youtu.be/mUUmy5JwBYo
- **📣 Social post:** https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
- **🤗 Fine-tuned model:** https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha

**The idea.** Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds
who can't yet type. A child shows a crayon drawing or a toy, asks for a story **by
voice** in **English or Bengali**, and hears it read aloud in a warm motherly
voice — gentle, short, always ending in sleep.

**The technical approach.**
- **Vision → story:** `MiniCPM-V 4.5` (8B) reads 1–4 images + the request and writes
  a 120–150-word bedtime fable.
- **Native Bengali (our differentiator):** the stock model's Bengali was garbled, so
  we **fine-tuned MiniCPM-V itself**: distilled 389 native Bengali stories from a
  Gemma 3 teacher, LoRA-trained via ms-SWIFT, merged, and served on **vLLM**. On
  held-out drawings, a Bengali speaker confirmed the fine-tune writes coherent
  stories where the base model's output was garbled and looping.
- **Voice:** `faster-whisper` for speech input; **VoxCPM2** (English) and **AI4Bharat
  Indic-TTS** (Bengali) for the motherly voice output.
- **Infra:** every model runs on **Modal** (Ollama + vLLM + TTS containers, scale-to-zero
  A10G/A100); the HF Space holds **zero weights** and just calls Modal functions. All
  models are well under the 32B limit. Switching/serving is driven by one config object
  (`core/model_config.py`).

> **Running this Space:** it calls Modal at runtime — set `MODAL_TOKEN_ID` and
> `MODAL_TOKEN_SECRET` as Space secrets, and deploy `core/modal_infra.py` +
> `finetune/serve_vllm.py` first.
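
A quick pre-flight check for the two secrets can save a confusing first run. The sketch below is illustrative only: the secret names come from the note above, while the helper `missing_modal_secrets` is hypothetical and not part of this repository.

```python
import os

# Secret names taken from the note above; the helper itself is hypothetical.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the required secret names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling `missing_modal_secrets()` with no arguments inspects the real environment; passing a plain dict makes the check easy to test.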

## ▶️ Try it

1. Pick a **language** (English / বাংলা) and a **story style**.
2. **Upload 1–4 pictures** — a crayon drawing, a toy, a photo from the day.
3. **Type or speak** what you'd like (e.g. *"a story about my cat"*) — or leave it blank.
4. Hit **✨ Tell me a story** → read it, **press play** to hear it, and **save** favourites.

⏱️ **The first generation can take a few minutes.** The Modal containers scale to
zero, so the first request cold-starts the models (later ones are quick). To see it
run smoothly end-to-end, watch the **[90-second demo video](https://youtu.be/mUUmy5JwBYo)**.

## Status

Story (EN + BN), STT, and TTS all run on Modal via `core/modal_infra.py`. The Bengali
path is served by a **fine-tuned MiniCPM-V 4.5** (see below); English uses the stock
model. Every `core/` function degrades gracefully to a safe fallback if a model is
unavailable, so the app always shows a story.
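
One way to get this kind of graceful degradation is a small decorator around each model call. This is a minimal sketch under assumptions, not the code in `core/`: the names `with_fallback`, `generate_story`, `unavailable_model`, and the fallback text are all made up for illustration.

```python
import functools

# Hypothetical fallback text; the app's real fallback stories are not shown here.
FALLBACK_STORY = (
    "Once upon a time, a little star yawned, pulled a soft cloud over "
    "itself like a blanket, and drifted gently off to sleep."
)


def with_fallback(fallback):
    """Decorator: return `fallback` if the wrapped call raises or returns nothing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:  # e.g. a remote model that is cold or unreachable
                return fallback
            return result or fallback
        return wrapper
    return decorator


@with_fallback(FALLBACK_STORY)
def generate_story(request, model_call):
    """`model_call` stands in for whatever invokes the model remotely."""
    return model_call(request)


def unavailable_model(request):
    """Simulates a model backend that is down."""
    raise RuntimeError("model unavailable")
```

With this shape, `generate_story("a cat", unavailable_model)` returns the fallback story instead of raising, so the UI always has text to show.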

## The stack

The submission runs a single stack — **Stack A**, the OpenBMB prize path — defined in
`core/model_config.py`. (The `StackConfig` machinery remains so a stack could be
swapped in, but only Stack A is shipped.)

| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5: stock via Ollama (English), Bengali fine-tune via vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |

Both core models (MiniCPM-V and VoxCPM2) come from the OpenBMB family, which makes
the project eligible for the OpenBMB Best MiniCPM Build prize; fine-tuning MiniCPM-V
itself for Bengali strengthens that case.
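
To make the "one config object" idea concrete, here is a rough sketch of how a stack like the one in the table could be described. The real `StackConfig` in `core/model_config.py` may look quite different: the `ModelSpec` class and every field name below are assumptions, and the parameter counts are the approximate figures from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_billions: float  # approximate, from the table above


@dataclass(frozen=True)
class StackConfig:
    # Field names are illustrative, not the project's actual attributes.
    vision_story: ModelSpec
    stt: ModelSpec
    tts_english: ModelSpec
    tts_bengali: ModelSpec

    def models(self):
        return (self.vision_story, self.stt, self.tts_english, self.tts_bengali)

    def largest_params(self):
        return max(m.params_billions for m in self.models())


STACK_A = StackConfig(
    vision_story=ModelSpec("MiniCPM-V 4.5", 8.0),
    stt=ModelSpec("faster-whisper large-v3", 1.55),
    tts_english=ModelSpec("VoxCPM2", 2.0),
    tts_bengali=ModelSpec("AI4Bharat Indic-TTS", 0.13),
)

PARAM_LIMIT_BILLIONS = 32  # the hackathon's per-model limit
assert STACK_A.largest_params() < PARAM_LIMIT_BILLIONS
```

Keeping the stack in one frozen object means a different stack could be swapped in by constructing another `StackConfig`, without touching the wrappers that read it.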

## Infrastructure

All model inference runs on **Modal** — the Gradio Space holds zero model weights
and makes zero local inference calls. Two Modal apps back the app:

- **`rupkotha`** (`core/modal_infra.py`) — the base runtime: vision/story
  (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English,
  AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
- **`rupkotha-ft-serve`** (`finetune/serve_vllm.py`) — the **Bengali fine-tuned**
  MiniCPM-V, served via **vLLM** on an A100-40GB (the full bf16 8B + vision
  encoder needs more than a 24 GB card).

The `core/` wrappers call these functions remotely; switching
`COMPUTE_LOCATION` in `model_config.py` is the only change needed to run base
inference locally (requires a GPU with 8+ GB VRAM).

```bash
uv run modal deploy core/modal_infra.py        # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py     # Bengali fine-tuned model (vLLM)
```
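
The `COMPUTE_LOCATION` switch described above can be pictured as a single dispatch point. This is only a sketch: the values `"modal"` and `"local"`, the function `run_inference`, and its callable parameters are assumptions, since the actual options in `model_config.py` are not listed here.

```python
# Assumed values; the real options for COMPUTE_LOCATION are not listed here.
COMPUTE_LOCATION = "modal"  # or "local"


def run_inference(payload, remote_fn, local_fn, location=None):
    """Dispatch one call to the remote or local backend.

    `remote_fn` and `local_fn` are placeholders for the real backends.
    """
    location = COMPUTE_LOCATION if location is None else location
    if location == "modal":
        return remote_fn(payload)
    if location == "local":
        return local_fn(payload)
    raise ValueError(f"Unknown COMPUTE_LOCATION: {location!r}")
```

Because callers only ever go through the dispatch function, changing the one setting is enough to move inference between backends.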

## Bengali fine-tune — improving the OpenBMB model itself

The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled,
repetitive Bengali. Rather than swap in a different model, we **fine-tuned
MiniCPM-V itself** to fix it while keeping the rest of the stack intact. The whole
pipeline lives in `finetune/` and runs on Modal:

1. **Distill** native Bengali bedtime stories from a Gemma 3 teacher over ~450
   children's drawings, filtered by a purity gate → **389 high-quality examples**.
2. **LoRA fine-tune** MiniCPM-V 4.5 with **ms-SWIFT** on an A100-80GB — vision
   encoder frozen, LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
3. **Merge** the adapter into standalone weights and **serve via vLLM**.
4. **Evaluate** the fine-tune against the stock model on held-out drawings
   (`finetune/eval_ft.py`): the fine-tune wins decisively, producing coherent native
   রূপকথা where the stock model gives garbled, looping output, as confirmed by a
   Bengali speaker.
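
The purity gate in step 1 is not specified in this README. One plausible interpretation, sketched here purely as an assumption, is a script-ratio filter: keep a story only if nearly all of its letters fall in the Unicode Bengali block. The function names and the 0.9 threshold are hypothetical.

```python
BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def bengali_purity(text):
    """Fraction of letters that are Bengali.

    Only characters for which str.isalpha() is true are counted, so
    spaces, digits, punctuation, and combining vowel signs are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if BENGALI_START <= ord(ch) <= BENGALI_END)
    return bengali / len(letters)


def passes_purity_gate(text, threshold=0.9):
    """Hypothetical gate: the 0.9 threshold is an assumption, not the project's value."""
    return bengali_purity(text) >= threshold
```

A filter like this would reject teacher outputs that drift into English or mixed script, which is one common failure mode when distilling into a lower-resource language.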

Bengali story requests now route to the fine-tuned model
(`FINETUNED_VISION_MODEL` in `model_config.py`, one-line revert to `None`); English
and audio paths are unchanged. See `finetune/README.md` for the full pipeline.
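
The routing rule can be sketched in a few lines. Only the name `FINETUNED_VISION_MODEL` comes from this README; the model identifiers, the `"bn"` language code, and `pick_vision_model` are placeholders chosen for illustration.

```python
BASE_VISION_MODEL = "minicpm-v-4.5-base"          # placeholder identifier
FINETUNED_VISION_MODEL = "minicpm-v-4.5-bengali"  # placeholder; set to None to revert


def pick_vision_model(language, finetuned=FINETUNED_VISION_MODEL):
    """Route Bengali requests to the fine-tune when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    return BASE_VISION_MODEL
```

Passing `finetuned=None` models the one-line revert: every request, Bengali included, falls back to the base model.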

📦 **The merged fine-tuned model is published on the Hub:**
[`debrajsingha/minicpm-v45-bengali-rupkotha`](https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha).

## Tracks & sponsor fit

Where this project lines up with the hackathon's tracks and sponsor themes:

| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator — the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal — base runtime + a vLLM-served fine-tuned model, plus the LoRA training |

## Environment & setup

This project uses **[uv](https://docs.astral.sh/uv/)** for all package and
environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).

```bash
uv venv --python 3.12      # create the environment
uv sync                    # install locked dependencies
uv run python app.py       # launch the Gradio app
```

Add dependencies with `uv add <pkg>` (never `pip`). Local dev uses `uv`
(`pyproject.toml` + `uv.lock`); `requirements.txt` is intentionally minimal
(`gradio` + `modal`) — it's only for the HF Space, which is a thin client that
calls Modal and holds no model weights.

## Project layout

- `app.py` — Gradio UI (thin client; orchestration + wiring only, zero model weights).
- `core/model_config.py` — single source of truth for the model stack + compute config.
- `core/modal_infra.py` — **all** Modal GPU functions (vision, STT, TTS, translate).
- `core/vision_story.py` · `core/stt.py` · `core/tts.py` · `core/prompts.py` — thin
  wrappers that build prompts and call the Modal functions.
- `finetune/` — the Bengali fine-tune pipeline (collect → label → train → merge →
  serve → eval). See `finetune/README.md`.
- `assets/styles.css` — the night-sky / storybook theme.