---
title: Rupkotha
emoji: 🌙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
  - backyard-ai
  - openbmb
  - modal
  - children
  - storytelling
  - bengali
  - tts
  - vision-language-model
  - track:backyard
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
---

# রূপকথা · Rupkotha

A bedtime-story app for kids. A child shows their drawings or toys, asks for a story in their own voice (English or Bengali), and hears it read back in a warm motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).

Inference runs on **Modal** (cloud GPUs); the Gradio Space is a thin client with zero model weights.

## 🏆 Build Small Hackathon: submission

- **Track:** Practical · **Backyard AI** (a custom story generator, the track's own example use case).
- **Relevant prizes:** OpenBMB Best MiniCPM Build · Modal Best Use.
- **🎬 Demo video:** https://youtu.be/mUUmy5JwBYo
- **📣 Social post:** https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
- **🤗 Fine-tuned model:** https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha

**The idea.** Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds who can't yet type. A child shows a crayon drawing or a toy, asks for a story **by voice** in **English or Bengali**, and hears it read aloud in a warm motherly voice: gentle, short, always ending in sleep.

**The technical approach.**

- **Vision → story:** `MiniCPM-V 4.5` (8B) reads 1–4 images + the request and writes a 120–150-word bedtime fable.
- **Native Bengali (our differentiator):** the stock model's Bengali was garbled, so we **fine-tuned MiniCPM-V itself**: distilled 389 native Bengali stories from a Gemma teacher, LoRA-trained via ms-SWIFT, merged, and served on **vLLM**.
  A held-out evaluation, judged by a Bengali speaker, confirmed that the fine-tune decisively beats the base model.
- **Voice:** `faster-whisper` for speech input; **VoxCPM2** (English) and **AI4Bharat Indic-TTS** (Bengali) for the motherly voice output.
- **Infra:** every model runs on **Modal** (Ollama + vLLM + TTS containers, scale-to-zero A10G/A100); the HF Space holds **zero weights** and just calls Modal functions.

All models are well under the 32B limit. Model switching and serving are driven by one config object (`core/model_config.py`).

> **Running this Space:** it calls Modal at runtime. Set `MODAL_TOKEN_ID` and
> `MODAL_TOKEN_SECRET` as Space secrets, and deploy `core/modal_infra.py` +
> `finetune/serve_vllm.py` first.

## ▶️ Try it

1. Pick a **language** (English / বাংলা) and a **story style**.
2. **Upload 1–4 pictures**: a crayon drawing, a toy, a photo from the day.
3. **Type or speak** what you'd like (e.g. *"a story about my cat"*), or leave it blank.
4. Hit **✨ Tell me a story** → read it, **press play** to hear it, and **save** favourites.

⏱️ **The first generation can take a few minutes.** The Space scales to zero, so the first request cold-starts the models on Modal (later ones are quick). To see it run smoothly end-to-end, watch the **[90-second demo video](https://youtu.be/mUUmy5JwBYo)**.

## Status

Story (EN + BN), STT, and TTS all run on Modal via `core/modal_infra.py`. The Bengali path is served by a **fine-tuned MiniCPM-V 4.5** (see below); English uses the stock model. Every `core/` function degrades gracefully to a safe fallback if a model is unavailable, so the app always shows a story.

## The stack

The submission runs a single stack, **Stack A** (the OpenBMB prize path), defined in `core/model_config.py`. (The `StackConfig` machinery remains so another stack could be swapped in, but only Stack A is shipped.)
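
One way to picture the single config object described above is a small frozen dataclass. The sketch below is illustrative only: the field names, `STACK_A`, the `"en"`/`"bn"` language codes, and `vision_model_for` are hypothetical and are not taken from `core/model_config.py`. Only the model names and the Hub id come from this README.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical shape of a model stack; field names are assumptions."""
    vision_model: str
    stt_model: str
    tts_en: str
    tts_bn: str
    # Fine-tuned model used for Bengali stories; None means "use the stock model".
    finetuned_vision_model: Optional[str] = None


# Stack A as described in this README (values are labels, not runtime identifiers).
STACK_A = StackConfig(
    vision_model="MiniCPM-V 4.5",
    stt_model="faster-whisper large-v3",
    tts_en="VoxCPM2",
    tts_bn="AI4Bharat Indic-TTS",
    finetuned_vision_model="debrajsingha/minicpm-v45-bengali-rupkotha",
)


def vision_model_for(language: str, stack: StackConfig = STACK_A) -> str:
    """Route Bengali requests to the fine-tune when one is configured."""
    if language == "bn" and stack.finetuned_vision_model:
        return stack.finetuned_vision_model
    return stack.vision_model


# Reverting the Bengali path is a one-field change on a copy of the stack.
STACK_A_STOCK = replace(STACK_A, finetuned_vision_model=None)
```

Keeping the dataclass frozen means a stack cannot be mutated by accident at runtime; swapping models is done by building a new `StackConfig`, as `STACK_A_STOCK` does here.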

| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |

The core models (MiniCPM-V + VoxCPM2) are all from the OpenBMB family, so the project is eligible for the OpenBMB Best MiniCPM Build prize, a case strengthened by fine-tuning MiniCPM-V itself for Bengali.

## Infrastructure

All model inference runs on **Modal**: the Gradio Space holds zero model weights and makes zero local inference calls. Two Modal apps back the app:

- **`rupkotha`** (`core/modal_infra.py`): the base runtime, covering vision/story (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English, AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
- **`rupkotha-ft-serve`** (`finetune/serve_vllm.py`): the **Bengali fine-tuned** MiniCPM-V, served via **vLLM** on an A100-40GB (the full bf16 8B + vision encoder needs more than a 24 GB card).

The `core/` wrappers call these functions remotely; switching `COMPUTE_LOCATION` in `model_config.py` is the only change needed to run base inference locally (requires a GPU with 8+ GB VRAM).

```bash
uv run modal deploy core/modal_infra.py      # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py   # Bengali fine-tuned model (vLLM)
```

## Bengali fine-tune: improving the OpenBMB model itself

The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled, repetitive Bengali. Rather than swap in a different model, we **fine-tuned MiniCPM-V itself** to fix it while keeping the rest of the stack intact. The whole pipeline lives in `finetune/` and runs on Modal:

1. **Distill** native Bengali bedtime stories from a Gemma 3 teacher over ~450 children's drawings, filtered by a purity gate → **389 high-quality examples**.
2. **LoRA fine-tune** MiniCPM-V 4.5 with **ms-SWIFT** on an A100-80GB, with the vision encoder frozen and LoRA (r=16) applied to the LLM self-attention only (the weak-Bengali part).
3. **Merge** the adapter into standalone weights and **serve via vLLM**.
4. **Evaluate** the fine-tune against the stock model on held-out drawings (`finetune/eval_ft.py`): the fine-tune wins decisively (coherent native রূপকথা vs garbled, looping output), as confirmed by a Bengali speaker.

Bengali story requests now route to the fine-tuned model (set by `FINETUNED_VISION_MODEL` in `model_config.py`; setting it to `None` is a one-line revert); English and audio paths are unchanged. See `finetune/README.md` for the full pipeline.

📦 **The merged fine-tuned model is published on the Hub:** [`debrajsingha/minicpm-v45-bengali-rupkotha`](https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha).

## Tracks & sponsor fit

Where this project lines up with the hackathon's tracks and sponsor themes:

| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator, the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal: base runtime + a vLLM-served fine-tuned model, plus the LoRA training |

## Environment & setup

This project uses **[uv](https://docs.astral.sh/uv/)** for all package and environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).

```bash
uv venv --python 3.12        # create the environment
uv sync                      # install locked dependencies
uv run python app.py         # launch the Gradio app
```

Add dependencies with `uv add` (never `pip`). Local dev uses `uv` (`pyproject.toml` + `uv.lock`); `requirements.txt` is intentionally minimal (`gradio` + `modal`) because it is only for the HF Space, which is a thin client that calls Modal and holds no model weights.

## Project layout

- `app.py`: Gradio UI (thin client; orchestration + wiring only, zero model weights).
- `core/model_config.py`: single source of truth for the model stack + compute config.
- `core/modal_infra.py`: **all** Modal GPU functions (vision, STT, TTS, translate).
- `core/vision_story.py` · `core/stt.py` · `core/tts.py` · `core/prompts.py`: thin wrappers that build prompts and call the Modal functions.
- `finetune/`: the Bengali fine-tune pipeline (collect → label → train → merge → serve → eval). See `finetune/README.md`.
- `assets/styles.css`: the night-sky / storybook theme.
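
The thin-wrapper pattern in `core/` (build a prompt, call a Modal function, fall back to a safe story if the call fails) might look roughly like the sketch below. It is not the code in `core/vision_story.py`: `tell_story`, `build_prompt`, `FALLBACK_STORIES`, the `"en"`/`"bn"` codes, and the `remote_call` parameter are all hypothetical names. The remote function is passed in as a plain callable so the sketch runs without Modal installed; in the real app that callable would presumably be the `.remote` method of a deployed Modal function.

```python
from typing import Callable, Sequence

# Hypothetical canned stories, shown only when the remote model is unavailable.
FALLBACK_STORIES = {
    "en": "The moon tucked the stars into bed, and everyone slept. Good night.",
    "bn": "শুভ রাত্রি, ছোট্ট বন্ধু।",
}


def build_prompt(request: str, language: str) -> str:
    """Assumed prompt shape; the real prompts live in core/prompts.py."""
    topic = request.strip() or "the pictures"
    return (
        f"Write a gentle 120-150 word bedtime story in language '{language}' "
        f"about {topic}. End with the characters falling asleep."
    )


def tell_story(
    remote_call: Callable[[Sequence[bytes], str], str],
    images: Sequence[bytes],
    request: str = "",
    language: str = "en",
) -> str:
    """Call the remote model; never raise, always return some story text."""
    try:
        story = remote_call(images, build_prompt(request, language)) or ""
    except Exception:
        # Cold-start timeouts, missing secrets, or model errors all land here.
        story = ""
    story = story.strip()
    if story:
        return story
    # Unknown languages fall back to the English story.
    return FALLBACK_STORIES.get(language, FALLBACK_STORIES["en"])
```

Catching every exception is deliberately blunt: for a bedtime app, showing a canned story is a better failure mode than showing a stack trace, though a real wrapper would likely also log the error.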