---
title: Rupkotha
emoji: π
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
- backyard-ai
- openbmb
- modal
- children
- storytelling
- bengali
- tts
- vision-language-model
- track:backyard
- sponsor:openbmb
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
---
# রূপকথা · Rupkotha

A bedtime-story app for kids. A child shows their drawings or toys, asks for a
story in their own voice (English or Bengali), and hears it read back in a warm
motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).
Inference runs on **Modal** (cloud GPUs); the Gradio Space is a thin client with
zero model weights.
## Build Small Hackathon: submission

- **Track:** Practical · **Backyard AI** (a custom story generator, the track's own example use case).
- **Relevant prizes:** OpenBMB Best MiniCPM Build · Modal Best Use.
- **🎬 Demo video:** https://youtu.be/mUUmy5JwBYo
- **📣 Social post:** https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
- **🤗 Fine-tuned model:** https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha

**The idea.** Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds
who can't yet type. A child shows a crayon drawing or a toy, asks for a story **by
voice** in **English or Bengali**, and hears it read aloud in a warm motherly
voice: gentle, short, always ending in sleep.

**The technical approach.**

- **Vision → story:** `MiniCPM-V 4.5` (8B) reads 1–4 images plus the request and writes
  a 120–150-word bedtime fable.
- **Native Bengali (our differentiator):** the stock model's Bengali was garbled, so
  we **fine-tuned MiniCPM-V itself**: we distilled 389 native Bengali stories from a
  Gemma teacher, LoRA-trained via ms-SWIFT, merged the adapter, and served the result on
  **vLLM**. A held-out evaluation, judged by a Bengali speaker, confirmed that it
  decisively beats the base model.
- **Voice:** `faster-whisper` for speech input; **VoxCPM2** (English) and **AI4Bharat
  Indic-TTS** (Bengali) for the motherly voice output.
- **Infra:** every model runs on **Modal** (Ollama + vLLM + TTS containers, scale-to-zero
  A10G/A100); the HF Space holds **zero weights** and just calls Modal functions. All
  models are well under the 32B limit. Switching and serving are driven by one config object
  (`core/model_config.py`).
> **Running this Space:** it calls Modal at runtime, so set `MODAL_TOKEN_ID` and
> `MODAL_TOKEN_SECRET` as Space secrets, and deploy `core/modal_infra.py` and
> `finetune/serve_vllm.py` first.
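
A missing secret is an easy way for the thin client to fail on its first request. The sketch below shows one possible pre-flight check. Only the two secret names come from the note above; the helper name `missing_modal_secrets` is hypothetical and is not part of this repository.

```python
import os
from typing import Mapping

# The two names below are the Space secrets listed in this README.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required secret names that are unset or blank in `env`.

    Hypothetical helper for illustration; it does not contact Modal.
    """
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_modal_secrets()
    if missing:
        print("Set these Space secrets before launching:", ", ".join(missing))
    else:
        print("Modal credentials found.")
```

Running a check like this at startup would surface a configuration problem before the first story request, rather than during it.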
## ▶️ Try it

1. Pick a **language** (English / বাংলা) and a **story style**.
2. **Upload 1–4 pictures**: a crayon drawing, a toy, a photo from the day.
3. **Type or speak** what you'd like (e.g. *"a story about my cat"*), or leave it blank.
4. Hit **✨ Tell me a story**, then read it, **press play** to hear it, and **save** favourites.

⏱️ **The first generation can take a few minutes.** The Space scales to zero, so the
first request cold-starts the models on Modal (later ones are quick). To see it run
smoothly end-to-end, watch the **[90-second demo video](https://youtu.be/mUUmy5JwBYo)**.
## Status

Story (EN + BN), STT, and TTS all run on Modal via `core/modal_infra.py`. The Bengali
path is served by a **fine-tuned MiniCPM-V 4.5** (see below); English uses the stock
model. Every `core/` function degrades gracefully to a safe fallback if a model is
unavailable, so the app always shows a story.
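
The graceful-degradation behaviour described above can be illustrated with a small wrapper. This is a minimal sketch, not the code in `core/`: the function name `story_or_fallback` and the placeholder fallback text are assumptions made for the example.

```python
from typing import Callable

# Hypothetical placeholder; the app's real fallback story is not shown in this README.
FALLBACK_STORY = (
    "Once upon a time, a sleepy little star yawned, "
    "curled up in a cloud, and fell fast asleep."
)


def story_or_fallback(generate: Callable[[], str], fallback: str = FALLBACK_STORY) -> str:
    """Call `generate`; return `fallback` if it raises or produces no usable text."""
    try:
        story = generate()
    except Exception:
        # Any model or network failure degrades to the safe fallback.
        return fallback
    if not isinstance(story, str) or not story.strip():
        return fallback
    return story.strip()
```

For example, `story_or_fallback(lambda: remote_call(images, request))` would return the remote story when the call succeeds and the fallback text otherwise, where `remote_call` stands in for whichever function the wrapper actually invokes.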
## The stack

The submission runs a single stack, **Stack A** (the OpenBMB prize path), defined in
`core/model_config.py`. (The `StackConfig` machinery remains so that another stack could be
swapped in, but only Stack A is shipped.)

| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |

Both core models (MiniCPM-V and VoxCPM2) are from the OpenBMB family, so the project is
eligible for the OpenBMB Best MiniCPM Build prize; fine-tuning MiniCPM-V itself for
Bengali strengthens that case.
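
To make the idea of a single config object concrete, here is a minimal sketch of how a stack like the one in the table could be described in code. The real `StackConfig` in `core/model_config.py` is not shown in this README, so the field names, the `Layer` class, and the `within_limit` helper are all assumptions; only the model names and approximate sizes are taken from the table.

```python
from dataclasses import dataclass

PARAM_LIMIT_B = 32.0  # the 32B per-model limit mentioned in this README


@dataclass(frozen=True)
class Layer:
    role: str        # hypothetical role label, e.g. "stt"
    model: str       # model name as listed in the table above
    params_b: float  # approximate size in billions of parameters


@dataclass(frozen=True)
class StackConfig:
    """Illustrative stand-in for the real StackConfig, whose fields may differ."""
    name: str
    layers: tuple[Layer, ...]

    def largest_params_b(self) -> float:
        return max(layer.params_b for layer in self.layers)

    def within_limit(self, limit_b: float = PARAM_LIMIT_B) -> bool:
        # True when every model in the stack is below the size limit.
        return self.largest_params_b() < limit_b


STACK_A = StackConfig(
    name="Stack A",
    layers=(
        Layer("vision_story", "MiniCPM-V 4.5 (Bengali fine-tuned)", 8.0),
        Layer("stt", "faster-whisper large-v3", 1.55),
        Layer("tts_en", "VoxCPM2", 2.0),
        Layer("tts_bn", "AI4Bharat Indic-TTS", 0.13),
    ),
)
```

Keeping the stack as immutable data makes a swap a matter of constructing a second `StackConfig` value rather than editing call sites.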
## Infrastructure

All model inference runs on **Modal**; the Gradio Space holds zero model weights
and makes zero local inference calls. Two Modal apps back the app:

- **`rupkotha`** (`core/modal_infra.py`): the base runtime, covering vision/story
  (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English,
  AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
- **`rupkotha-ft-serve`** (`finetune/serve_vllm.py`): the **Bengali fine-tuned**
  MiniCPM-V, served via **vLLM** on an A100-40GB (the full bf16 8B + vision
  encoder needs more than a 24 GB card).

The `core/` wrappers call these functions remotely; switching
`COMPUTE_LOCATION` in `model_config.py` is the only change needed to run base
inference locally (requires a GPU with 8+ GB VRAM).
| ```bash | |
| uv run modal deploy core/modal_infra.py # base runtime (EN vision, STT, TTS) | |
| uv run modal deploy finetune/serve_vllm.py # Bengali fine-tuned model (vLLM) | |
| ``` | |
## Bengali fine-tune: improving the OpenBMB model itself

The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled,
repetitive Bengali. Rather than swap in a different model, we **fine-tuned
MiniCPM-V itself** to fix it while keeping the rest of the stack intact. The whole
pipeline lives in `finetune/` and runs on Modal:

1. **Distill** native Bengali bedtime stories from a Gemma 3 teacher over ~450
   children's drawings, filtered by a purity gate, yielding **389 high-quality examples**.
2. **LoRA fine-tune** MiniCPM-V 4.5 with **ms-SWIFT** on an A100-80GB: vision
   encoder frozen, LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
3. **Merge** the adapter into standalone weights and **serve via vLLM**.
4. **Evaluate** the fine-tune against the stock model on held-out drawings (`finetune/eval_ft.py`):
   the fine-tune wins decisively (coherent native রূপকথা vs garbled, looping output),
   as confirmed by a Bengali speaker.
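
The README does not say how the purity gate in step 1 is defined. One plausible reading, sketched below purely as an assumption, is a script-purity filter: keep a distilled story only if nearly all of its letters fall in the Unicode Bengali block (U+0980 to U+09FF). The function names and the 0.95 threshold are hypothetical.

```python
import unicodedata

BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def bengali_purity(text: str) -> float:
    """Fraction of letters and combining marks that fall in the Bengali block.

    Digits, punctuation, and whitespace are ignored. Returns 0.0 for text
    with no letters at all.
    """
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    bengali = sum(BENGALI_START <= ord(ch) <= BENGALI_END for ch in relevant)
    return bengali / len(relevant)


def passes_purity_gate(text: str, threshold: float = 0.95) -> bool:
    # The threshold is an illustrative choice, not the value used in finetune/.
    return bengali_purity(text) >= threshold
```

A gate like this would reject teacher outputs that drift into English or mixed-script text, which is the kind of noise a small fine-tuning set can least afford.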

Bengali story requests now route to the fine-tuned model
(`FINETUNED_VISION_MODEL` in `model_config.py`, one-line revert to `None`); English
and audio paths are unchanged. See `finetune/README.md` for the full pipeline.

📦 **The merged fine-tuned model is published on the Hub:**
[`debrajsingha/minicpm-v45-bengali-rupkotha`](https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha).
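
The routing rule described above (Bengali goes to the fine-tune, everything else to the stock model, with `None` as the revert) can be sketched in a few lines. This is an illustration under assumptions: the `"bn"` language code, the `BASE_VISION_MODEL` identifier, and the function `vision_model_for` are hypothetical, and the value actually stored in `FINETUNED_VISION_MODEL` may differ from the Hub repo id used here.

```python
from typing import Optional

# Hypothetical identifier for the stock model; the real value is not shown in this README.
BASE_VISION_MODEL = "minicpm-v-4.5"

# Assumed to hold the published Hub repo id; setting it to None reverts to the stock model.
FINETUNED_VISION_MODEL: Optional[str] = "debrajsingha/minicpm-v45-bengali-rupkotha"


def vision_model_for(language: str, finetuned: Optional[str] = FINETUNED_VISION_MODEL) -> str:
    """Route Bengali requests to the fine-tuned model when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    return BASE_VISION_MODEL
```

With this shape, `vision_model_for("bn", finetuned=None)` falls back to the stock model, which mirrors the one-line revert.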
## Tracks & sponsor fit

Where this project lines up with the hackathon's tracks and sponsor themes:

| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator, the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal (base runtime + a vLLM-served fine-tuned model), plus the LoRA training |
## Environment & setup

This project uses **[uv](https://docs.astral.sh/uv/)** for all package and
environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).
| ```bash | |
| uv venv --python 3.12 # create the environment | |
| uv sync # install locked dependencies | |
| uv run python app.py # launch the Gradio app | |
| ``` | |
Add dependencies with `uv add <pkg>` (never `pip`). Local dev uses `uv`
(`pyproject.toml` + `uv.lock`); `requirements.txt` is intentionally minimal
(`gradio` + `modal`) because it is only for the HF Space, which is a thin client that
calls Modal and holds no model weights.
## Project layout

- `app.py`: Gradio UI (thin client; orchestration + wiring only, zero model weights).
- `core/model_config.py`: single source of truth for the model stack + compute config.
- `core/modal_infra.py`: **all** Modal GPU functions (vision, STT, TTS, translate).
- `core/vision_story.py` · `core/stt.py` · `core/tts.py` · `core/prompts.py`: thin
  wrappers that build prompts and call the Modal functions.
- `finetune/`: the Bengali fine-tune pipeline (collect → label → train → merge →
  serve → eval). See `finetune/README.md`.
- `assets/styles.css`: the night-sky / storybook theme.