---
title: Rupkotha
emoji: 🌙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
- backyard-ai
- openbmb
- modal
- children
- storytelling
- bengali
- tts
- vision-language-model
- track:backyard
- sponsor:openbmb
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
---
# রূপকথা · Rupkotha
A bedtime-story app for kids. A child shows their drawings or toys, asks for a
story in their own voice (English or Bengali), and hears it read back in a warm
motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).
Inference runs on **Modal** (cloud GPUs); the Gradio Space is a thin client with
zero model weights.
## 🏆 Build Small Hackathon: submission
- **Track:** Practical · **Backyard AI** (a custom story generator, the track's own example use case).
- **Relevant prizes:** OpenBMB Best MiniCPM Build · Modal Best Use.
- **🎬 Demo video:** https://youtu.be/mUUmy5JwBYo
- **📣 Social post:** https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
- **🤗 Fine-tuned model:** https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha
**The idea.** Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds
who can't yet type. A child shows a crayon drawing or a toy, asks for a story **by
voice** in **English or Bengali**, and hears it read aloud in a warm motherly
voice: gentle, short, always ending in sleep.
**The technical approach.**
- **Vision → story:** `MiniCPM-V 4.5` (8B) reads 1–4 images + the request and writes
a 120–150-word bedtime fable.
- **Native Bengali (our differentiator):** the stock model's Bengali was garbled, so
we **fine-tuned MiniCPM-V itself**: distilled 389 native Bengali stories from a
Gemma teacher, LoRA-trained via ms-SWIFT, merged, and served on **vLLM**. In a held-out
eval judged by a Bengali speaker, it clearly beat the base model.
- **Voice:** `faster-whisper` for speech input; **VoxCPM2** (English) and **AI4Bharat
Indic-TTS** (Bengali) for the motherly voice output.
- **Infra:** every model runs on **Modal** (Ollama + vLLM + TTS containers, scale-to-zero
A10G/A100); the HF Space holds **zero weights** and just calls Modal functions. All
models are well under the 32B limit. Switching/serving is driven by one config object
(`core/model_config.py`).
> **Running this Space:** it calls Modal at runtime, so set `MODAL_TOKEN_ID` and
> `MODAL_TOKEN_SECRET` as Space secrets, and deploy `core/modal_infra.py` +
> `finetune/serve_vllm.py` first.
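
A minimal sketch of a startup check for the two secrets named above, assuming the Space reads them from environment variables (Hugging Face exposes Space secrets that way). The helper `missing_modal_secrets` is hypothetical and not part of this repo.

```python
import os
from typing import Mapping

# The two secrets this README asks you to set on the Space.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling `missing_modal_secrets()` at startup and showing a friendly message when the list is non-empty would surface a misconfigured Space before the first story request fails.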
## ▶️ Try it
1. Pick a **language** (English / বাংলা) and a **story style**.
2. **Upload 1–4 pictures**: a crayon drawing, a toy, a photo from the day.
3. **Type or speak** what you'd like (e.g. *"a story about my cat"*), or leave it blank.
4. Hit **✨ Tell me a story** → read it, **press play** to hear it, and **save** favourites.
⏱️ **The first generation can take a few minutes.** The Space scales to zero, so the
first request cold-starts the models on Modal (later ones are quick). To see it run
smoothly end-to-end, watch the **[90-second demo video](https://youtu.be/mUUmy5JwBYo)**.
## Status
Story (EN + BN), STT, and TTS all run on Modal via `core/modal_infra.py`. The Bengali
path is served by a **fine-tuned MiniCPM-V 4.5** (see below); English uses the stock
model. Every `core/` function degrades gracefully to a safe fallback if a model is
unavailable, so the app always shows a story.
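The graceful-degradation pattern described above can be sketched as a decorator. This is an illustration only: `with_fallback`, `FALLBACK_STORY`, and the stand-in `generate_story` are hypothetical names, and the real `core/` wrappers may handle failures differently.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger("rupkotha.fallback")

# Hypothetical placeholder text, not the repo's actual fallback story.
FALLBACK_STORY = "Once upon a time, a sleepy little star yawned and closed its eyes. Goodnight."


def with_fallback(fallback: str) -> Callable:
    """Decorator: return `fallback` instead of raising if the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                logger.exception("%s failed; using fallback", func.__name__)
                return fallback
            # Treat an empty result as a failure too.
            return result or fallback
        return wrapper
    return decorator


@with_fallback(FALLBACK_STORY)
def generate_story(prompt: str) -> str:
    # Stand-in for a remote model call; here it always fails.
    raise TimeoutError("model unavailable")
```

With this shape, the UI code never needs its own `try`/`except`: every call returns some string to display.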
## The stack
The submission runs a single stack, **Stack A** (the OpenBMB prize path), defined in
`core/model_config.py`. (The `StackConfig` machinery remains so that another stack could be
swapped in, but only Stack A is shipped.)
| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |
The core models (MiniCPM-V + VoxCPM2) are all from the OpenBMB family, which makes the
project eligible for the OpenBMB Best MiniCPM Build prize; fine-tuning MiniCPM-V itself
for Bengali strengthens the case.
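As a rough illustration of how the table above could be expressed as one config object, here is a sketch using dataclasses. The field names, `ModelSpec`, and `within_limit` are assumptions made for this example; the actual `StackConfig` in `core/model_config.py` may look quite different. The 32B figure is the limit mentioned earlier, read here as a per-model cap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_b: float  # approximate parameter count, in billions


@dataclass(frozen=True)
class StackConfig:
    # Hypothetical field names, one per layer in the table.
    vision_story: ModelSpec
    stt: ModelSpec
    tts_en: ModelSpec
    tts_bn: ModelSpec

    def models(self) -> tuple[ModelSpec, ...]:
        return (self.vision_story, self.stt, self.tts_en, self.tts_bn)

    def within_limit(self, limit_b: float = 32.0) -> bool:
        """True if every model is under the size limit (assumed per model)."""
        return all(m.params_b < limit_b for m in self.models())


# Values copied from the table above.
STACK_A = StackConfig(
    vision_story=ModelSpec("MiniCPM-V 4.5", 8.0),
    stt=ModelSpec("faster-whisper large-v3", 1.55),
    tts_en=ModelSpec("VoxCPM2", 2.0),
    tts_bn=ModelSpec("AI4Bharat Indic-TTS", 0.13),
)
```

Keeping the stack in one frozen object means a second stack would only need another `StackConfig(...)` instance, not changes scattered across the wrappers.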
## Infrastructure
All model inference runs on **Modal**; the Gradio Space holds zero model weights
and makes zero local inference calls. Two Modal apps back the app:
- **`rupkotha`** (`core/modal_infra.py`): the base runtime, covering vision/story
(MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English,
AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
- **`rupkotha-ft-serve`** (`finetune/serve_vllm.py`): the **Bengali fine-tuned**
MiniCPM-V, served via **vLLM** on an A100-40GB (the full bf16 8B + vision
encoder needs more than a 24 GB card).
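
A back-of-the-envelope check of the 24 GB remark above, counting weights only. It is a rough estimate under simple assumptions (2 bytes per bf16 parameter, decimal gigabytes) and ignores the vision encoder, KV cache, activations, and runtime overhead, all of which must fit in the remaining headroom.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in decimal GB (bf16 = 2 bytes per parameter)."""
    # (params_billion * 1e9 params) * bytes_per_param / (1e9 bytes per GB)
    return params_billion * bytes_per_param


llm_weights = weight_memory_gb(8.0)   # weights of an 8B model in bf16
headroom_24gb = 24.0 - llm_weights    # what a 24 GB card would have left over
headroom_40gb = 40.0 - llm_weights    # the same figure on an A100-40GB
```

On these assumptions the language-model weights alone take 16 GB, leaving about 8 GB on a 24 GB card versus 24 GB on the A100-40GB.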
The `core/` wrappers call these functions remotely; switching
`COMPUTE_LOCATION` in `model_config.py` is the only change needed to run base
inference locally (requires a GPU with 8+ GB VRAM).
```bash
uv run modal deploy core/modal_infra.py # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py # Bengali fine-tuned model (vLLM)
```
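The `COMPUTE_LOCATION` switch described above could be wired up roughly as follows. This is a sketch, not the repo's code: the values `"modal"` and `"local"`, and the stand-in functions `run_remote` and `run_local`, are assumptions for illustration.

```python
from typing import Callable

# Assumed values; the real setting lives in core/model_config.py.
COMPUTE_LOCATION = "modal"  # or "local"


def run_remote(prompt: str) -> str:
    # Stand-in for a remote call to a deployed Modal function.
    return f"[modal] {prompt}"


def run_local(prompt: str) -> str:
    # Stand-in for inference on a local GPU.
    return f"[local] {prompt}"


BACKENDS: dict[str, Callable[[str], str]] = {"modal": run_remote, "local": run_local}


def generate(prompt: str, location: str = COMPUTE_LOCATION) -> str:
    """Dispatch to the backend selected by the config value."""
    try:
        backend = BACKENDS[location]
    except KeyError:
        raise ValueError(f"unknown compute location: {location!r}") from None
    return backend(prompt)
```

Because callers only ever use `generate`, flipping the single config value changes where inference runs without touching the UI.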
## Bengali fine-tune: improving the OpenBMB model itself
The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled,
repetitive Bengali. Rather than swap in a different model, we **fine-tuned
MiniCPM-V itself** to fix it while keeping the rest of the stack intact. The whole
pipeline lives in `finetune/` and runs on Modal:
1. **Distill** native Bengali bedtime stories from a Gemma 3 teacher over ~450
children's drawings, filtered by a purity gate → **389 high-quality examples**.
2. **LoRA fine-tune** MiniCPM-V 4.5 with **ms-SWIFT** on an A100-80GB, with the vision
encoder frozen and LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
3. **Merge** the adapter into standalone weights and **serve via vLLM**.
4. **Evaluate** FT vs the stock model on held-out drawings (`finetune/eval_ft.py`):
the fine-tune wins decisively, producing coherent native রূপকথা where the stock model
gives garbled, looping output, as confirmed by a Bengali speaker.
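
This README does not say what the purity gate in step 1 checks. One plausible reading is a script-purity filter: keep a distilled story only if nearly all of its letters are Bengali. The sketch below shows that idea under that assumption. The function names and the 0.95 threshold are hypothetical, and the real gate in `finetune/` may work differently.

```python
# Unicode range of the Bengali block.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def bengali_purity(text: str) -> float:
    """Fraction of letters and marks that fall in the Bengali Unicode block.

    Whitespace, digits outside the block, and punctuation are ignored.
    """
    bengali = other = 0
    for ch in text:
        if BENGALI_START <= ord(ch) <= BENGALI_END:
            bengali += 1
        elif ch.isalpha():
            other += 1
    total = bengali + other
    return bengali / total if total else 0.0


def passes_purity_gate(text: str, threshold: float = 0.95) -> bool:
    """Assumed gate: accept a story only if it is almost entirely Bengali script."""
    return bengali_purity(text) >= threshold
```

A filter like this would reject teacher outputs that drift into English or mix scripts, which is one way ~450 candidates could shrink to 389 examples.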
Bengali story requests now route to the fine-tuned model
(`FINETUNED_VISION_MODEL` in `model_config.py`, one-line revert to `None`); English
and audio paths are unchanged. See `finetune/README.md` for the full pipeline.
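The routing rule above, including the one-line revert, could look roughly like this. Only `FINETUNED_VISION_MODEL` comes from the repo; `BASE_VISION_MODEL`, its placeholder value, the `"bn"`/`"en"` language codes, and `pick_vision_model` are assumptions for this sketch.

```python
from typing import Optional

# Placeholder identifier for the stock model (assumed, not the repo's value).
BASE_VISION_MODEL = "minicpm-v-4.5"

# Set to None to send Bengali requests back to the stock model.
FINETUNED_VISION_MODEL: Optional[str] = "debrajsingha/minicpm-v45-bengali-rupkotha"


def pick_vision_model(
    language: str, finetuned: Optional[str] = FINETUNED_VISION_MODEL
) -> str:
    """Route Bengali ("bn") to the fine-tune when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    return BASE_VISION_MODEL
```

Setting the constant to `None` disables the fine-tuned path without touching any call site, which matches the "one-line revert" described above.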
📦 **The merged fine-tuned model is published on the Hub:**
[`debrajsingha/minicpm-v45-bengali-rupkotha`](https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha).
## Tracks & sponsor fit
Where this project lines up with the hackathon's tracks and sponsor themes:
| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator, the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal: base runtime + a vLLM-served fine-tuned model, plus the LoRA training |
## Environment & setup
This project uses **[uv](https://docs.astral.sh/uv/)** for all package and
environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).
```bash
uv venv --python 3.12 # create the environment
uv sync # install locked dependencies
uv run python app.py # launch the Gradio app
```
Add dependencies with `uv add <pkg>` (never `pip`). Local dev uses `uv`
(`pyproject.toml` + `uv.lock`); `requirements.txt` is intentionally minimal
(`gradio` + `modal`) because it is only used by the HF Space, a thin client that
calls Modal and holds no model weights.
## Project layout
- `app.py`: Gradio UI (thin client; orchestration + wiring only, zero model weights).
- `core/model_config.py`: single source of truth for the model stack + compute config.
- `core/modal_infra.py`: **all** Modal GPU functions (vision, STT, TTS, translate).
- `core/vision_story.py` · `core/stt.py` · `core/tts.py` · `core/prompts.py`: thin
wrappers that build prompts and call the Modal functions.
- `finetune/`: the Bengali fine-tune pipeline (collect → label → train → merge →
serve → eval). See `finetune/README.md`.
- `assets/styles.css`: the night-sky / storybook theme.