title: Rupkotha
emoji: π
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
- backyard-ai
- openbmb
- modal
- children
- storytelling
- bengali
- tts
- vision-language-model
- track:backyard
- sponsor:openbmb
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
রূপকথা · Rupkotha
A bedtime-story app for kids. A child shows their drawings or toys, asks for a story in their own voice (English or Bengali), and hears it read back in a warm motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).
Inference runs on Modal (cloud GPUs); the Gradio Space is a thin client with zero model weights.
Build Small Hackathon: submission
- Track: Practical · Backyard AI (a custom story generator, the track's own example use case).
- Relevant prizes: OpenBMB Best MiniCPM Build · Modal Best Use.
- Demo video: https://youtu.be/mUUmy5JwBYo
- Social post: https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
- Fine-tuned model: https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha
The idea. Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4-9 year-olds who can't yet type. A child shows a crayon drawing or a toy, asks for a story by voice in English or Bengali, and hears it read aloud in a warm motherly voice: gentle, short, always ending in sleep.
The technical approach.
- Vision → story: MiniCPM-V 4.5 (8B) reads 1-4 images + the request and writes a 120-150-word bedtime fable.
- Native Bengali (our differentiator): the stock model's Bengali was garbled, so we fine-tuned MiniCPM-V itself. We distilled 389 native Bengali stories from a Gemma teacher, LoRA-trained via ms-SWIFT, merged the adapter, and served the result on vLLM. A held-out eval, judged by a Bengali speaker, confirmed it beats the base model decisively.
- Voice: faster-whisper for speech input; VoxCPM2 (English) and AI4Bharat Indic-TTS (Bengali) for the motherly voice output.
- Infra: every model runs on Modal (Ollama + vLLM + TTS containers, scale-to-zero A10G/A100); the HF Space holds zero weights and just calls Modal functions. All models are well under the 32B limit. Switching/serving is driven by one config object (core/model_config.py).
Running this Space: it calls Modal at runtime, so set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET as Space secrets, and deploy core/modal_infra.py + finetune/serve_vllm.py first.
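A missing secret is easier to diagnose at startup than on the first story request. The sketch below is not taken from the repository: it assumes the Space exposes its secrets as environment variables (the usual behaviour on Hugging Face Spaces), and the helper name `missing_modal_secrets` is hypothetical.

```python
import os

# The two credentials the Modal client reads from the environment.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the names of required Modal credentials that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_modal_secrets()
    if missing:
        print("Set these Space secrets before launching:", ", ".join(missing))
```

Calling this once at the top of app.py would let the UI show a clear setup message instead of a failed remote call.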
▶️ Try it
- Pick a language (English / বাংলা) and a story style.
- Upload 1-4 pictures: a crayon drawing, a toy, a photo from the day.
- Type or speak what you'd like (e.g. "a story about my cat"), or leave it blank.
- Hit ✨ Tell me a story, then read it, press play to hear it, and save favourites.
⏱️ The first generation can take a few minutes. The Space scales to zero, so the first request cold-starts the models on Modal (later ones are quick). To see it run smoothly end-to-end, watch the 90-second demo video.
Status
Story (EN + BN), STT, and TTS all run on Modal via core/modal_infra.py. The Bengali
path is served by a fine-tuned MiniCPM-V 4.5 (see below); English uses the stock
model. Every core/ function degrades gracefully to a safe fallback if a model is
unavailable, so the app always shows a story.
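One common way to get this kind of graceful degradation is a small decorator that catches any failure and returns a canned result. The following is a minimal sketch under that assumption, not the code in core/; `with_fallback`, `FALLBACK_STORY`, and both example functions are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger("rupkotha.sketch")

# Hypothetical canned story shown when no model can be reached.
FALLBACK_STORY = "Once upon a time, a sleepy little star closed its eyes. Goodnight."


def with_fallback(fallback):
    """Wrap a function so that any exception yields `fallback` instead of propagating."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Log the full traceback, but never surface it to the child-facing UI.
                logger.exception("%s failed; returning fallback", fn.__name__)
                return fallback
        return wrapper
    return decorator


@with_fallback(FALLBACK_STORY)
def story_from_unreachable_model(request):
    # Stand-in for a remote call that fails because the model is unavailable.
    raise ConnectionError("model unavailable")


@with_fallback(FALLBACK_STORY)
def story_from_working_model(request):
    # Stand-in for a remote call that succeeds.
    return f"A story about {request}."
```

The trade-off is that errors become silent for the user, so logging the exception is what keeps failures visible to the developer.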
The stack
The submission runs a single stack (Stack A, the OpenBMB prize path) defined in
core/model_config.py. (The StackConfig machinery remains so that another stack could be
swapped in, but only Stack A is shipped.)
| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |
The core models (MiniCPM-V and VoxCPM2) are both from the OpenBMB family, which makes the project eligible for the OpenBMB Best MiniCPM Build prize; fine-tuning MiniCPM-V itself for Bengali strengthens that case.
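To illustrate how a single config object can describe the table above, here is a hedged sketch. The real StackConfig in core/model_config.py is not shown in this README, so the field names, the `ModelSpec` helper, and the methods are assumptions; only the model names and parameter counts come from the table, and the 32B figure from the hackathon limit mentioned earlier.

```python
from dataclasses import dataclass
from typing import Tuple

PARAM_LIMIT_B = 32.0  # hackathon per-model limit, in billions of parameters


@dataclass(frozen=True)
class ModelSpec:
    layer: str       # role in the pipeline, e.g. "stt"
    name: str        # human-readable model name
    params_b: float  # approximate size in billions of parameters


@dataclass(frozen=True)
class StackConfig:
    name: str
    models: Tuple[ModelSpec, ...]

    def largest(self) -> ModelSpec:
        """Return the biggest model in the stack."""
        return max(self.models, key=lambda m: m.params_b)

    def total_params_b(self) -> float:
        """Sum of all model sizes, in billions of parameters."""
        return sum(m.params_b for m in self.models)

    def within_limit(self, limit_b: float = PARAM_LIMIT_B) -> bool:
        """True when every individual model is under the size limit."""
        return all(m.params_b < limit_b for m in self.models)


STACK_A = StackConfig(
    name="A",
    models=(
        ModelSpec("vision_story", "MiniCPM-V 4.5", 8.0),
        ModelSpec("stt", "faster-whisper large-v3", 1.55),
        ModelSpec("tts_en", "VoxCPM2", 2.0),
        ModelSpec("tts_bn", "AI4Bharat Indic-TTS", 0.13),
    ),
)
```

With these numbers the largest model is the 8B vision model, and the four together come to roughly 11.7B parameters.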
Infrastructure
All model inference runs on Modal; the Gradio Space holds zero model weights and makes zero local inference calls. Two Modal apps back the app:
- rupkotha (core/modal_infra.py): the base runtime, covering vision/story (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English, AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
- rupkotha-ft-serve (finetune/serve_vllm.py): the Bengali fine-tuned MiniCPM-V, served via vLLM on an A100-40GB (the full bf16 8B + vision encoder needs more than a 24 GB card).
The core/ wrappers call these functions remotely; switching
COMPUTE_LOCATION in model_config.py is the only change needed to run base
inference locally (requires a GPU with 8+ GB VRAM).
uv run modal deploy core/modal_infra.py # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py # Bengali fine-tuned model (vLLM)
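The remote/local switch described above might look roughly like the sketch below. It is an illustration, not the project's code: the accepted values of COMPUTE_LOCATION, the helper names, and the Modal function name "generate_story" are all assumptions. The only grounded pieces are the app name `rupkotha` and the Modal client calls `modal.Function.from_name(...)` and `.remote(...)`, which exist in recent Modal releases (older releases used a different lookup call).

```python
from typing import Callable, Dict

COMPUTE_LOCATION = "modal"  # assumed values: "modal" or "local"


def call_modal(function_name: str, *args, app_name: str = "rupkotha"):
    """Look up a deployed Modal function by name and call it remotely."""
    import modal  # lazy import, so local mode does not need the Modal client

    fn = modal.Function.from_name(app_name, function_name)
    return fn.remote(*args)


def pick_backend(backends: Dict[str, Callable], location: str = COMPUTE_LOCATION) -> Callable:
    """Return the callable registered for the configured compute location."""
    try:
        return backends[location]
    except KeyError:
        raise ValueError(f"unknown COMPUTE_LOCATION: {location!r}") from None


# Hypothetical wiring: "generate_story" is an invented function name, and the
# local backend is a placeholder for real on-device inference.
story_backends = {
    "modal": lambda prompt: call_modal("generate_story", prompt),
    "local": lambda prompt: f"[local model] {prompt}",
}
```

Keeping the lookup behind one function means the wrappers in core/ never need to know where inference actually runs.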
Bengali fine-tune: improving the OpenBMB model itself
The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled,
repetitive Bengali. Rather than swap in a different model, we fine-tuned
MiniCPM-V itself to fix it while keeping the rest of the stack intact. The whole
pipeline lives in finetune/ and runs on Modal:
- Distill native Bengali bedtime stories from a Gemma 3 teacher over ~450 children's drawings, filtered by a purity gate → 389 high-quality examples.
- LoRA fine-tune MiniCPM-V 4.5 with ms-SWIFT on an A100-80GB: vision encoder frozen, LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
- Merge the adapter into standalone weights and serve via vLLM.
- Evaluate FT vs the stock model on held-out drawings (finetune/eval_ft.py): the fine-tune wins decisively, producing coherent native রূপকথা where the stock model gives garbled, looping output, as confirmed by a Bengali speaker.
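The README does not say how the purity gate in the distillation step is defined. One plausible reading, sketched here purely as an assumption, is a script-ratio filter: keep a story only if nearly all of its letters fall in the Unicode Bengali block (U+0980 to U+09FF). The function names and the 0.95 threshold are hypothetical.

```python
import unicodedata

BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def bengali_ratio(text: str) -> float:
    """Fraction of letters and combining marks that fall in the Bengali block.

    Whitespace, digits, and punctuation are ignored, so a danda or a comma does
    not count against the story. Combining marks are included because Bengali
    vowel signs are marks, not letters.
    """
    scripted = [c for c in text if unicodedata.category(c)[0] in ("L", "M")]
    if not scripted:
        return 0.0
    bengali = sum(BENGALI_START <= ord(c) <= BENGALI_END for c in scripted)
    return bengali / len(scripted)


def passes_purity_gate(text: str, threshold: float = 0.95) -> bool:
    """Keep a story only if it is (almost) entirely written in Bengali script."""
    return bengali_ratio(text) >= threshold
```

A filter like this catches code-switching and Latin-script leakage, but it would not detect repetitive or looping text on its own; that would need a separate check.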
Bengali story requests now route to the fine-tuned model
(FINETUNED_VISION_MODEL in model_config.py, one-line revert to None); English
and audio paths are unchanged. See finetune/README.md for the full pipeline.
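The routing rule above can be pictured as a small selection function. This is a sketch under assumptions: the language codes, `BASE_VISION_MODEL`, and `pick_vision_model` are hypothetical, and the value given to FINETUNED_VISION_MODEL here is simply the published Hub ID used as a placeholder, since the actual value in model_config.py is not shown.

```python
from typing import Optional

# Placeholder value; set to None to send Bengali requests back to the stock model.
FINETUNED_VISION_MODEL: Optional[str] = "debrajsingha/minicpm-v45-bengali-rupkotha"

# Hypothetical identifier for the stock model.
BASE_VISION_MODEL = "minicpm-v-4.5"


def pick_vision_model(
    language: str,
    finetuned: Optional[str] = FINETUNED_VISION_MODEL,
    base: str = BASE_VISION_MODEL,
) -> str:
    """Route Bengali ("bn") requests to the fine-tune when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    return base
```

Because the fine-tuned model is looked up through a single optional setting, reverting is a one-line change and the English path is never touched.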
The merged fine-tuned model is published on the Hub:
debrajsingha/minicpm-v45-bengali-rupkotha.
Tracks & sponsor fit
Where this project lines up with the hackathon's tracks and sponsor themes:
| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator, the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal: base runtime + a vLLM-served fine-tuned model, plus the LoRA training |
Environment & setup
This project uses uv for all package and environment management (Python 3.10-3.12; VoxCPM2 requires < 3.13).
uv venv --python 3.12 # create the environment
uv sync # install locked dependencies
uv run python app.py # launch the Gradio app
Add dependencies with uv add <pkg> (never pip). Local dev uses uv
(pyproject.toml + uv.lock); requirements.txt is intentionally minimal
(gradio + modal) because it is only for the HF Space, which is a thin client that
calls Modal and holds no model weights.
Project layout
- app.py: Gradio UI (thin client; orchestration + wiring only, zero model weights).
- core/model_config.py: single source of truth for the model stack + compute config.
- core/modal_infra.py: all Modal GPU functions (vision, STT, TTS, translate).
- core/vision_story.py · core/stt.py · core/tts.py · core/prompts.py: thin wrappers that build prompts and call the Modal functions.
- finetune/: the Bengali fine-tune pipeline (collect → label → train → merge → serve → eval). See finetune/README.md.
- assets/styles.css: the night-sky / storybook theme.