
title: Rupkotha
emoji: 🌙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
  - backyard-ai
  - openbmb
  - modal
  - children
  - storytelling
  - bengali
  - tts
  - vision-language-model
  - track:backyard
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand

রূপকথা · Rupkotha

A bedtime-story app for kids. A child shows their drawings or toys, asks for a story in their own voice (English or Bengali), and hears it read back in a warm motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).

Inference runs on Modal (cloud GPUs) — the Gradio Space is a thin client with zero model weights.

🏆 Build Small Hackathon — submission

Track: Practical · Backyard AI (a custom story generator — the track's own example use case).
Relevant prizes: OpenBMB Best MiniCPM Build · Modal Best Use.
🎬 Demo video: https://youtu.be/mUUmy5JwBYo
📣 Social post: https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
🤗 Fine-tuned model: https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha

The idea. Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds who can't yet type. A child shows a crayon drawing or a toy, asks for a story by voice in English or Bengali, and hears it read aloud in a warm motherly voice — gentle, short, always ending in sleep.

The technical approach.

Vision → story: MiniCPM-V 4.5 (8B) reads 1–4 images + the request and writes a 120–150-word bedtime fable.
Native Bengali (our differentiator): the stock model's Bengali was garbled, so we fine-tuned MiniCPM-V itself. We distilled 389 native Bengali stories from a Gemma 3 teacher, LoRA-trained with ms-SWIFT, merged the adapter, and served the result on vLLM. On held-out drawings, a Bengali speaker confirmed that the fine-tune writes coherent stories where the base model's output was garbled and looping.
Voice: faster-whisper for speech input; VoxCPM2 (English) and AI4Bharat Indic-TTS (Bengali) for the motherly voice output.
Infra: every model runs on Modal (Ollama + vLLM + TTS containers, scale-to-zero A10G/A100); the HF Space holds zero weights and just calls Modal functions. All models are well under the 32B limit. Switching/serving is driven by one config object (core/model_config.py).

Running this Space: it calls Modal at runtime — set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET as Space secrets, and deploy core/modal_infra.py + finetune/serve_vllm.py first.
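As a rough illustration of the thin-client pattern described above, the sketch below checks that both secrets are present and then looks up a deployed Modal function by name. The app name `rupkotha` comes from this README; the helper names and whatever function name is passed in (for example `"generate_story"`) are hypothetical, and the real wrappers in core/ may be organised differently.

```python
import os

REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the names of required Modal credentials that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


def call_modal(function_name, *args, app_name="rupkotha", **kwargs):
    """Look up a deployed Modal function by name and run it remotely.

    `function_name` is whatever the deployed app exposes; the names used
    when calling this helper are assumptions, not taken from the repo.
    """
    missing = missing_modal_secrets()
    if missing:
        raise RuntimeError(f"Set these Space secrets first: {', '.join(missing)}")

    # Imported lazily so the credential check works even without the SDK.
    import modal

    fn = modal.Function.from_name(app_name, function_name)
    return fn.remote(*args, **kwargs)
```

Failing early on missing secrets gives a clearer error in the Space logs than a failed remote call would.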

▶️ Try it

Pick a language (English / বাংলা) and a story style.
Upload 1–4 pictures — a crayon drawing, a toy, a photo from the day.
Type or speak what you'd like (e.g. "a story about my cat") — or leave it blank.
Hit ✨ Tell me a story → read it, press play to hear it, and save favourites.

⏱️ The first generation can take a few minutes. The Space scales to zero, so the first request cold-starts the models on Modal (later ones are quick). To see it run smoothly end-to-end, watch the 90-second demo video.

Status

Story (EN + BN), STT, and TTS all run on Modal via core/modal_infra.py. The Bengali path is served by a fine-tuned MiniCPM-V 4.5 (see below); English uses the stock model. Every core/ function degrades gracefully to a safe fallback if a model is unavailable, so the app always shows a story.
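One way to get the "always shows a story" behaviour is a small decorator around each remote call. This is a minimal sketch under assumptions: `with_fallback`, `FALLBACK_STORY`, and the `backend` argument are illustrative names, not the actual code in core/.

```python
import functools
import logging

logger = logging.getLogger("rupkotha.fallback")

# Hypothetical canned story shown when the model cannot be reached.
FALLBACK_STORY = (
    "Once upon a time, a sleepy little moon tucked the stars into bed. Goodnight."
)


def with_fallback(fallback_value):
    """Decorator: return fallback_value if the wrapped call raises an exception."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Log the failure so it stays visible, then degrade gracefully.
                logger.exception("%s failed; using fallback", func.__name__)
                return fallback_value

        return wrapper

    return decorator


@with_fallback(FALLBACK_STORY)
def generate_story(prompt, backend):
    """Call the supplied backend; any failure falls through to the canned story."""
    return backend(prompt)
```

Logging the exception before returning the fallback keeps outages visible in the logs even though the child still gets a story.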

The stack

The submission runs a single stack — Stack A, the OpenBMB prize path — defined in core/model_config.py. (The StackConfig machinery remains so a stack could be swapped in, but only Stack A is shipped.)

Layer	Model	~Params
Vision + story	MiniCPM-V 4.5 (stock via Ollama for English; Bengali fine-tune via vLLM)	8B
STT	faster-whisper large-v3	1.55B
English TTS	VoxCPM2 (Voice Design)	2B
Bengali TTS	AI4Bharat Indic-TTS (FastPitch + HiFi-GAN)	~0.13B

The core models (MiniCPM-V and VoxCPM2) both come from the OpenBMB family, which makes the project eligible for the OpenBMB Best MiniCPM Build prize. Fine-tuning MiniCPM-V itself for Bengali strengthens that case.
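The table above can be written down as data and checked against the hackathon's 32B size limit. The sketch below is illustrative only: the `Layer` dataclass and the `STACK_A` tuple are assumed names, not the actual StackConfig in core/model_config.py, and it treats the limit as applying to each model separately.

```python
from dataclasses import dataclass

PARAM_LIMIT_B = 32.0  # hackathon size limit, in billions of parameters


@dataclass(frozen=True)
class Layer:
    role: str
    model: str
    params_b: float  # approximate parameter count, in billions


# Values copied from the stack table; the field names are assumptions.
STACK_A = (
    Layer("vision_story", "MiniCPM-V 4.5", 8.0),
    Layer("stt", "faster-whisper large-v3", 1.55),
    Layer("tts_en", "VoxCPM2", 2.0),
    Layer("tts_bn", "AI4Bharat Indic-TTS", 0.13),
)


def oversized(stack, limit_b=PARAM_LIMIT_B):
    """Return the models whose parameter count exceeds the limit."""
    return [layer.model for layer in stack if layer.params_b > limit_b]


def total_params_b(stack):
    """Sum of approximate parameter counts across the stack, in billions."""
    return sum(layer.params_b for layer in stack)
```

With these figures `oversized(STACK_A)` is empty, and even the whole stack added together comes to roughly 11.7B.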

Infrastructure

All model inference runs on Modal: the Gradio Space holds zero model weights and makes zero local inference calls. Two Modal apps back the Space:

rupkotha (core/modal_infra.py) — the base runtime: vision/story (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English, AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
rupkotha-ft-serve (finetune/serve_vllm.py) — the Bengali fine-tuned MiniCPM-V, served via vLLM on an A100-40GB (the full bf16 8B + vision encoder needs more than a 24 GB card).

The core/ wrappers call these functions remotely; switching COMPUTE_LOCATION in model_config.py is the only change needed to run base inference locally (requires a GPU with 8+ GB VRAM).

uv run modal deploy core/modal_infra.py        # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py     # Bengali fine-tuned model (vLLM)
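The COMPUTE_LOCATION switch mentioned above could be implemented as a simple dispatch table. This is a sketch under assumptions: the values `"modal"` and `"local"` and the `pick_backend` helper are hypothetical, and the two stub backends stand in for the real remote and local inference calls.

```python
from typing import Callable, Dict

COMPUTE_LOCATION = "modal"  # assumed values: "modal" or "local"


def pick_backend(
    backends: Dict[str, Callable[[str], str]],
    location: str = COMPUTE_LOCATION,
) -> Callable[[str], str]:
    """Return the backend registered for the given compute location."""
    try:
        return backends[location]
    except KeyError:
        raise ValueError(
            f"Unknown COMPUTE_LOCATION {location!r}; expected one of {sorted(backends)}"
        ) from None


# Stub backends for illustration; real ones would call Modal or a local GPU.
BACKENDS = {
    "modal": lambda prompt: "remote:" + prompt,
    "local": lambda prompt: "local:" + prompt,
}

story_backend = pick_backend(BACKENDS)
```

Keeping the choice in one lookup means the wrappers themselves never need to know where inference runs.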

Bengali fine-tune — improving the OpenBMB model itself

The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled, repetitive Bengali. Rather than swap in a different model, we fine-tuned MiniCPM-V itself to fix it while keeping the rest of the stack intact. The whole pipeline lives in finetune/ and runs on Modal:

Distill native Bengali bedtime stories from a Gemma 3 teacher over ~450 children's drawings, filtered by a purity gate → 389 high-quality examples.
LoRA fine-tune MiniCPM-V 4.5 with ms-SWIFT on an A100-80GB, with the vision encoder frozen and LoRA (r=16) applied only to the LLM self-attention, since the language model is where the Bengali weakness lies.
Merge the adapter into standalone weights and serve via vLLM.
Evaluate the fine-tune against the stock model on held-out drawings (finetune/eval_ft.py): a Bengali speaker confirmed that the fine-tune produces coherent native রূপকথা where the stock model produces garbled, looping output.
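The README does not spell out how the purity gate in the distillation step works. One plausible reading, shown here purely as an assumption, is a script-ratio filter that keeps a story only if nearly all of its letters fall in the Bengali Unicode block. The function names and the 0.95 threshold are hypothetical.

```python
import unicodedata

BENGALI_START, BENGALI_END = 0x0980, 0x09FF  # Unicode Bengali block


def bengali_purity(text):
    """Fraction of letter and combining-mark characters in the Bengali block.

    Digits, whitespace, and punctuation are ignored, so they neither help
    nor hurt the score.
    """
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    bengali = sum(BENGALI_START <= ord(ch) <= BENGALI_END for ch in relevant)
    return bengali / len(relevant)


def passes_purity_gate(text, threshold=0.95):
    """Keep a story only if its Bengali ratio meets the (assumed) threshold."""
    return bengali_purity(text) >= threshold
```

Combining marks are counted alongside letters because Bengali vowel signs are encoded as marks. A filter like this rejects stories in which the teacher slipped into English or another script.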

Bengali story requests now route to the fine-tuned model, selected by FINETUNED_VISION_MODEL in model_config.py (setting it to None reverts to the stock model in one line). English and audio paths are unchanged. See finetune/README.md for the full pipeline.
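The routing rule can be sketched in a few lines. FINETUNED_VISION_MODEL is the name used in this README. The stock-model identifier, the `"bn"` language code, and the `select_vision_model` helper are assumptions made for illustration.

```python
from typing import Optional

STOCK_VISION_MODEL = "minicpm-v-4.5"  # hypothetical identifier for the stock model
FINETUNED_VISION_MODEL: Optional[str] = "debrajsingha/minicpm-v45-bengali-rupkotha"


def select_vision_model(
    language: str,
    finetuned: Optional[str] = FINETUNED_VISION_MODEL,
) -> str:
    """Route Bengali requests to the fine-tune when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    # English, or Bengali with the fine-tune disabled, uses the stock model.
    return STOCK_VISION_MODEL
```

Setting `FINETUNED_VISION_MODEL = None` sends every request back to the stock model, which is the one-line revert described above.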

📦 The merged fine-tuned model is published on the Hub: debrajsingha/minicpm-v45-bengali-rupkotha.

Tracks & sponsor fit

Where this project lines up with the hackathon's tracks and sponsor themes:

Track / sponsor	How it fits
Backyard AI track	A custom story generator — the track's own example use case
OpenBMB (MiniCPM)	Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali
Modal	All inference runs on Modal — base runtime + a vLLM-served fine-tuned model, plus the LoRA training

Environment & setup

This project uses uv for all package and environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).

uv venv --python 3.12      # create the environment
uv sync                    # install locked dependencies
uv run python app.py       # launch the Gradio app

Add dependencies with uv add <pkg> (never pip). Local dev uses uv (pyproject.toml + uv.lock); requirements.txt is intentionally minimal (gradio + modal) — it's only for the HF Space, which is a thin client that calls Modal and holds no model weights.

Project layout

app.py — Gradio UI (thin client; orchestration + wiring only, zero model weights).
core/model_config.py — single source of truth for the model stack + compute config.
core/modal_infra.py — all Modal GPU functions (vision, STT, TTS, translate).
core/vision_story.py · core/stt.py · core/tts.py · core/prompts.py — thin wrappers that build prompts and call the Modal functions.
finetune/ — the Bengali fine-tune pipeline (collect → label → train → merge → serve → eval). See finetune/README.md.
assets/styles.css — the night-sky / storybook theme.