	---
	title: Rupkotha
	emoji: 🌙
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	sdk_version: 6.17.3
	app_file: app.py
	pinned: false
	short_description: Bedtime stories from kids' drawings, in English & Bengali
	tags:
	- backyard-ai
	- openbmb
	- modal
	- children
	- storytelling
	- bengali
	- tts
	- vision-language-model
	- track:backyard
	- sponsor:openbmb
	- sponsor:modal
	- achievement:offgrid
	- achievement:welltuned
	- achievement:offbrand
	---

	# রূপকথা · Rupkotha

	A bedtime-story app for kids. A child shows their drawings or toys, asks for a
	story in their own voice (English or Bengali), and hears it read back in a warm
	motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).

	Inference runs on Modal (cloud GPUs) — the Gradio Space is a thin client with
	zero model weights.

	## 🏆 Build Small Hackathon — submission

	- Track: Practical · Backyard AI (a custom story generator — the track's own example use case).
	- Relevant prizes: OpenBMB Best MiniCPM Build · Modal Best Use.
	- 🎬 Demo video: https://youtu.be/mUUmy5JwBYo
	- 📣 Social post: https://www.linkedin.com/posts/debrajsingha_backyardai-gradio-edtech-share-7472381644763467776-8f3Z
	- 🤗 Fine-tuned model: https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha

	The idea. Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds
	who can't yet type. A child shows a crayon drawing or a toy, asks for a story **by
	voice in English or Bengali**, and hears it read aloud in a warm motherly
	voice — gentle, short, always ending in sleep.

	The technical approach.
	- Vision → story: `MiniCPM-V 4.5` (8B) reads 1–4 images + the request and writes
	a 120–150-word bedtime fable.
	- Native Bengali (our differentiator): the stock model's Bengali was garbled, so
	we fine-tuned MiniCPM-V itself: distilled 389 native Bengali stories from a
	Gemma teacher, LoRA-trained via ms-SWIFT, merged the adapter, and served the
	result on vLLM. On held-out drawings, a Bengali speaker judged the fine-tune's
	stories clearly better than the base model's.
	- Voice: `faster-whisper` for speech input; VoxCPM2 (English) and **AI4Bharat
	Indic-TTS** (Bengali) for the motherly voice output.
	- Infra: every model runs on Modal (Ollama + vLLM + TTS containers, scale-to-zero
	A10G/A100); the HF Space holds zero weights and just calls Modal functions. All
	models are well under the 32B limit. Switching/serving is driven by one config object
	(`core/model_config.py`).

	> Running this Space: it calls Modal at runtime — set `MODAL_TOKEN_ID` and
	> `MODAL_TOKEN_SECRET` as Space secrets, and deploy `core/modal_infra.py` +
	> `finetune/serve_vllm.py` first.
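
A missing secret is easiest to catch before the first Modal call. The following is a minimal sketch of such a startup check, not code from this repository: the helper name `missing_modal_secrets` is hypothetical, and it assumes the Space secrets are exposed to the app as environment variables.

```python
import os

# The two secrets named above; Space secrets are assumed to arrive as env vars.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the names of required Modal secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_modal_secrets()
    if missing:
        print("Missing Space secrets:", ", ".join(missing))
    else:
        print("Modal credentials found.")
```

Calling the helper once at startup lets the UI show a clear configuration message instead of failing on the first story request.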

	## ▶️ Try it

	1. Pick a language (English / বাংলা) and a story style.
	2. Upload 1–4 pictures — a crayon drawing, a toy, a photo from the day.
	3. Type or speak what you'd like (e.g. "a story about my cat") — or leave it blank.
	4. Hit ✨ Tell me a story → read it, press play to hear it, and save favourites.

	⏱️ The first generation can take a few minutes. The Modal backends scale to zero,
	so the first request cold-starts the models (later ones are quick). To see it run
	smoothly end-to-end, watch the [90-second demo video](https://youtu.be/mUUmy5JwBYo).

	## Status

	Story (EN + BN), STT, and TTS all run on Modal via `core/modal_infra.py`. The Bengali
	path is served by a fine-tuned MiniCPM-V 4.5 (see below); English uses the stock
	model. Every `core/` function degrades gracefully to a safe fallback if a model is
	unavailable, so the app always shows a story.
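
The graceful-degradation behaviour described above can be sketched as a small wrapper. This is an illustration under assumptions, not the actual `core/` code: `with_fallback` and `FALLBACK_STORY` are hypothetical names, and the real fallback text is not shown in this README.

```python
from typing import Callable

# Hypothetical fallback text, used only for this illustration.
FALLBACK_STORY = "The moon hummed a soft song, and everyone drifted off to sleep."


def with_fallback(generate: Callable[..., str],
                  fallback: str = FALLBACK_STORY) -> Callable[..., str]:
    """Wrap a story generator so errors or empty output yield the fallback."""
    def wrapper(*args, **kwargs) -> str:
        try:
            story = generate(*args, **kwargs)
        except Exception:
            # Any failure (model unavailable, timeout, ...) falls back.
            return fallback
        # Treat None, empty, or whitespace-only output as a failure too.
        return story if story and story.strip() else fallback
    return wrapper
```

Wrapping each remote call this way keeps the failure handling in one place, so the UI code never has to special-case a missing model.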

	## The stack

	The submission runs a single stack, Stack A (the OpenBMB prize path), defined in
	`core/model_config.py`. The `StackConfig` machinery remains so that another stack
	could be swapped in, but only Stack A is shipped.

	| Layer | Model | ~Params |
	|---|---|---|
	| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
	| STT | faster-whisper large-v3 | 1.55B |
	| English TTS | VoxCPM2 (Voice Design) | 2B |
	| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |

	The core models (MiniCPM-V and VoxCPM2) both come from the OpenBMB family, which
	makes the project eligible for the OpenBMB Best MiniCPM Build prize. Fine-tuning
	MiniCPM-V itself for Bengali strengthens that case.
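
One way to hold the stack in a single config object is a frozen dataclass. The sketch below is an assumption about the shape of such an object, not the contents of `core/model_config.py`: the field names, `STACK_A`, and `ACTIVE_STACK` are hypothetical, and only the model names come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StackConfig:
    """One model per layer; field names are illustrative assumptions."""
    vision_model: str
    stt_model: str
    tts_english: str
    tts_bengali: str


# Model names taken from the stack table.
STACK_A = StackConfig(
    vision_model="MiniCPM-V 4.5",
    stt_model="faster-whisper large-v3",
    tts_english="VoxCPM2",
    tts_bengali="AI4Bharat Indic-TTS",
)

# The rest of the app would read only this name, so swapping stacks is one line.
ACTIVE_STACK = STACK_A
```

A frozen dataclass prevents accidental mutation at runtime; a variant stack can still be derived with `dataclasses.replace(STACK_A, stt_model=...)`.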

	## Infrastructure

	All model inference runs on Modal — the Gradio Space holds zero model weights
	and makes zero local inference calls. Two Modal apps back the app:

	- `rupkotha` (`core/modal_infra.py`) — the base runtime: vision/story
	(MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English,
	AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
	- `rupkotha-ft-serve` (`finetune/serve_vllm.py`) — the Bengali fine-tuned
	MiniCPM-V, served via vLLM on an A100-40GB (the full bf16 8B + vision
	encoder needs more than a 24 GB card).

	The `core/` wrappers call these functions remotely; switching
	`COMPUTE_LOCATION` in `model_config.py` is the only change needed to run base
	inference locally (requires a GPU with 8+ GB VRAM).
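
The `COMPUTE_LOCATION` switch can be pictured as a lookup from location to backend. The following is a hedged sketch rather than the project's wrapper code: the values `"modal"` and `"local"`, the helper `pick_backend`, and both stand-in backend functions are assumptions made for illustration.

```python
from typing import Callable, Dict

# Assumed values for illustration: "modal" (remote) or "local".
COMPUTE_LOCATION = "modal"


def pick_backend(backends: Dict[str, Callable[..., str]],
                 location: str = COMPUTE_LOCATION) -> Callable[..., str]:
    """Return the callable registered for the given compute location."""
    if location not in backends:
        known = ", ".join(sorted(backends))
        raise ValueError(
            f"Unknown compute location {location!r}; expected one of: {known}"
        )
    return backends[location]


# Stand-ins for the real remote and local inference calls.
def run_on_modal(prompt: str) -> str:
    return f"[modal] {prompt}"


def run_locally(prompt: str) -> str:
    return f"[local] {prompt}"


BACKENDS = {"modal": run_on_modal, "local": run_locally}
```

In a Modal-backed wrapper, the `"modal"` entry would typically look up a deployed function (for example with `modal.Function.from_name`) and call its `.remote(...)` method; the function names inside the `rupkotha` app are not listed in this README, so they are left out here.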

	```bash
	uv run modal deploy core/modal_infra.py # base runtime (EN vision, STT, TTS)
	uv run modal deploy finetune/serve_vllm.py # Bengali fine-tuned model (vLLM)
	```

	## Bengali fine-tune — improving the OpenBMB model itself

	The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled,
	repetitive Bengali. Rather than swap in a different model, we **fine-tuned
	MiniCPM-V itself** to fix it while keeping the rest of the stack intact. The whole
	pipeline lives in `finetune/` and runs on Modal:

	1. Distill native Bengali bedtime stories from a Gemma 3 teacher over ~450
	children's drawings, filtered by a purity gate → 389 high-quality examples.
	2. LoRA fine-tune MiniCPM-V 4.5 with ms-SWIFT on an A100-80GB — vision
	encoder frozen, LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
	3. Merge the adapter into standalone weights and serve via vLLM.
	4. Evaluate FT vs the stock model on held-out drawings (`finetune/eval_ft.py`):
	the fine-tune wins decisively — coherent native রূপকথা vs garbled, looping output
	— confirmed by a Bengali speaker.
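
The purity gate in step 1 is not specified in this README. As one plausible reading, the sketch below scores a story by the share of its letters that fall in the Bengali Unicode block (U+0980 to U+09FF) and keeps it above a threshold. The function names and the 0.9 threshold are assumptions, not the project's actual filter.

```python
def bengali_purity(text: str) -> float:
    """Fraction of alphabetic characters that are in the Bengali block."""
    # str.isalpha() skips digits, punctuation, spaces, and combining vowel signs.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return bengali / len(letters)


def passes_purity_gate(text: str, threshold: float = 0.9) -> bool:
    """Keep a story only if it is almost entirely Bengali script.

    The 0.9 threshold is a hypothetical value chosen for illustration.
    """
    return bengali_purity(text) >= threshold
```

A gate like this rejects teacher outputs that drift into English or mixed script, a common failure when distilling into a lower-resource language.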

	Bengali story requests now route to the fine-tuned model
	(`FINETUNED_VISION_MODEL` in `model_config.py`, one-line revert to `None`); English
	and audio paths are unchanged. See `finetune/README.md` for the full pipeline.
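
The routing rule can be sketched in a few lines. This is an illustration under assumptions: the language codes `"bn"` and `"en"`, the `BASE_VISION_MODEL` identifier, and the helper `select_vision_model` are hypothetical; only `FINETUNED_VISION_MODEL` and its `None` revert come from the text above.

```python
from typing import Optional

# Hypothetical identifier for the stock model.
BASE_VISION_MODEL = "minicpm-v-4.5"

# Set to None to send Bengali requests back to the stock model.
FINETUNED_VISION_MODEL: Optional[str] = "debrajsingha/minicpm-v45-bengali-rupkotha"


def select_vision_model(language: str, finetuned: Optional[str]) -> str:
    """Use the fine-tune for Bengali when configured, else the base model."""
    if language == "bn" and finetuned is not None:
        return finetuned
    return BASE_VISION_MODEL
```

Passing the configured value in as an argument keeps the function pure, which makes the `None` revert easy to exercise in a test.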

	📦 The merged fine-tuned model is published on the Hub:
	[`debrajsingha/minicpm-v45-bengali-rupkotha`](https://huggingface.co/debrajsingha/minicpm-v45-bengali-rupkotha).

	## Tracks & sponsor fit

	Where this project lines up with the hackathon's tracks and sponsor themes:

	| Track / sponsor | How it fits |
	|---|---|
	| Backyard AI track | A custom story generator, the track's own example use case |
	| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
	| Modal | All inference runs on Modal: the base runtime and a vLLM-served fine-tuned model, plus the LoRA training |

	## Environment & setup

	This project uses [uv](https://docs.astral.sh/uv/) for all package and
	environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).

	```bash
	uv venv --python 3.12 # create the environment
	uv sync # install locked dependencies
	uv run python app.py # launch the Gradio app
	```

	Add dependencies with `uv add <pkg>` (never `pip`). Local dev uses `uv`
	(`pyproject.toml` + `uv.lock`); `requirements.txt` is intentionally minimal
	(`gradio` + `modal`) — it's only for the HF Space, which is a thin client that
	calls Modal and holds no model weights.

	## Project layout

	- `app.py` — Gradio UI (thin client; orchestration + wiring only, zero model weights).
	- `core/model_config.py` — single source of truth for the model stack + compute config.
	- `core/modal_infra.py` — all Modal GPU functions (vision, STT, TTS, translate).
	- `core/vision_story.py` · `core/stt.py` · `core/tts.py` · `core/prompts.py` — thin
	wrappers that build prompts and call the Modal functions.
	- `finetune/` — the Bengali fine-tune pipeline (collect → label → train → merge →
	serve → eval). See `finetune/README.md`.
	- `assets/styles.css` — the night-sky / storybook theme.