metadata
title: Rupkotha
emoji: 🌙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 6.17.3
app_file: app.py
pinned: false
short_description: Bedtime stories from kids' drawings, in English & Bengali
tags:
  - backyard-ai
  - openbmb
  - modal
  - children
  - storytelling
  - bengali
  - tts
  - vision-language-model
  - track:backyard
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand

রূপকথা · Rupkotha

A bedtime-story app for kids. A child shows their drawings or toys, asks for a story in their own voice (English or Bengali), and hears it read back in a warm motherly voice. Built for the Build Small Hackathon (Track 1: Backyard AI).

Inference runs on Modal (cloud GPUs); the Gradio Space is a thin client with zero model weights.

🏆 Build Small Hackathon: submission

The idea. Rupkotha (রূপকথা, "fairy tale") is a bedtime-story app for 4–9 year-olds who can't yet type. A child shows a crayon drawing or a toy, asks for a story by voice in English or Bengali, and hears it read aloud in a warm motherly voice: gentle, short, always ending in sleep.

The technical approach.

  • Vision → story: MiniCPM-V 4.5 (8B) reads 1–4 images + the request and writes a 120–150-word bedtime fable.
  • Native Bengali (our differentiator): the stock model's Bengali was garbled, so we fine-tuned MiniCPM-V itself: distilled 389 native Bengali stories from a Gemma teacher, LoRA-trained via ms-SWIFT, merged, and served on vLLM. A held-out evaluation, judged by a Bengali speaker, confirmed that it clearly beats the base model.
  • Voice: faster-whisper for speech input; VoxCPM2 (English) and AI4Bharat Indic-TTS (Bengali) for the motherly voice output.
  • Infra: every model runs on Modal (Ollama + vLLM + TTS containers, scale-to-zero A10G/A100); the HF Space holds zero weights and just calls Modal functions. All models are well under the 32B limit. Switching/serving is driven by one config object (core/model_config.py).

Running this Space: it calls Modal at runtime, so set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET as Space secrets, and deploy core/modal_infra.py + finetune/serve_vllm.py first.
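Since a missing secret only shows up when the first Modal call fails, a start-up check can surface the problem earlier. The following is a minimal sketch, not code from this repository; the helper name `missing_modal_secrets` is hypothetical, and only the two variable names come from the text above.

```python
import os

# The two Space secrets named in this README.
REQUIRED_SECRETS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_secrets(env=None):
    """Return the names of required Modal secrets that are unset or empty.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real credentials.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_modal_secrets()
    if missing:
        print("Missing Space secrets:", ", ".join(missing))
```

Running this before launching the UI would let the Space show a clear configuration message instead of a failed story request.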

▢️ Try it

  1. Pick a language (English / বাংলা) and a story style.
  2. Upload 1–4 pictures: a crayon drawing, a toy, a photo from the day.
  3. Type or speak what you'd like (e.g. "a story about my cat"), or leave it blank.
  4. Hit ✨ Tell me a story → read it, press play to hear it, and save favourites.

⏱️ The first generation can take a few minutes. The Space scales to zero, so the first request cold-starts the models on Modal (later ones are quick). To see it run smoothly end-to-end, watch the 90-second demo video.

Status

Story (EN + BN), STT, and TTS all run on Modal via core/modal_infra.py. The Bengali path is served by a fine-tuned MiniCPM-V 4.5 (see below); English uses the stock model. Every core/ function degrades gracefully to a safe fallback if a model is unavailable, so the app always shows a story.
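The graceful-degradation behaviour described above could look roughly like the sketch below. It is an illustration under assumptions, not the project's actual wrapper: `generate_story`, `remote_call`, and `FALLBACK_STORY` are hypothetical names, and the remote function is passed in as a plain callable so the sketch does not depend on any particular Modal API.

```python
import logging

logger = logging.getLogger("rupkotha.sketch")

# Hypothetical canned story used when no model is reachable.
FALLBACK_STORY = (
    "Once upon a time, a sleepy little star yawned, curled up in a cloud, "
    "and drifted off to sleep."
)


def generate_story(remote_call, images, request, fallback=FALLBACK_STORY):
    """Call the remote story function; return a safe fallback on failure.

    `remote_call` is any callable taking (images, request) and returning text.
    """
    try:
        story = remote_call(images, request)
    except Exception:
        # Any remote error (cold-start timeout, missing deployment, ...) is
        # logged and swallowed so the UI still has something to show.
        logger.exception("story generation failed; using fallback")
        return fallback
    if not isinstance(story, str) or not story.strip():
        # An empty or malformed reply is treated the same as a failure.
        return fallback
    return story.strip()
```

Treating an empty reply like an exception keeps the rule simple: the caller always receives non-empty text.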

The stack

The submission runs a single stack (Stack A, the OpenBMB prize path) defined in core/model_config.py. (The StackConfig machinery remains so a stack could be swapped in, but only Stack A is shipped.)

| Layer | Model | ~Params |
|---|---|---|
| Vision + story | MiniCPM-V 4.5 (Bengali fine-tuned), via Ollama / vLLM | 8B |
| STT | faster-whisper large-v3 | 1.55B |
| English TTS | VoxCPM2 (Voice Design) | 2B |
| Bengali TTS | AI4Bharat Indic-TTS (FastPitch + HiFi-GAN) | ~0.13B |

The core models (MiniCPM-V + VoxCPM2) are all from the OpenBMB family, which makes the project eligible for the OpenBMB Best MiniCPM Build prize; fine-tuning MiniCPM-V itself for Bengali strengthens the case.
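To make the "well under the 32B limit" claim checkable, the stack can be written down as data. This is a sketch only: the real StackConfig in core/model_config.py is not shown in this README, so `LayerSpec`, `STACK_A`, and `over_limit` are hypothetical names, and the parameter counts are the approximate figures from the table above.

```python
from dataclasses import dataclass

PARAM_LIMIT_B = 32.0  # per-model cap quoted in this README, in billions


@dataclass(frozen=True)
class LayerSpec:
    role: str        # what the layer does in the pipeline
    model: str       # model name as listed in the stack table
    params_b: float  # approximate size in billions of parameters


# Stack A, transcribed from the table above.
STACK_A = (
    LayerSpec("vision_story", "MiniCPM-V 4.5", 8.0),
    LayerSpec("stt", "faster-whisper large-v3", 1.55),
    LayerSpec("tts_en", "VoxCPM2", 2.0),
    LayerSpec("tts_bn", "AI4Bharat Indic-TTS", 0.13),
)


def over_limit(stack, limit_b=PARAM_LIMIT_B):
    """Return the models in `stack` whose size reaches or exceeds the cap."""
    return [layer.model for layer in stack if layer.params_b >= limit_b]


# Combined size of all four models, in billions of parameters.
total_b = sum(layer.params_b for layer in STACK_A)
```

With these figures no single model comes close to the cap, and even the four models combined are about 11.7B parameters.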

Infrastructure

All model inference runs on Modal; the Gradio Space holds zero model weights and makes no local inference calls. Two Modal apps back it:

  • rupkotha (core/modal_infra.py): the base runtime, covering vision/story (MiniCPM-V 4.5 via Ollama), STT (faster-whisper), TTS (VoxCPM2 for English, AI4Bharat Indic-TTS for Bengali), and the IndicTrans2 translation path.
  • rupkotha-ft-serve (finetune/serve_vllm.py): the Bengali fine-tuned MiniCPM-V, served via vLLM on an A100-40GB (the full bf16 8B + vision encoder needs more than a 24 GB card).
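The 24 GB remark can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate, not a measurement: it counts only the 8B language-model weights at 2 bytes per parameter (bf16) and leaves the vision encoder, KV cache, activations, and runtime overhead as the unknown remainder.

```python
def weight_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB (10^9 bytes).

    bf16 stores 2 bytes per parameter, so N billion parameters take
    roughly 2 * N GB before any cache or activation memory is counted.
    """
    return params_billion * bytes_per_param


llm_weights_gb = weight_gb(8)        # 8B parameters in bf16
headroom_24 = 24 - llm_weights_gb    # left on a 24 GB card
headroom_40 = 40 - llm_weights_gb    # left on an A100-40GB

print(f"weights: {llm_weights_gb} GB, "
      f"headroom on 24 GB: {headroom_24} GB, on 40 GB: {headroom_40} GB")
```

The weights alone take about 16 GB, so a 24 GB card leaves roughly 8 GB for everything else, while the A100-40GB leaves about 24 GB.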

The core/ wrappers call these functions remotely; switching COMPUTE_LOCATION in model_config.py is the only change needed to run base inference locally (requires a GPU with 8+ GB VRAM).

uv run modal deploy core/modal_infra.py        # base runtime (EN vision, STT, TTS)
uv run modal deploy finetune/serve_vllm.py     # Bengali fine-tuned model (vLLM)

Bengali fine-tune: improving the OpenBMB model itself

The one weak spot was native Bengali: stock MiniCPM-V 4.5 produced garbled, repetitive Bengali. Rather than swap in a different model, we fine-tuned MiniCPM-V itself to fix it while keeping the rest of the stack intact. The whole pipeline lives in finetune/ and runs on Modal:

  1. Distill native Bengali bedtime stories from a Gemma 3 teacher over ~450 children's drawings, filtered by a purity gate → 389 high-quality examples.
  2. LoRA fine-tune MiniCPM-V 4.5 with ms-SWIFT on an A100-80GB: vision encoder frozen, LoRA (r=16) on the LLM self-attention only (the weak-Bengali part).
  3. Merge the adapter into standalone weights and serve via vLLM.
  4. Evaluate FT vs the stock model on held-out drawings (finetune/eval_ft.py): the fine-tune wins decisively (coherent native রূপকথা vs garbled, looping output), as confirmed by a Bengali speaker.
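The README does not say what the purity gate in step 1 checks. One plausible reading is a script-purity filter that rejects teacher outputs containing too much non-Bengali text; the sketch below implements that reading only. The function names and the 0.95 threshold are hypothetical.

```python
import unicodedata


def bengali_purity(text):
    """Fraction of letters and combining marks that are in the Bengali block.

    Digits, punctuation, and whitespace are ignored so that ordinary
    sentence punctuation does not lower the score.
    """
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    bengali = sum(1 for ch in relevant if "\u0980" <= ch <= "\u09ff")
    return bengali / len(relevant)


def passes_purity_gate(text, threshold=0.95):
    """Keep a distilled story only if it is almost entirely Bengali script."""
    return bengali_purity(text) >= threshold
```

A filter like this would drop stories that drift into English or another script mid-sentence; it says nothing about fluency, which is why the pipeline above still relies on a human evaluation.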

Bengali story requests now route to the fine-tuned model (FINETUNED_VISION_MODEL in model_config.py, one-line revert to None); English and audio paths are unchanged. See finetune/README.md for the full pipeline.

📦 The merged fine-tuned model is published on the Hub: debrajsingha/minicpm-v45-bengali-rupkotha.
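The Bengali routing and its one-line revert could be sketched as follows. Only the name FINETUNED_VISION_MODEL comes from this README; its value here (the Hub ID), the base-model identifier, the language codes "bn" and "en", and `pick_vision_model` are all assumptions for illustration.

```python
# Assumed value; set to None to send Bengali back to the stock model.
FINETUNED_VISION_MODEL = "debrajsingha/minicpm-v45-bengali-rupkotha"

# Hypothetical identifier for the stock model.
BASE_VISION_MODEL = "minicpm-v-4.5-base"


def pick_vision_model(language, finetuned=FINETUNED_VISION_MODEL,
                      base=BASE_VISION_MODEL):
    """Route Bengali requests to the fine-tune when one is configured."""
    if language == "bn" and finetuned is not None:
        return finetuned
    # English, and Bengali with the fine-tune disabled, use the stock model.
    return base
```

Because the decision depends on a single setting, reverting is a configuration change rather than a code change.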

Tracks & sponsor fit

Where this project lines up with the hackathon's tracks and sponsor themes:

| Track / sponsor | How it fits |
|---|---|
| Backyard AI track | A custom story generator, the track's own example use case |
| OpenBMB (MiniCPM) | Core models are MiniCPM-V 4.5 + VoxCPM2; MiniCPM-V was also fine-tuned for Bengali |
| Modal | All inference runs on Modal: base runtime + a vLLM-served fine-tuned model, plus the LoRA training |

Environment & setup

This project uses uv for all package and environment management (Python 3.10–3.12; VoxCPM2 requires < 3.13).

uv venv --python 3.12      # create the environment
uv sync                    # install locked dependencies
uv run python app.py       # launch the Gradio app

Add dependencies with uv add <pkg> (never pip). Local dev uses uv (pyproject.toml + uv.lock); requirements.txt is intentionally minimal (gradio + modal) because it is only for the HF Space, which is a thin client that calls Modal and holds no model weights.

Project layout

  • app.py: Gradio UI (thin client; orchestration + wiring only, zero model weights).
  • core/model_config.py: single source of truth for the model stack + compute config.
  • core/modal_infra.py: all Modal GPU functions (vision, STT, TTS, translate).
  • core/vision_story.py · core/stt.py · core/tts.py · core/prompts.py: thin wrappers that build prompts and call the Modal functions.
  • finetune/: the Bengali fine-tune pipeline (collect → label → train → merge → serve → eval). See finetune/README.md.
  • assets/styles.css: the night-sky / storybook theme.