---
title: aMuseMe
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "6.18.0"
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
tags:
  - whisper
  - minicpm
  - outlines
  - sd-turbo
  - lyric-video
  - kinetic-typography
  - audio
  - music
  - hackathon
  - build-small
  - track:wood
  - sponsor:openbmb
  - sponsor:openai
  - achievement:offgrid
  - achievement:fieldnotes
short_description: AI-powered kinetic typography lyric video generator
---
# 🎵 aMuseMe: AI Lyric Video Generator

**Drop a song. Watch your lyrics come alive.**

aMuseMe turns a raw audio file into a fully synchronized, AI-generated kinetic typography lyric video, complete with word-level timing, mood-matched animations, and cinematic storyboard backgrounds. No manual keyframing. No lyrics needed. Just music in, video out.

> Built for the [Hugging Face Build Small Hackathon](https://huggingface.co/build-small-hackathon), **Track 2: An Adventure in Thousand Token Wood**

---
## How It Works

**Three AI models and a rendering stage work together in a pipeline:**

| Stage | Model | Params | What it does |
|-------|-------|--------|-------------|
| 1. Listen | **Whisper large-v3** | ~1.55B | Transcribes audio with word-level timestamps (start/end for every word) |
| 2. Direct | **MiniCPM5-1B** + Outlines | ~1B | Segments words into display lines; picks mood-matched frame animations (zoom, pan, flash, fade) via JSON-schema-enforced structured generation |
| 3. Illustrate | **SD-Turbo** | ~865M | Generates cinematic storyboard backgrounds from lyrics + user style prompt in a single diffusion step |
| 4. Render | Pillow + FFmpeg | n/a | Renders 1280×720 frames at 30 fps with word-by-word highlights and cross-fade transitions, then pipes them directly to an H.264 encoder |

**Total: ~3.4B parameters** (1.55B + 1B + 0.865B), well within the 32B hackathon limit.

---
## Features

- **Audio-only input**: just upload a song, no lyrics needed
- **Word-level sync**: each word lights up precisely as it's sung, not line-by-line
- **LLM-directed line breaks**: MiniCPM5-1B decides where lines split for maximum readability and dramatic pacing
- **Mood-matched animations**: the LLM picks zoom, pan, flash, or fade effects per line based on lyrical mood
- **AI storyboard backgrounds**: SD-Turbo paints a unique backdrop for every pair of lyric lines
- **3 visual themes**: Dark (white text), Light (warm gold), Neon (cyan glow)
- **4 font families**: Sans Serif, Sans Serif Bold, Serif Bold, Monospace Bold
- **Structured generation**: Outlines guarantees valid JSON from the LLM every time (no parsing failures)
- **VAD filtering**: Voice Activity Detection prevents hallucinated lyrics during instrumental breaks
- **Demucs vocal separation**: optional vocal isolation for songs with heavy instrumentation
- **ZeroGPU compatible**: `@spaces.GPU` decorators for efficient shared-GPU execution on HF Spaces
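
The word-level sync described above comes down to a lookup from frame number to the word being sung. The following is a minimal sketch, not the code in `app.py`: it assumes word timings arrive as `(text, start, end)` values in seconds and uses the 30 fps frame rate from the pipeline table. The `Word` type and the `active_word_index` helper are hypothetical names introduced for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence

FPS = 30  # frame rate stated in the pipeline table


class Word(NamedTuple):  # hypothetical container for one transcribed word
    text: str
    start: float  # seconds
    end: float    # seconds


def active_word_index(words: Sequence[Word], frame: int, fps: int = FPS) -> Optional[int]:
    """Return the index of the word being sung at `frame`, or None in a gap."""
    t = frame / fps
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word whose start time is <= t
    if i >= 0 and t < words[i].end:
        return i
    return None


# Illustrative timings only; real values would come from the transcription stage.
words = [Word("breaking", 1.0, 1.4), Word("all", 1.4, 1.6), Word("of", 1.6, 1.7)]
highlighted = active_word_index(words, frame=45)  # frame 45 is t = 1.5 s
```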

---

## Architecture Decisions

### Why Whisper large-v3?

We started with `whisper-base` (74M), but its word-level timestamps were inaccurate on songs with fast vocals. `large-v3` (~1.55B) gives near-perfect word boundaries. We also run it through `faster-whisper` (CTranslate2), which is up to 4× faster than the original OpenAI implementation.
### Why MiniCPM5-1B + Outlines (not rule-based)?

Rule-based line splitting (by silence gaps) produces mechanical, unnatural breaks. An LLM understands phrase structure: it knows "breaking all of these chains" should stay together rather than split after "of". MiniCPM5-1B is small enough to run alongside Whisper and SD-Turbo on a single GPU. **Outlines** enforces a Pydantic JSON schema at the token level, so the model *cannot* produce output that violates the schema, which eliminates JSON parsing failures.
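
To make the idea of a line-segmentation schema concrete, here is a rough sketch of what such a plan could look like and how it might be checked. It is an illustration under assumptions, not the project's actual schema: the field names (`first_word`, `last_word`, `animation`) and the `parse_plan` helper are hypothetical, and plain dataclasses stand in for Pydantic so the example runs without dependencies. A schema can constrain the shape of the JSON, but a rule such as "every word is used exactly once, in order" spans several items, so a small post-check like this one may still be worthwhile.

```python
import json
from dataclasses import dataclass
from typing import List

ANIMATIONS = {"zoom", "pan", "flash", "fade"}  # effects named in this README


@dataclass
class DisplayLine:  # hypothetical shape of one line in the LLM's plan
    first_word: int  # index of the first word in the line (inclusive)
    last_word: int   # index of the last word in the line (inclusive)
    animation: str   # one of ANIMATIONS


def parse_plan(raw_json: str, num_words: int) -> List[DisplayLine]:
    """Parse a JSON plan and check that its lines tile the word list exactly."""
    lines = [DisplayLine(**item) for item in json.loads(raw_json)["lines"]]
    expected = 0
    for line in lines:
        if line.animation not in ANIMATIONS:
            raise ValueError(f"unknown animation: {line.animation}")
        if line.first_word != expected or line.last_word < line.first_word:
            raise ValueError("lines must cover the words in order, without gaps")
        expected = line.last_word + 1
    if expected != num_words:
        raise ValueError("plan does not cover every word")
    return lines


# Illustrative input: the word list and the plan are made up for this example.
words = ["breaking", "all", "of", "these", "chains", "tonight"]
raw = (
    '{"lines": ['
    '{"first_word": 0, "last_word": 4, "animation": "zoom"}, '
    '{"first_word": 5, "last_word": 5, "animation": "fade"}]}'
)
plan = parse_plan(raw, len(words))
texts = [" ".join(words[p.first_word : p.last_word + 1]) for p in plan]
```

Here `texts` keeps "breaking all of these chains" on one display line, which is the behaviour the paragraph above asks of the LLM.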
### Why SD-Turbo (not SDXL)?

SD-Turbo generates each image in **a single diffusion step** (`guidance_scale=0.0`). For a 3-minute song with 15 storyboard images, that's ~2 seconds total. SDXL would need 20-50 steps per image, which means minutes instead of seconds.
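
The arithmetic behind that comparison can be sketched in a few lines. This is a back-of-the-envelope illustration: the 30-line song is an assumed example chosen to match the 15 images mentioned above, and the helper names are hypothetical.

```python
import math


def num_storyboards(num_lines: int, lines_per_image: int = 2) -> int:
    """One background per pair of lyric lines, rounding up for an odd final line."""
    return math.ceil(num_lines / lines_per_image)


def storyboard_for_line(line_index: int, lines_per_image: int = 2) -> int:
    """Index of the background shown behind a given lyric line."""
    return line_index // lines_per_image


def total_denoising_steps(num_images: int, steps_per_image: int) -> int:
    """Total diffusion steps needed to generate every background."""
    return num_images * steps_per_image


images = num_storyboards(30)                    # 30 lines is an assumed example
turbo_steps = total_denoising_steps(images, 1)  # single-step SD-Turbo
sdxl_steps = (
    total_denoising_steps(images, 20),          # low end of the 20-50 step range
    total_denoising_steps(images, 50),          # high end of the range
)
```

Step count is only a rough proxy for runtime, since the cost of a single step differs between models and resolutions.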
### Why FFmpeg stdin pipe (not MoviePy)?

We stream raw RGB bytes directly to FFmpeg through the subprocess's stdin. This avoids writing thousands of temporary image files to disk, a massive I/O bottleneck on cloud runners. With no intermediate files to read back, the assembly step adds almost no time.
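
A minimal sketch of the stdin-pipe approach is shown below. It is an illustration rather than the project's encoder: the function names are hypothetical, the FFmpeg flags are one common way to feed raw `rgb24` frames to `libx264`, and muxing the audio track is left out. It assumes `ffmpeg` is on the `PATH`.

```python
import subprocess
from typing import Iterable, List

WIDTH, HEIGHT, FPS = 1280, 720, 30  # values from the pipeline table


def ffmpeg_command(out_path: str, width: int = WIDTH, height: int = HEIGHT,
                   fps: int = FPS) -> List[str]:
    """Build an FFmpeg command that reads raw RGB frames from stdin and encodes H.264."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",        # input is headerless raw video
        "-pix_fmt", "rgb24",     # 3 bytes per pixel, matching Pillow's RGB mode
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",               # read the input from stdin
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",   # widely compatible output pixel format
        out_path,
    ]


def encode(frames: Iterable[bytes], out_path: str) -> None:
    """Stream frames to FFmpeg; each frame must be WIDTH * HEIGHT * 3 bytes."""
    frame_size = WIDTH * HEIGHT * 3
    proc = subprocess.Popen(ffmpeg_command(out_path), stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != frame_size:
                raise ValueError("unexpected frame size")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()  # signals end of input so FFmpeg can finish the file
        proc.wait()
```

With Pillow, a frame in `RGB` mode can be turned into the expected byte string with `image.tobytes()`, so nothing has to touch the disk between rendering and encoding.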
### VAD + Condition on Previous Text

We found that `condition_on_previous_text=True` dramatically improves word accuracy (Whisper uses previously transcribed text as context), but it causes infinite hallucination loops during instrumental breaks. VAD (Voice Activity Detection) with aggressive thresholds (2 s minimum silence, 50 ms minimum speech) solves this by filtering out non-vocal sections before they reach Whisper.
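
The settings described above might be passed to `faster-whisper` roughly as follows. This is a sketch under assumptions, not the call in `app.py`: the helper names are hypothetical, and the device and `compute_type` values are guesses for a CUDA GPU. The keyword arguments themselves (`word_timestamps`, `condition_on_previous_text`, `vad_filter`, `vad_parameters`) are options of `WhisperModel.transcribe`.

```python
from typing import Any, Dict, Iterator, Tuple


def transcribe_options(min_silence_s: float = 2.0, min_speech_ms: int = 50) -> Dict[str, Any]:
    """Keyword arguments for faster-whisper's WhisperModel.transcribe()."""
    return {
        "word_timestamps": True,
        "condition_on_previous_text": True,
        "vad_filter": True,
        "vad_parameters": {
            "min_silence_duration_ms": int(min_silence_s * 1000),
            "min_speech_duration_ms": min_speech_ms,
        },
    }


def transcribe_words(audio_path: str) -> Iterator[Tuple[str, float, float]]:
    """Yield (word, start, end) tuples. Needs faster-whisper and a CUDA GPU."""
    # Imported lazily so the options above can be inspected without the library.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions for a typical CUDA setup.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **transcribe_options())
    for segment in segments:
        for word in segment.words:
            yield word.word, word.start, word.end


options = transcribe_options()
```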

---

## Tech Stack

| Package | Purpose |
|---------|---------|
| `faster-whisper` | Word-level transcription (large-v3, ~1.55B) |
| `transformers` + `accelerate` | MiniCPM5-1B for line segmentation + animation |
| `outlines` | JSON schema enforcement for structured LLM output |
| `diffusers` | SD-Turbo for AI storyboard backgrounds |
| `pillow` | Frame rendering (1280×720 @ 30 fps) |
| `gradio` 6.18 | Web UI with custom CSS |
| `spaces` | HF ZeroGPU decorator |
| `demucs` | Optional vocal separation |
| FFmpeg (system) | Video encoding via stdin pipe |

---
## Hackathon Merit Badges

| Badge | Status |
|-------|--------|
| **Off the Grid** | ✅ No cloud APIs: everything runs on local models |
| **Off-Brand** | ✅ Custom dark glassmorphic UI with gradient headers and custom fonts |
| **Field Notes** | ✅ Blog post documenting architecture and learnings |

---
## Run Locally

```bash
# Clone
git clone https://huggingface.co/spaces/Blazestorm001/aMuseMe
cd aMuseMe

# Install (requires Python 3.10+, FFmpeg, and a CUDA GPU)
pip install -r requirements.txt

# Launch
python app.py
```

Or use `uv`:

```bash
uv run gradio app.py
```

---
## Credits

- Sample music from [Pixabay](https://pixabay.com/music/) (royalty-free)
- Built with ❤️ for the [HF Build Small Hackathon 2026](https://huggingface.co/build-small-hackathon)

## Output Song Video

https://youtu.be/GBOrS2fsQ2E

## App Demo Video

https://youtu.be/6RJwgFu6LHQ

## Tested On

NVIDIA GeForce RTX 5060 Ti (16 GB)

## Social Media Post

https://dev.to/blazestorm/amuseme-when-small-models-compose-a-visual-symphony-50fc