	---
	title: aMuseMe
	emoji: 🎵
	colorFrom: purple
	colorTo: blue
	sdk: gradio
	sdk_version: "6.18.0"
	app_file: app.py
	python_version: "3.10"
	pinned: false
	license: mit
	tags:
	- whisper
	- minicpm
	- outlines
	- sd-turbo
	- lyric-video
	- kinetic-typography
	- audio
	- music
	- hackathon
	- build-small
	- track:wood
	- sponsor:openbmb
	- sponsor:openai
	- achievement:offgrid
	- achievement:fieldnotes
	short_description: AI-powered kinetic typography lyric video generator
	---

	# 🎵 aMuseMe — AI Lyric Video Generator

	Drop a song. Watch your lyrics come alive.

	aMuseMe turns a raw audio file into a fully synchronized, AI-generated kinetic typography lyric video — complete with word-level timing, mood-matched animations, and cinematic storyboard backgrounds. No manual keyframing. No lyrics needed. Just music in, video out.

	> Built for the [Hugging Face Build Small Hackathon](https://huggingface.co/build-small-hackathon) — Track 2: An Adventure in Thousand Token Wood 🍄

	---

	## ✨ How It Works



	Four stages run as a pipeline; the first three are driven by AI models:

	| Stage | Model | Params | What it does |
	|-------|-------|--------|--------------|
	| 1. Listen | Whisper large-v3 | ~1.55B | Transcribes audio with word-level timestamps (start/end for every word) |
	| 2. Direct | MiniCPM5-1B + Outlines | ~1B | Segments words into display lines; picks mood-matched frame animations (zoom, pan, flash, fade) via JSON-schema-enforced structured generation |
	| 3. Illustrate | SD-Turbo | ~865M | Generates cinematic storyboard backgrounds from lyrics + user style prompt in a single diffusion step |
	| 4. Render | Pillow + FFmpeg | n/a | Renders 1280×720 frames at 30fps with word-by-word highlights and cross-fade transitions, then pipes them directly to an H.264 encode |

	Total: ~3.4B parameters (1.55B + 1B + 0.865B), well within the 32B hackathon limit.

	---

	## 🎬 Features

	- 🎤 Audio-only input — just upload a song, no lyrics needed
	- 🔤 Word-level sync — each word lights up precisely as it's sung, not line-by-line
	- 🧠 LLM-directed line breaks — MiniCPM5-1B decides where lines split for maximum readability and dramatic pacing
	- 🎬 Mood-matched animations — the LLM picks zoom, pan, flash, or fade effects per line based on lyrical mood
	- 🎨 AI storyboard backgrounds — SD-Turbo paints a unique backdrop for every pair of lyric lines
	- 🌈 3 visual themes — Dark (white text), Light (warm gold), Neon (cyan glow)
	- 🔤 4 font families — Sans Serif, Sans Serif Bold, Serif Bold, Monospace Bold
	- ⚡ Structured generation — Outlines guarantees valid JSON from the LLM every time (no parsing failures)
	- 🔇 VAD filtering — Voice Activity Detection prevents hallucinated lyrics during instrumental breaks
	- 🎸 Demucs vocal separation — optional vocal isolation for songs with heavy instrumentation
	- 🔌 ZeroGPU compatible — `@spaces.GPU` decorators for efficient shared-GPU execution on HF Spaces
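
Word-level sync comes down to a lookup from frame time to the word being sung. The following is a minimal sketch, assuming words arrive as `(text, start, end)` records in seconds, sorted by start time. The `Word` type and the `active_word_index` and `frame_time` helpers are hypothetical names for illustration, not the project's actual code.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence

FPS = 30  # frame rate stated in the pipeline table


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def active_word_index(words: Sequence[Word], t: float) -> Optional[int]:
    """Return the index of the word being sung at time t, or None in a gap."""
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word whose start <= t
    if i >= 0 and t < words[i].end:
        return i
    return None


def frame_time(frame_index: int, fps: int = FPS) -> float:
    """Timestamp in seconds of a given video frame."""
    return frame_index / fps


# Hypothetical timings for three words of a lyric.
words = [Word("breaking", 0.0, 0.4), Word("all", 0.5, 0.7), Word("of", 0.7, 0.8)]
highlighted = active_word_index(words, frame_time(18))  # frame 18 is at 0.6 s
```

A renderer could call `active_word_index` once per frame and draw that word in the theme's highlight color, leaving the rest of the line dimmed. Returning `None` during gaps keeps nothing highlighted between words.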

	---

	## 🏗️ Architecture Decisions

	### Why Whisper large-v3?
	We started with `whisper-base` (74M) but word-level timestamp accuracy was poor for songs with fast vocals. `large-v3` (~1.55B) gives near-perfect word boundaries. We also use `faster-whisper` (CTranslate2), which runs up to 4× faster than the original OpenAI implementation.

	### Why MiniCPM5-1B + Outlines (not rule-based)?
	Rule-based line splitting (by silence gaps) produces mechanical, unnatural breaks. An LLM understands phrase structure — it knows "breaking all of these chains" should stay together, not split after "of". MiniCPM5-1B is small enough to run alongside Whisper and SD-Turbo on a single GPU. Outlines enforces a Pydantic JSON schema at the token level, so the model cannot produce invalid output — eliminating JSON parsing failures entirely.
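
To make the structured-output idea concrete, here is a sketch of what a line plan might look like and how it could be checked after generation. The field names (`start_word`, `end_word`, `animation`) and the `parse_plan` helper are assumptions; the README only states that a Pydantic schema is enforced. Plain dataclasses stand in for Pydantic here so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class Animation(str, Enum):
    # The four effects named in this README.
    ZOOM = "zoom"
    PAN = "pan"
    FLASH = "flash"
    FADE = "fade"


@dataclass(frozen=True)
class LinePlan:
    start_word: int  # index of the first word on the line (inclusive)
    end_word: int    # index of the last word on the line (inclusive)
    animation: Animation


def parse_plan(raw_json: str, n_words: int) -> List[LinePlan]:
    """Parse the LLM's JSON and check that the lines tile the word list."""
    plans: List[LinePlan] = []
    next_start = 0
    for item in json.loads(raw_json)["lines"]:
        plan = LinePlan(int(item["start_word"]), int(item["end_word"]),
                        Animation(item["animation"]))  # ValueError if unknown
        if plan.start_word != next_start or plan.end_word < plan.start_word:
            raise ValueError(f"line does not follow the previous one: {plan}")
        next_start = plan.end_word + 1
        plans.append(plan)
    if next_start != n_words:
        raise ValueError("plan does not cover every word")
    return plans


# Hypothetical model output for a five-word phrase kept on one line.
raw = '{"lines": [{"start_word": 0, "end_word": 4, "animation": "zoom"}]}'
plan = parse_plan(raw, n_words=5)
```

Schema enforcement guarantees the JSON is well formed and that `animation` is one of the allowed values. Whether the lines cover every word exactly once is a semantic property that a JSON schema cannot express, so a post-check like the one in `parse_plan` is still worth having.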

	### Why SD-Turbo (not SDXL)?
	SD-Turbo generates images in a single diffusion step (`guidance_scale=0.0`). For a 3-minute song with 15 storyboard images, that's ~2 seconds total. SDXL would need 20-50 steps per image, which means minutes instead of seconds.
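
The step-count argument can be worked through in a few lines. This sketch counts denoising steps only and assumes a hypothetical 30-line song, which yields the 15 images mentioned above at one image per pair of lines. The `generate_backdrop` function shows one way to make the single-step call with `diffusers`; the `stabilityai/sd-turbo` checkpoint ID is an assumption, since the README does not name the exact checkpoint.

```python
import math


def storyboard_count(n_lines: int, lines_per_image: int = 2) -> int:
    """One backdrop per pair of lyric lines, rounding up for an odd line."""
    return math.ceil(n_lines / lines_per_image)


def total_denoising_steps(n_images: int, steps_per_image: int) -> int:
    return n_images * steps_per_image


n_images = storyboard_count(30)                      # 30 lines -> 15 images
turbo_steps = total_denoising_steps(n_images, 1)     # 15
sdxl_steps = (total_denoising_steps(n_images, 20),   # 300
              total_denoising_steps(n_images, 50))   # 750


def generate_backdrop(prompt: str):
    """Single-step SD-Turbo call; needs torch, diffusers, and a CUDA GPU."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    # One step with guidance disabled, matching the settings in this README.
    return pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
```

In a real app the pipeline would be loaded once and reused across all images rather than rebuilt inside the function on every call.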

	### Why FFmpeg stdin pipe (not MoviePy)?
	We stream raw RGB bytes directly to FFmpeg via subprocess stdin. This avoids writing thousands of temp image files to disk — a massive I/O bottleneck on cloud runners. The entire assembly step is near-instantaneous.
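
A minimal sketch of the stdin-pipe approach follows, assuming frames are already rendered as raw RGB byte strings. The `build_ffmpeg_cmd` and `encode` helpers are hypothetical names, and audio muxing is left out for brevity.

```python
import subprocess
from typing import Iterable, List

WIDTH, HEIGHT, FPS = 1280, 720, 30  # values from the pipeline table


def build_ffmpeg_cmd(out_path: str) -> List[str]:
    """FFmpeg command that reads raw RGB frames from stdin and encodes H.264."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS),
        "-i", "-",                      # "-" means read input from stdin
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        out_path,
    ]


def encode(frames: Iterable[bytes], out_path: str) -> None:
    """Stream frames to FFmpeg; each frame must be WIDTH*HEIGHT*3 bytes."""
    proc = subprocess.Popen(build_ffmpeg_cmd(out_path), stdin=subprocess.PIPE)
    assert proc.stdin is not None
    for frame in frames:
        proc.stdin.write(frame)  # e.g. PIL.Image.tobytes() for an RGB image
    proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError("ffmpeg exited with an error")


FRAME_BYTES = WIDTH * HEIGHT * 3   # 2,764,800 bytes per raw frame
frames_for_3_min = 3 * 60 * FPS    # 5,400 frames for a 3-minute song
```

At 30fps a 3-minute song is 5,400 frames, which is the "thousands of temp image files" that piping avoids. Passing a generator to `encode` keeps only one frame in memory at a time.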

	### VAD + Condition on Previous Text
	We found that `condition_on_previous_text=True` dramatically improves word accuracy (Whisper uses previous lines as context), but causes infinite hallucination loops during instrumental breaks. VAD (Voice Activity Detection) with aggressive silence thresholds (2s min silence, 50ms min speech) solves this by muting non-vocal sections before they reach Whisper.
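
These settings could be expressed as keyword arguments to `faster-whisper`'s `WhisperModel.transcribe()`. This is a sketch rather than the project's actual call: mapping "2s min silence" and "50ms min speech" onto `min_silence_duration_ms` and `min_speech_duration_ms` is an assumption, as are the `device` and `compute_type` choices and the `transcribe_words` helper name.

```python
# Keyword arguments for faster-whisper's WhisperModel.transcribe().
TRANSCRIBE_KWARGS = {
    "word_timestamps": True,             # start/end for every word
    "condition_on_previous_text": True,  # use earlier text as context
    "vad_filter": True,                  # drop non-speech audio first
    "vad_parameters": {
        "min_silence_duration_ms": 2000,  # the "2s min silence" above
        "min_speech_duration_ms": 50,     # the "50ms min speech" above
    },
}


def transcribe_words(audio_path: str):
    """Return (word, start, end) tuples; needs faster-whisper and a CUDA GPU."""
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **TRANSCRIBE_KWARGS)
    # segments is a lazy generator; iterating it runs the transcription.
    return [
        (w.word, w.start, w.end)
        for segment in segments
        for w in (segment.words or [])
    ]
```

Keeping the options in one dictionary makes it easy to toggle `condition_on_previous_text` or the VAD thresholds when comparing transcripts on tracks with long instrumental breaks.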

	---

	## 🧰 Tech Stack

	| Package | Purpose |
	|---------|---------|
	| `faster-whisper` | Word-level transcription (large-v3, ~1.55B) |
	| `transformers` + `accelerate` | MiniCPM5-1B for line segmentation + animation |
	| `outlines` | JSON schema enforcement for structured LLM output |
	| `diffusers` | SD-Turbo for AI storyboard backgrounds |
	| `pillow` | Frame rendering (1280×720 @ 30fps) |
	| `gradio` 6.18 | Web UI with custom CSS |
	| `spaces` | HF ZeroGPU decorator |
	| `demucs` | Optional vocal separation |
	| FFmpeg (system) | Video encoding via stdin pipe |

	---

	## 🏅 Hackathon Merit Badges

	| Badge | Status |
	|-------|--------|
	| 🔌 Off the Grid | ✅ No cloud APIs; everything runs on local models |
	| 🎨 Off-Brand | ✅ Custom dark glassmorphic UI with gradient headers, custom fonts |
	| 📓 Field Notes | ✅ Blog post documenting architecture and learnings |

	---

	## 🚀 Run Locally

	```bash
	# Clone
	git clone https://huggingface.co/spaces/Blazestorm001/aMuseMe
	cd aMuseMe

	# Install (requires Python 3.10+ to match the Space's python_version, plus FFmpeg and a CUDA GPU)
	pip install -r requirements.txt

	# Launch
	python app.py
	```

	Or use `uv`:
	```bash
	uv run gradio app.py
	```
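
Before launching, a quick preflight check can confirm the prerequisites listed above. This is a hypothetical helper, not part of the repository, and it assumes `torch` is installed by `requirements.txt`.

```python
import shutil
from typing import List


def missing_prerequisites() -> List[str]:
    """List prerequisites from this README that are not found on this machine."""
    missing: List[str] = []
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg (not on PATH)")
    try:
        import torch  # assumed to come from requirements.txt

        if not torch.cuda.is_available():
            missing.append("CUDA GPU (torch.cuda.is_available() is False)")
    except ImportError:
        missing.append("torch (run: pip install -r requirements.txt)")
    return missing


problems = missing_prerequisites()
print("\n".join(problems) if problems else "All prerequisites found.")
```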

	---

	## 📜 Credits

	- Sample music from [Pixabay](https://pixabay.com/music/) (royalty-free)
	- Built with ❤️ for the [HF Build Small Hackathon 2026](https://huggingface.co/build-small-hackathon)

	## 🎞️ Output Song Video
	https://youtu.be/GBOrS2fsQ2E

	## 📺 App Demo Video
	https://youtu.be/6RJwgFu6LHQ

	## 🖥️ Tested On
	NVIDIA GeForce RTX 5060 Ti (16 GB)

	## 📣 Social Media Post
	https://dev.to/blazestorm/amuseme-when-small-models-compose-a-visual-symphony-50fc