metadata

title: aMuseMe
emoji: 🎵
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: '3.10'
pinned: false
license: mit
tags:
  - whisper
  - minicpm
  - outlines
  - sd-turbo
  - lyric-video
  - kinetic-typography
  - audio
  - music
  - hackathon
  - build-small
  - track:wood
  - sponsor:openbmb
  - sponsor:openai
  - achievement:offgrid
  - achievement:fieldnotes
short_description: AI-powered kinetic typography lyric video generator

🎵 aMuseMe — AI Lyric Video Generator

Drop a song. Watch your lyrics come alive.

aMuseMe turns a raw audio file into a fully synchronized, AI-generated kinetic typography lyric video — complete with word-level timing, mood-matched animations, and cinematic storyboard backgrounds. No manual keyframing. No lyrics needed. Just music in, video out.

Built for the Hugging Face Build Small Hackathon — Track 2: An Adventure in Thousand Token Wood 🍄

✨ How It Works

Four AI models work together in a pipeline:

| Stage | Model | Params | What it does |
|---|---|---|---|
| 1. Listen | Whisper large-v3 | ~1.55B | Transcribes audio with word-level timestamps (start/end for every word) |
| 2. Direct | MiniCPM5-1B + Outlines | ~1B | Segments words into display lines; picks mood-matched frame animations (zoom, pan, flash, fade) via JSON-schema-enforced structured generation |
| 3. Illustrate | SD-Turbo | ~865M | Generates cinematic storyboard backgrounds from lyrics + user style prompt in a single diffusion step |
| 4. Render | Pillow + FFmpeg | n/a | Renders 1280×720 frames at 30fps with word-by-word highlights, cross-fade transitions, and pipes directly to H.264 |

Total: ~3.5B parameters — well within the 32B hackathon limit.

🎬 Features

🎤 Audio-only input — just upload a song, no lyrics needed
🔤 Word-level sync — each word lights up precisely as it's sung, not line-by-line
🧠 LLM-directed line breaks — MiniCPM5-1B decides where lines split for maximum readability and dramatic pacing
🎬 Mood-matched animations — the LLM picks zoom, pan, flash, or fade effects per line based on lyrical mood
🎨 AI storyboard backgrounds — SD-Turbo paints a unique backdrop for every pair of lyric lines
🌈 3 visual themes — Dark (white text), Light (warm gold), Neon (cyan glow)
🔤 4 font families — Sans Serif, Sans Serif Bold, Serif Bold, Monospace Bold
⚡ Structured generation — Outlines guarantees valid JSON from the LLM every time (no parsing failures)
🔇 VAD filtering — Voice Activity Detection prevents hallucinated lyrics during instrumental breaks
🎸 Demucs vocal separation — optional vocal isolation for songs with heavy instrumentation
🔌 ZeroGPU compatible — @spaces.GPU decorators for efficient shared-GPU execution on HF Spaces
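
Word-level sync comes down to a lookup from frame time to the word being sung. The snippet below is a minimal sketch of that lookup, assuming words arrive as (text, start, end) tuples sorted by start time. The function name and the sample timings are illustrative and are not taken from app.py.

```python
from bisect import bisect_right

FPS = 30  # frame rate stated in the pipeline table

def active_word_index(words, t):
    """Return the index of the word being sung at time t (seconds), or None.

    `words` is a list of (text, start, end) tuples sorted by start time.
    """
    starts = [start for _, start, _ in words]
    i = bisect_right(starts, t) - 1  # last word that started at or before t
    if i >= 0 and t < words[i][2]:
        return i
    return None  # before the first word, or in a gap between words

# Hypothetical timings for one lyric line.
words = [("breaking", 0.50, 0.95), ("all", 1.00, 1.20), ("of", 1.20, 1.35),
         ("these", 1.35, 1.70), ("chains", 1.80, 2.60)]

frame = 33             # frame number within the video
t = frame / FPS        # 1.1 seconds into the song
highlighted = active_word_index(words, t)  # index of "all"
```

A renderer would call this once per frame and draw the returned word in the highlight colour. In a real loop the `starts` list would be built once rather than on every call.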

🏗️ Architecture Decisions

Why Whisper large-v3?

We started with whisper-base (74M), but its word-level timestamps were poor on songs with fast vocals. large-v3 (~1.55B) gives near-perfect word boundaries. We also run it through faster-whisper (CTranslate2), which is reported to be up to 4× faster than the original OpenAI implementation.

Why MiniCPM5-1B + Outlines (not rule-based)?

Rule-based line splitting (by silence gaps) produces mechanical, unnatural breaks. An LLM understands phrase structure — it knows "breaking all of these chains" should stay together, not split after "of". MiniCPM5-1B is small enough to run alongside Whisper and SD-Turbo on a single GPU. Outlines enforces a Pydantic JSON schema at the token level, so the model cannot produce invalid output — eliminating JSON parsing failures entirely.
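
To make the contrast concrete, here is a sketch of the rule-based baseline described above: a splitter that starts a new line whenever the pause before a word exceeds a threshold. The 0.3 s threshold, the function name, and the timings are assumptions chosen for illustration. This is the approach the project moved away from, not the LLM-directed one it uses.

```python
def split_on_gaps(words, max_gap=0.3):
    """Rule-based baseline: break the line when the pause before a word exceeds max_gap seconds.

    `words` is a list of (text, start, end) tuples in sung order.
    """
    lines, current = [], []
    prev_end = None
    for text, start, end in words:
        if current and start - prev_end > max_gap:
            lines.append(current)
            current = []
        current.append(text)
        prev_end = end
    if current:
        lines.append(current)
    return [" ".join(line) for line in lines]

# Hypothetical timings: the singer takes a breath after "of".
words = [("breaking", 0.50, 0.95), ("all", 1.00, 1.20), ("of", 1.25, 1.40),
         ("these", 1.90, 2.20), ("chains", 2.25, 3.00)]

lines = split_on_gaps(words)
# lines == ["breaking all of", "these chains"]
```

The splitter only sees timing, so a breath in the middle of a phrase is indistinguishable from the end of one. That is the failure mode an LLM with access to the words themselves can avoid.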

Why SD-Turbo (not SDXL)?

SD-Turbo generates images in a single diffusion step (guidance_scale=0.0). For a 3-minute song with 15 storyboard images, that's ~2 seconds total. SDXL would need 20-50 steps per image — minutes instead of seconds.
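
The step-count gap can be worked through in a few lines. This sketch assumes a 30-line song (which yields the 15 images mentioned above at one image per pair of lines) and uses the single-step call documented on the SD-Turbo model card. Helper names are hypothetical, and step counts are only a rough proxy for wall-clock time because the cost per step differs between models.

```python
import math

def images_needed(n_lines, lines_per_image=2):
    """One backdrop per pair of lyric lines, rounding up for an odd line count."""
    return math.ceil(n_lines / lines_per_image)

def total_denoising_steps(n_images, steps_per_image):
    return n_images * steps_per_image

def generate_backdrop(prompt):
    """Single-step SD-Turbo call; needs diffusers, torch, and a CUDA GPU.

    Loads the pipeline on every call for brevity; a real app would load it once.
    """
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    return pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]

n_images = images_needed(30)  # assumed line count that gives 15 images
turbo_steps = total_denoising_steps(n_images, 1)
sdxl_steps = (total_denoising_steps(n_images, 20),
              total_denoising_steps(n_images, 50))
# 15 steps for SD-Turbo versus 300 to 750 for a 20-50 step sampler
```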

Why FFmpeg stdin pipe (not MoviePy)?

We stream raw RGB bytes directly to FFmpeg via subprocess stdin. This avoids writing thousands of temp image files to disk — a massive I/O bottleneck on cloud runners. The entire assembly step is near-instantaneous.
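
A minimal sketch of the stdin-pipe approach follows, assuming packed RGB24 frames at the 1280×720, 30fps settings from the pipeline table. The function names and the exact flag set are assumptions rather than the project's actual command, and audio muxing is left out.

```python
import subprocess

WIDTH, HEIGHT, FPS = 1280, 720, 30   # values from the pipeline table
FRAME_BYTES = WIDTH * HEIGHT * 3     # 3 bytes per pixel for packed RGB24

def ffmpeg_command(out_path):
    """Build an FFmpeg argv that reads raw RGB frames from stdin and encodes H.264."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",       # describe the incoming bytes
        "-s", f"{WIDTH}x{HEIGHT}", "-r", str(FPS),
        "-i", "-",                                   # "-" means read from stdin
        "-c:v", "libx264", "-pix_fmt", "yuv420p",    # widely playable H.264 output
        out_path,
    ]

def encode(frames, out_path):
    """Stream an iterable of raw frames (bytes of length FRAME_BYTES) to FFmpeg."""
    proc = subprocess.Popen(ffmpeg_command(out_path), stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != FRAME_BYTES:
                raise ValueError(f"expected {FRAME_BYTES} bytes, got {len(frame)}")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()   # signals end of input so FFmpeg can finish the file
        proc.wait()
    if proc.returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {proc.returncode}")
```

With Pillow, calling `tobytes()` on an RGB-mode image of the right size produces a frame in this layout. Each frame is about 2.76 MB, so a 3-minute video at 30fps is 5,400 frames that never touch the disk as individual files.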

VAD + Condition on Previous Text

We found that condition_on_previous_text=True dramatically improves word accuracy (Whisper uses previous lines as context), but causes infinite hallucination loops during instrumental breaks. VAD (Voice Activity Detection) with aggressive silence thresholds (2s min silence, 50ms min speech) solves this by muting non-vocal sections before they reach Whisper.
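
In faster-whisper these settings map onto keyword arguments of `WhisperModel.transcribe`. The sketch below shows one plausible mapping of the thresholds quoted above. The helper names, the device, and the compute type are assumptions, and the project's real call may differ.

```python
def transcribe_kwargs():
    """Keyword arguments for faster-whisper's WhisperModel.transcribe."""
    return {
        "word_timestamps": True,              # start/end for every word
        "condition_on_previous_text": True,   # use earlier lines as context
        "vad_filter": True,                   # drop non-speech before decoding
        "vad_parameters": {
            "min_silence_duration_ms": 2000,  # the 2s minimum silence
            "min_speech_duration_ms": 50,     # the 50ms minimum speech
        },
    }

def transcribe_words(audio_path):
    """Return (word, start, end) tuples; needs faster-whisper and a CUDA GPU."""
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, **transcribe_kwargs())
    return [(w.word.strip(), w.start, w.end)
            for segment in segments
            for w in segment.words]
```

Keeping the options in one dictionary makes it easy to toggle `condition_on_previous_text` or the VAD thresholds while comparing transcripts of the same song.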

🧰 Tech Stack

| Package | Purpose |
|---|---|
| `faster-whisper` | Word-level transcription (large-v3, ~1.55B) |
| `transformers` + `accelerate` | MiniCPM5-1B for line segmentation + animation |
| `outlines` | JSON schema enforcement for structured LLM output |
| `diffusers` | SD-Turbo for AI storyboard backgrounds |
| `pillow` | Frame rendering (1280×720 @ 30fps) |
| `gradio` 6.18 | Web UI with custom CSS |
| `spaces` | HF ZeroGPU decorator |
| `demucs` | Optional vocal separation |
| FFmpeg (system) | Video encoding via stdin pipe |

🏅 Hackathon Merit Badges

| Badge | Status |
|---|---|
| 🔌 Off the Grid | ✅ No cloud APIs; everything runs on local models |
| 🎨 Off-Brand | ✅ Custom dark glassmorphic UI with gradient headers, custom fonts |
| 📓 Field Notes | ✅ Blog post documenting architecture and learnings |

🚀 Run Locally

# Clone
git clone https://huggingface.co/spaces/Blazestorm001/aMuseMe
cd aMuseMe

# Install (requires Python 3.10 or newer, FFmpeg, and a CUDA GPU)
pip install -r requirements.txt

# Launch
python app.py

Or use uv:

uv run gradio app.py

📜 Credits

Sample music from Pixabay (royalty-free)
Built with ❤️ for the HF Build Small Hackathon 2026

OUTPUT SONG VIDEO:

https://youtu.be/GBOrS2fsQ2E

APP DEMO VIDEO:

https://youtu.be/6RJwgFu6LHQ

Tested on:

RTX 5060 Ti 16 GB

SOCIAL MEDIA POST:

https://dev.to/blazestorm/amuseme-when-small-models-compose-a-visual-symphony-50fc