metadata
title: aMuseMe
emoji: 🎡
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.18.0
app_file: app.py
python_version: '3.10'
pinned: false
license: mit
tags:
  - whisper
  - minicpm
  - outlines
  - sd-turbo
  - lyric-video
  - kinetic-typography
  - audio
  - music
  - hackathon
  - build-small
  - track:wood
  - sponsor:openbmb
  - sponsor:openai
  - achievement:offgrid
  - achievement:fieldnotes
short_description: AI-powered kinetic typography lyric video generator

🎡 aMuseMe: AI Lyric Video Generator

Drop a song. Watch your lyrics come alive.

aMuseMe turns a raw audio file into a fully synchronized, AI-generated kinetic typography lyric video, complete with word-level timing, mood-matched animations, and cinematic storyboard backgrounds. No manual keyframing. No lyrics needed. Just music in, video out.

Built for the Hugging Face Build Small Hackathon, Track 2: An Adventure in Thousand Token Wood 🍄


✨ How It Works

Four AI models work together in a pipeline:

| Stage | Model | Params | What it does |
|---|---|---|---|
| 1. Listen | Whisper large-v3 | ~1.55B | Transcribes audio with word-level timestamps (start/end for every word) |
| 2. Direct | MiniCPM5-1B + Outlines | ~1B | Segments words into display lines; picks mood-matched frame animations (zoom, pan, flash, fade) via JSON-schema-enforced structured generation |
| 3. Illustrate | SD-Turbo | ~865M | Generates cinematic storyboard backgrounds from lyrics + user style prompt in a single diffusion step |
| 4. Render | Pillow + FFmpeg | n/a | Renders 1280×720 frames at 30fps with word-by-word highlights and cross-fade transitions, and pipes them directly to H.264 |

Total: ~3.5B parameters, well within the 32B hackathon limit.


🎬 Features

  • 🎤 Audio-only input: just upload a song, no lyrics needed
  • 🔤 Word-level sync: each word lights up precisely as it's sung, not line by line
  • 🧠 LLM-directed line breaks: MiniCPM5-1B decides where lines split for maximum readability and dramatic pacing
  • 🎬 Mood-matched animations: the LLM picks zoom, pan, flash, or fade effects per line based on lyrical mood
  • 🎨 AI storyboard backgrounds: SD-Turbo paints a unique backdrop for every pair of lyric lines
  • 🌈 3 visual themes: Dark (white text), Light (warm gold), Neon (cyan glow)
  • 🔤 4 font families: Sans Serif, Sans Serif Bold, Serif Bold, Monospace Bold
  • ⚡ Structured generation: Outlines guarantees valid JSON from the LLM every time (no parsing failures)
  • 🔇 VAD filtering: Voice Activity Detection prevents hallucinated lyrics during instrumental breaks
  • 🎸 Demucs vocal separation: optional vocal isolation for songs with heavy instrumentation
  • 🔌 ZeroGPU compatible: @spaces.GPU decorators for efficient shared-GPU execution on HF Spaces
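The word-level sync feature above comes down to one lookup per frame: which word, if any, is being sung at this timestamp. The following is a minimal sketch of that lookup, not the app's actual renderer; `TimedWord`, `active_word_index`, and `frame_time` are hypothetical names introduced here for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedWord:
    """One transcribed word with its timing in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def frame_time(frame_index: int, fps: int = 30) -> float:
    """Timestamp in seconds of a frame at the given frame rate."""
    return frame_index / fps


def active_word_index(words: Sequence[TimedWord], t: float) -> Optional[int]:
    """Return the index of the word being sung at time t, or None in a gap.

    Assumes `words` is sorted by start time and words do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word whose start <= t
    if i >= 0 and t < words[i].end:
        return i
    return None
```

A renderer could call `active_word_index(words, frame_time(n))` for each frame `n` and draw the returned word in the highlight color. In a real loop the `starts` list would be built once rather than per call.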

πŸ—οΈ Architecture Decisions

Why Whisper large-v3?

We started with whisper-base (74M), but its word-level timestamp accuracy was poor for songs with fast vocals. large-v3 (~1.55B) gives near-perfect word boundaries. We also use faster-whisper (CTranslate2), which its maintainers report as up to 4× faster than the original OpenAI implementation.
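As a rough illustration of the transcription step, the sketch below asks faster-whisper for word timestamps and flattens the result into one list. It is written under assumptions: the `Word` type and both function names are hypothetical, and the device and compute type are example values, not settings confirmed by this project.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Word:
    """A single word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def flatten_words(segments: Iterable) -> List[Word]:
    """Collect per-word timings from faster-whisper segments into one flat list."""
    words: List[Word] = []
    for seg in segments:
        # seg.words is None when word timestamps were not requested.
        for w in (seg.words or []):
            words.append(Word(w.word.strip(), float(w.start), float(w.end)))
    return words


def transcribe_words(audio_path: str) -> List[Word]:
    """Transcribe a file with large-v3 and return word-level timings."""
    # Imported lazily so flatten_words works without the library or a GPU.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions for a CUDA machine.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(segments)
```

`model.transcribe` returns a lazy generator, so the actual decoding happens while `flatten_words` iterates over the segments.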

Why MiniCPM5-1B + Outlines (not rule-based)?

Rule-based line splitting (by silence gaps) produces mechanical, unnatural breaks. An LLM understands phrase structure: it knows "breaking all of these chains" should stay together, not split after "of". MiniCPM5-1B is small enough to run alongside Whisper and SD-Turbo on a single GPU. Outlines enforces a Pydantic JSON schema at the token level, so the model cannot produce invalid output, eliminating JSON parsing failures entirely.
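Schema enforcement guarantees well-formed JSON and allowed enum values, but a schema typically cannot express cross-field rules such as "the lines must cover every word exactly once, in order". The sketch below shows one way such a post-check could look. It uses only the standard library as a stand-in for the Pydantic model, and the field names (`first_word`, `last_word`, `animation`) and the `lines` key are hypothetical, not taken from the app.

```python
import json
from dataclasses import dataclass
from typing import List

# The four effects named in this README.
ANIMATIONS = ("zoom", "pan", "flash", "fade")


@dataclass(frozen=True)
class LineDirection:
    first_word: int  # index of the line's first word (hypothetical field)
    last_word: int   # index of the line's last word, inclusive (hypothetical field)
    animation: str   # one of ANIMATIONS


def parse_plan(raw_json: str, n_words: int) -> List[LineDirection]:
    """Parse an LLM plan and check that its lines tile the words 0..n_words-1."""
    lines = [LineDirection(**item) for item in json.loads(raw_json)["lines"]]
    expected = 0  # index the next line must start at
    for line in lines:
        if line.animation not in ANIMATIONS:
            raise ValueError(f"unknown animation: {line.animation!r}")
        if line.first_word != expected or line.last_word < line.first_word:
            raise ValueError(f"line does not continue at word {expected}")
        expected = line.last_word + 1
    if expected != n_words:
        raise ValueError(f"plan covers {expected} words, expected {n_words}")
    return lines
```

If the check fails, a caller might retry the generation or fall back to a simple silence-gap split; which fallback fits best depends on the app.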

Why SD-Turbo (not SDXL)?

SD-Turbo generates images in a single diffusion step (guidance_scale=0.0). For a 3-minute song with 15 storyboard images, that's ~2 seconds total. SDXL would need 20-50 steps per image, taking minutes instead of seconds.
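The single-step setup described above can be sketched with diffusers as follows. This is an illustrative sketch, not the app's code: `pair_lines`, `build_prompt`, and `generate_backgrounds` are hypothetical helpers, the prompt wording is invented for the example, and the fp16/CUDA settings are assumptions.

```python
from typing import List, Sequence


def pair_lines(lines: Sequence[str]) -> List[str]:
    """Join lyric lines two at a time; a trailing odd line stands alone."""
    return [" / ".join(lines[i:i + 2]) for i in range(0, len(lines), 2)]


def build_prompt(lyric_pair: str, style: str) -> str:
    """Combine the user's style prompt with the lyrics (wording is an assumption)."""
    return f"{style}, cinematic scene inspired by: {lyric_pair}"


def generate_backgrounds(lines: Sequence[str], style: str) -> list:
    """Generate one SD-Turbo image per pair of lyric lines."""
    # Imported lazily so the helpers above work without torch or a GPU.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    images = []
    for pair in pair_lines(lines):
        result = pipe(
            prompt=build_prompt(pair, style),
            num_inference_steps=1,  # single diffusion step
            guidance_scale=0.0,     # SD-Turbo is run without classifier-free guidance
        )
        images.append(result.images[0])
    return images
```

With this pairing, 30 display lines yield 15 images, which matches the scale of the 3-minute example above.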

Why FFmpeg stdin pipe (not MoviePy)?

We stream raw RGB bytes directly to FFmpeg via subprocess stdin. This avoids writing thousands of temp image files to disk, a massive I/O bottleneck on cloud runners. The entire assembly step is near-instantaneous.
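A minimal sketch of the stdin-pipe approach is shown below. The function names are hypothetical and the encoder flags are one reasonable choice, not necessarily the ones the app uses; audio muxing is left out to keep the example short.

```python
import subprocess
from typing import Iterable, List


def ffmpeg_command(out_path: str, width: int = 1280, height: int = 720,
                   fps: int = 30) -> List[str]:
    """Build an FFmpeg command that reads raw RGB frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",
        "-pixel_format", "rgb24",
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",               # "-" means: read input from stdin
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",   # widely compatible output pixel format
        out_path,
    ]


def frame_size_bytes(width: int = 1280, height: int = 720) -> int:
    """Bytes per rgb24 frame: 3 bytes per pixel."""
    return width * height * 3


def encode(frames: Iterable[bytes], out_path: str, width: int = 1280,
           height: int = 720, fps: int = 30) -> None:
    """Stream frames to FFmpeg without writing any image files to disk."""
    expected = frame_size_bytes(width, height)
    proc = subprocess.Popen(ffmpeg_command(out_path, width, height, fps),
                            stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != expected:
                raise ValueError(f"frame is {len(frame)} bytes, expected {expected}")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()       # signals end of input to FFmpeg
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {returncode}")
```

With Pillow, each frame can be produced by calling `tobytes()` on an RGB-mode image of the target size, which yields exactly `frame_size_bytes()` bytes.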

VAD + Condition on Previous Text

We found that condition_on_previous_text=True dramatically improves word accuracy (Whisper uses previous lines as context), but causes infinite hallucination loops during instrumental breaks. VAD (Voice Activity Detection) with aggressive silence thresholds (2s min silence, 50ms min speech) solves this by muting non-vocal sections before they reach Whisper.
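In faster-whisper terms, the combination described above could be expressed as a set of `transcribe` keyword arguments. The sketch below maps "2s min silence, 50ms min speech" onto the `min_silence_duration_ms` and `min_speech_duration_ms` VAD options; that mapping is an assumption, and `transcribe_kwargs` is a hypothetical helper.

```python
def transcribe_kwargs(min_silence_ms: int = 2000, min_speech_ms: int = 50) -> dict:
    """Keyword arguments for WhisperModel.transcribe with VAD enabled."""
    return {
        "word_timestamps": True,
        # Feed earlier text back as context to improve word accuracy.
        "condition_on_previous_text": True,
        # Drop non-speech audio before it reaches the decoder.
        "vad_filter": True,
        "vad_parameters": {
            "min_silence_duration_ms": min_silence_ms,
            "min_speech_duration_ms": min_speech_ms,
        },
    }
```

Given a loaded `WhisperModel` named `model`, the call would look like `model.transcribe(audio_path, **transcribe_kwargs())`.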


🧰 Tech Stack

| Package | Purpose |
|---|---|
| faster-whisper | Word-level transcription (large-v3, ~1.55B) |
| transformers + accelerate | MiniCPM5-1B for line segmentation + animation |
| outlines | JSON schema enforcement for structured LLM output |
| diffusers | SD-Turbo for AI storyboard backgrounds |
| pillow | Frame rendering (1280×720 @ 30fps) |
| gradio 6.18 | Web UI with custom CSS |
| spaces | HF ZeroGPU decorator |
| demucs | Optional vocal separation |
| FFmpeg (system) | Video encoding via stdin pipe |

πŸ… Hackathon Merit Badges

Badge Status
πŸ”Œ Off the Grid βœ… No cloud APIs β€” everything runs on local models
🎨 Off-Brand βœ… Custom dark glassmorphic UI with gradient headers, custom fonts
πŸ““ Field Notes βœ… Blog post documenting architecture and learnings

🚀 Run Locally

# Clone
git clone https://huggingface.co/spaces/Blazestorm001/aMuseMe
cd aMuseMe

# Install (requires Python 3.10+, the version pinned in the Space metadata, plus FFmpeg and a CUDA GPU)
pip install -r requirements.txt

# Launch
python app.py

Or use uv:

uv run gradio app.py

📜 Credits

Output song video:

https://youtu.be/GBOrS2fsQ2E

App demo video:

https://youtu.be/6RJwgFu6LHQ

Tested on:

RTX 5060 Ti 16 GB

Social media post:

https://dev.to/blazestorm/amuseme-when-small-models-compose-a-visual-symphony-50fc