| # aMuseMe: When Small Models Compose a Visual Symphony | |
| *Field Notes from "An Adventure in Thousand Token Wood", Build Small Hackathon 2026* | |
| --- | |
| Music videos are deeply personal artifacts. They transform a song from something you hear into something you *see* and *feel*. But creating even a simple lyric video (the kind where words appear in sync with the music) is a tedious, manual process. You're keyframing word timings in a video editor, aligning text to beats by ear, hunting for stock footage that "fits the vibe." Hours of work for a 3-minute song. | |
| We built **aMuseMe** to ask a different question: *What if you just dropped an audio file and got a complete, stylized lyric video back?* No lyrics needed. No timeline editing. No stock footage hunting. Just music in, video out. | |
| And we did it with **3.5 billion parameters total**. | |
| --- | |
| ## The Idea: Kinetic Typography Meets Small AI | |
| Kinetic typography (words that move, scale, and animate in sync with spoken audio) is one of the most engaging ways to present text on screen. Music lyric videos are a perfect application: every word has an exact timestamp, and every line has an emotional mood that could inform how it *looks*. | |
| We imagined a pipeline where: | |
| 1. An AI **listens** to the song and timestamps every word | |
| 2. An AI **reads** the lyrics and decides how to break them into display lines | |
| 3. An AI **illustrates** each section with a matching background painting | |
| 4. A renderer **animates** it all into a smooth, 30fps HD video | |
| The catch: all four steps had to fit inside the hackathon's 32B parameter budget. No cloud APIs. Everything local. | |
| --- | |
| ## The Architecture: Four Small Models, One Pipeline | |
| ### Stage 1: The Listener (Whisper large-v3, ~1.55B) | |
| We use `faster-whisper` (a CTranslate2-optimized port) to extract **word-level timestamps** from raw audio. Not sentence-level, but *word-level*. When the singer says "heart" at 4.72 seconds, we know the word starts at 4.72s and ends at 5.01s. | |
| This precision is what makes the final video feel alive. Words don't just appear line-by-line; each word lights up at the exact millisecond it's sung. | |
| **The tuning rabbit hole:** Getting accurate word timestamps from songs (not clean speech) required extensive experimentation: | |
| - **`condition_on_previous_text=True`** dramatically improves accuracy: Whisper uses its own previous output as context, so it "remembers" the song's vocabulary. But this causes **infinite hallucination loops** during instrumental breaks (Whisper fills the silence with repeated phantom lyrics). | |
| - **VAD (Voice Activity Detection)** solves the hallucination problem. We use aggressive thresholds (`min_silence_duration_ms=2000`, `speech_pad_ms=2000`, `min_speech_duration_ms=50`) so Whisper only sees audio segments where someone is actually singing. | |
| - We started with `whisper-base` (74M params) for speed, but word boundary accuracy was poor for fast vocals. `large-v3` was the sweet spot: accurate enough for songs, and still well within the 32B budget. | |
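The settings above map onto a `faster-whisper` call roughly as follows. This is a minimal sketch, not the project's actual code: the helper names `flatten_words` and `transcribe_words`, the `device` and `compute_type` choices, and the shape of the returned dictionaries are assumptions made for illustration.

```python
# VAD thresholds quoted in the notes above
VAD_PARAMETERS = {
    "min_silence_duration_ms": 2000,
    "speech_pad_ms": 2000,
    "min_speech_duration_ms": 50,
}


def flatten_words(segments):
    """Collapse Whisper segments into one flat list of timestamped words."""
    words = []
    for segment in segments:
        # segment.words is only populated when word_timestamps=True
        for w in segment.words or []:
            words.append({"word": w.word.strip(), "start": w.start, "end": w.end})
    return words


def transcribe_words(audio_path, model_size="large-v3"):
    """Hypothetical wrapper: run faster-whisper with word timestamps and VAD."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency

    # device and compute_type are assumptions; adjust for your hardware
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(
        audio_path,
        word_timestamps=True,
        condition_on_previous_text=True,
        vad_filter=True,
        vad_parameters=VAD_PARAMETERS,
    )
    return flatten_words(segments)
```

Calling `transcribe_words("song.mp3")` would yield entries such as `{"word": "heart", "start": 4.72, "end": 5.01}`, which is the flat list the later stages consume.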
| ### Stage 2: The Director (MiniCPM5-1B + Outlines) | |
| This is the creative brain of the pipeline. Raw Whisper output is a flat list of timestamped words, but a lyric video needs *lines*. "Every heartbeat echoes feel like grooving my veins" needs to become: | |
| ``` | |
| Every heartbeat echoes → line 1 | |
| feel like grooving my veins → line 2 | |
| ``` | |
| A rule-based approach (split on silence gaps, cap at 7 words) works, but it produces mechanical, unnatural breaks. An LLM understands phrase structure: it knows "breaking all of these chains" should stay together. | |
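For comparison, the rule-based baseline can be sketched in a few lines. The function name and the `max_gap` threshold of 0.6 seconds are illustrative assumptions; only the "split on silence gaps, cap at 7 words" rule comes from the notes above.

```python
def split_lines_rule_based(words, max_words=7, max_gap=0.6):
    """Start a new line after a silence gap or once the word cap is reached.

    `words` is a list of {"word", "start", "end"} dicts sorted by time.
    `max_gap` (seconds) is an illustrative threshold, not a tuned value.
    """
    lines, current = [], []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            if gap > max_gap or len(current) >= max_words:
                lines.append(current)
                current = []
        current.append(w)
    if current:
        lines.append(current)
    return lines
```

Because the rule only looks at timing and word counts, it will happily cut a phrase in half when the singer holds a note, which is the kind of break the LLM stage is meant to avoid.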
| We use **MiniCPM5-1B** (by OpenBMB, one of the hackathon's anchor sponsors), a 1B-parameter language model that's small enough to run alongside Whisper and SD-Turbo on a single GPU. For each chunk of ~10 words, the model: | |
| 1. **Splits words into display lines**, deciding how many words belong on each line | |
| 2. **Picks a frame animation**: `zoom_in` for emphasis, `flash` for a dramatic hit, `fade_to_black` for a quiet ending, `pan_left`/`pan_right` for gentle movement | |
| **The structured generation breakthrough:** The biggest challenge with small LLMs is output reliability. A 1B model often produces malformed JSON, missing fields, or hallucinated keys. We solved this with **Outlines**, a library that constrains the LLM's token generation to match a Pydantic schema at decode time. The model *cannot* emit a token that violates the schema, so as long as generation is not cut off by the token limit, the output always parses. No retries, no regex extraction, no parsing failures. | |
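| ```python | |
| from typing import List, Literal | |
| from outlines import Generator, from_transformers | |
| from pydantic import BaseModel | |
| # Assumed definition: the animation names listed above | |
| FrameAnim = Literal["zoom_in", "flash", "fade_to_black", "pan_left", "pan_right"] | |
| class Frame(BaseModel): | |
|     count: int                  # how many words on this line | |
|     frame_animation: FrameAnim  # zoom_in, flash, pan_left, etc. | |
| class SongFrames(BaseModel): | |
|     frames: List[Frame] | |
| # hf_model, tokenizer, and prompt are created earlier in the pipeline | |
| model = from_transformers(hf_model, tokenizer) | |
| generator = Generator(model, SongFrames)  # schema-enforced decoding | |
| raw = generator(prompt, max_new_tokens=256)  # JSON text that follows the schema | |
| result = SongFrames.model_validate_json(raw)  # parse into a SongFrames object | |
| ``` | |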
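Once the model has returned its frames, the per-line `count` values still have to be mapped back onto the timestamped words. The sketch below shows one way to do that; `apply_frames`, the output dictionary layout, and the `"none"` fallback animation are hypothetical. The clamping is there because the schema guarantees the shape of the output, not that the counts add up to the number of words.

```python
def apply_frames(words, frames):
    """Slice a flat word list into display lines using the model's counts.

    `words` is a list of {"word", "start", "end"} dicts.
    `frames` is a list of {"count", "frame_animation"} dicts, i.e. the shape
    of SongFrames.frames after .model_dump().
    """
    lines, cursor = [], 0
    for frame in frames:
        count = max(1, frame["count"])  # guard against zero or negative counts
        chunk = words[cursor:cursor + count]
        if not chunk:
            break  # the model asked for more lines than there are words
        lines.append({
            "words": chunk,
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "animation": frame["frame_animation"],
        })
        cursor += count
    if cursor < len(words):
        # Leftover words the model did not cover go on a final plain line.
        rest = words[cursor:]
        lines.append({
            "words": rest,
            "start": rest[0]["start"],
            "end": rest[-1]["end"],
            "animation": "none",  # hypothetical fallback value
        })
    return lines
```

Each resulting line carries its own start and end time, so the renderer can decide which line is on screen without consulting the model output again.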
| ### Stage 3: The Illustrator (SD-Turbo, ~865M) | |
| For each pair of lyric lines, we generate a cinematic background image using **SD-Turbo** (Stability AI's distilled Stable Diffusion model). The magic of SD-Turbo: it generates high-quality images in **a single inference step** with `guidance_scale=0.0`. | |
| We merge the lyric text with the user's style prompt: | |
| ``` | |
| "neon-lit futuristic city at night, vibrant glowing colors, | |
| cyberpunk aesthetic, breaking all of these chains" | |
| ``` | |
| For a 3-minute song with ~15 storyboard images, the entire background generation step takes **~2 seconds on GPU**. The backgrounds are then darkened (55% overlay) so white/neon lyric text remains readable on any generated image. | |
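The prompt merging, one-step generation, and darkening described above could look roughly like this. It is a sketch under assumptions: the helper names, the 512-pixel size, the fp16 settings, and interpreting "55% overlay" as a 0.55 blend toward black are all illustrative, and a real pipeline would load the model once rather than on every call.

```python
def build_prompt(style_prompt, lyric_text):
    """Append the lyric text to the user's style prompt, comma-separated."""
    style = style_prompt.strip().rstrip(",")
    lyric = " ".join(lyric_text.split())  # collapse stray whitespace
    return f"{style}, {lyric}" if lyric else style


def generate_background(prompt, size=512):
    """Hypothetical helper: one-step SD-Turbo generation via diffusers."""
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    result = pipe(
        prompt=prompt,
        num_inference_steps=1,  # single denoising step
        guidance_scale=0.0,     # SD-Turbo is used without classifier-free guidance
        height=size,
        width=size,
    )
    return result.images[0]  # a PIL image


def darken(image, amount=0.55):
    """Blend the image toward black so light text stays readable."""
    from PIL import Image

    black = Image.new("RGB", image.size, (0, 0, 0))
    return Image.blend(image.convert("RGB"), black, amount)
```

With `amount=0.55` each pixel keeps 45% of its original brightness, which is dark enough for white or neon text while leaving the scene recognizable.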
| ### Stage 4: The Renderer (Pillow + FFmpeg) | |
| The final stage is a custom frame-by-frame renderer built with Pillow: | |
| - **Word-level highlighting**: Words in the current line are shown in the theme's active color; unspoken words are dimmed. As each word's timestamp arrives, it lights up. | |
| - **Frame-level animations**: The LLM-chosen animation (zoom, pan, flash, fade) is applied to the entire text block, creating cinematic movement. | |
| - **Smart text wrapping**: Long lines automatically break across multiple rows instead of shrinking to unreadable sizes. | |
| - **Cross-fade transitions**: Background images blend smoothly with 1-second alpha transitions. | |
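Two of the per-frame decisions in the list above reduce to small pure functions: which words are lit at time `t`, and how far a background cross-fade has progressed. The sketch below is illustrative; the function names and the choice to start the fade exactly at the switch time are assumptions.

```python
def word_states(line_words, t):
    """Label each word in the current line for a frame at time `t` (seconds).

    "active" means the word's start time has passed; "dimmed" means it has
    not been sung yet.
    """
    return ["active" if t >= w["start"] else "dimmed" for w in line_words]


def crossfade_alpha(t, switch_time, duration=1.0):
    """Blend weight of the incoming background around `switch_time`.

    Returns 0.0 before the fade starts, 1.0 once it has finished, and a
    linear ramp in between.
    """
    progress = (t - switch_time) / duration
    return min(1.0, max(0.0, progress))
```

At 30 fps the renderer would evaluate both functions once per frame with `t = frame_index / 30`, then draw each word in the active or dimmed color and blend the two background images by the returned weight.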
| The frames are streamed as raw RGB bytes directly to an FFmpeg subprocess via a stdin pipe, so no temp files are written to disk. This avoids the I/O bottleneck that plagues cloud runners and keeps the assembly step near-instantaneous. | |
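A minimal sketch of the stdin-pipe approach is shown below. The codec choices (`libx264`, `aac`), the `-shortest` flag, and the function names are assumptions for illustration; the notes only state that raw RGB frames are piped to FFmpeg.

```python
import subprocess


def build_ffmpeg_cmd(width, height, fps, audio_path, out_path):
    """FFmpeg arguments that read raw RGB frames from stdin and mux the audio."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-framerate", str(fps),
        "-i", "-",            # video input: raw frames from stdin
        "-i", audio_path,     # audio input: the original song
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ]


def encode(frames, width, height, fps, audio_path, out_path):
    """Stream frames (bytes of length width * height * 3) straight into FFmpeg."""
    cmd = build_ffmpeg_cmd(width, height, fps, audio_path, out_path)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    try:
        for frame in frames:
            proc.stdin.write(frame)  # e.g. the result of tobytes() on an RGB PIL image
    finally:
        proc.stdin.close()           # signals end of input so FFmpeg can finish
    return proc.wait()               # FFmpeg's exit code
```

Because `frames` can be a generator, only one frame needs to be held in memory at a time, and nothing touches the disk until FFmpeg writes the final file.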
| --- | |
| ## What Makes This "Thousand Token Wood"? | |
| Track 2 asks for something **delightful that wouldn't exist without AI**. aMuseMe isn't an AI chatbot or a productivity tool; it's a creative instrument. You feed it a song, and four small AI models collaborate to produce something that would take a human editor hours: | |
| - **Would you show a friend?** Absolutely. "Drop your favorite song and get a lyric video in 90 seconds" is an instant demo. | |
| - **Is AI load-bearing?** Remove any of the four models and the experience collapses. Without Whisper, no word sync. Without MiniCPM5-1B, ugly line breaks and no animation direction. Without SD-Turbo, no visual atmosphere. | |
| - **Is it original?** We haven't seen another project that chains speech-to-text → structured LLM direction → text-to-image → kinetic typography rendering in a single pipeline. The "AI as video director" concept, where the LLM doesn't just format text but actually makes creative decisions about animation, is, to our knowledge, novel. | |
| - **Is it polished?** Three visual themes, four font families, a cyberpunk-inspired dark UI, sample songs to try instantly, and a one-click generation button. | |
| --- | |
| ## Off the Grid: No Cloud APIs | |
| The entire pipeline runs on-device. Whisper, MiniCPM5-1B, SD-Turbo, and Demucs are all local models loaded into GPU memory. No OpenAI API, no Stability API, no cloud dependencies. On HF Spaces, we use ZeroGPU (`@spaces.GPU`) for efficient shared-GPU allocation, but the computation still happens on HF's own hardware, with no calls out to external services. | |
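For readers unfamiliar with ZeroGPU, the decorator is applied to the function that needs the GPU. The sketch below adds a no-op fallback so the same file also runs outside Spaces; the fallback and the `generate_video` entry point are illustrative assumptions, not the project's code.

```python
try:
    import spaces  # available on Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:
    # Local run without the `spaces` package: fall back to a no-op decorator.
    def gpu(fn):
        return fn


@gpu
def generate_video(audio_path):
    """Hypothetical entry point: on Spaces, a GPU is attached while this runs."""
    # Transcription, line direction, and background generation would go here.
    return f"rendered:{audio_path}"
```

On Spaces the GPU is allocated when `generate_video` is called and released when it returns, so the models themselves still run locally inside the Space.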
| --- | |
| ## What We Learned | |
| 1. **Structured generation changes everything for small models.** A 1B model that always outputs valid JSON via Outlines is more reliable than a 70B model that you hope will format correctly. The constraint isn't a limitation; it's a superpower. | |
| 2. **Word-level sync is the "wow" factor.** Line-by-line lyrics feel like karaoke from 2005. Word-by-word highlighting with millisecond precision feels *magical*. The difference in viewer engagement is enormous. | |
| 3. **Whisper needs babysitting for music.** VAD, condition-on-previous-text, compression ratio thresholds, temperature scheduling: we spent more time tuning Whisper parameters than writing the renderer. Songs are fundamentally harder than speech. | |
| 4. **Pipes over disk.** Streaming raw bytes to FFmpeg via stdin was a 10× performance win over writing temp frames to disk. On cloud runners with slow I/O, this is the difference between a 10-second and a 100-second pipeline. | |
| 5. **One-step diffusion is a game-changer for pipelines.** SD-Turbo generating 15 images in 2 seconds means background generation is no longer a bottleneck. It's fast enough to be a utility, not a feature. | |
| --- | |
| *Try it yourself. Drop a song and watch the magic happen:* | |
| **[aMuseMe on Hugging Face Spaces](https://huggingface.co/spaces/build-small-hackathon/aMuseMe)** | |
## Output Song Video
https://youtu.be/GBOrS2fsQ2E
## App Demo Video
| https://youtu.be/6RJwgFu6LHQ | |