---
tags:
- ml-intern
---

# 🎵 lyric-sync

**Automatic song lyric acquisition and synchronization.**

Produces word-level synchronized lyrics with sub-10 ms precision from any audio file.

## Pipeline Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                      lyric-sync Pipeline                       │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │  Input   │    │  Demucs  │    │ WhisperX │    │  Output  │  │
│  │  Audio   │───▶│  Vocals  │───▶│Transcribe│───▶│  Synced  │  │
│  │  (mix)   │    │Separation│    │ + Timing │    │  Lyrics  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│       │                               ▲               ▲        │
│       │                               │               │        │
│       ▼                               │               │        │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │AcoustID  │───▶│  Fetch   │    │Align ASR │    │  Refine  │  │
│  │ Identify │    │Reference │───▶│to Lyrics │───▶│ Onsets/  │  │
│  │   Song   │    │  Lyrics  │    │(transfer │    │ Offsets  │  │
│  └──────────┘    └──────────┘    │ timings) │    └──────────┘  │
│       │                          └──────────┘                  │
│       ▼ (fallback)                                             │
│  ┌──────────┐                                                  │
│  │Transcript│                                                  │
│  │  Search  │                                                  │
│  └──────────┘                                                  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```

## Steps in Detail

### 1. Song Identification
- **Primary**: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata
- **Fallback**: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text

### 2. Vocal Stem Separation (Demucs)
- Uses `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
- Produces a clean vocal track, dramatically improving downstream ASR accuracy
- Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing

### 3. Word-Level Transcription (WhisperX)
- **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
- The decoupled approach is robust to timing drift on stretched/sung syllables
- Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1

### 4. Reference Lyrics Acquisition
- **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps
- **syncedlyrics** (multi-source): Aggregates LRCLIB + NetEase + Musixmatch + Megalobiz
- **Genius** (fallback): Plain text lyrics, requires API key

### 5. Sequence Alignment (ASR → Reference)
- Maps imperfect ASR output onto the correct reference lyrics text
- Uses `difflib.SequenceMatcher` (global alignment built from longest matching blocks)
- Handles: exact matches (direct transfer), substitutions (interpolation), gaps
- Optional fuzzy pre-pass for phonetic ASR errors ("gonna" → "going to")
- Handles repeated sections (chorus/verse) via sectional alignment

### 6. Timing Refinement (Audio Analysis)
- **Onset detection**: Spectral flux + librosa ODF → snap word starts to actual sound onsets
- **Energy envelope**: RMS decay → find precise word endings
- **Silence gaps**: Detect inter-word pauses → refine boundaries
- **Backtracking**: Snaps to the energy trough preceding each onset (true word start)
- Result: sub-10 ms precision (5.8 ms frame resolution at 44100 Hz, hop=256)

## Installation

```bash
# Core (separation + refinement)
pip install lyric-sync

# With WhisperX transcription (recommended)
pip install "lyric-sync[whisperx]"

# With song identification
pip install "lyric-sync[identify]"

# Everything
pip install "lyric-sync[all]"

# System dependency: chromaprint (for AcoustID fingerprinting)
# Ubuntu/Debian: sudo apt-get install libchromaprint-tools ffmpeg
# macOS: brew install chromaprint ffmpeg
```

## Usage

### CLI

```bash
# Full automatic (identify + fetch lyrics + sync)
lyric-sync song.mp3 --acoustid-key YOUR_KEY -v

# With known metadata (faster, skips fingerprinting)
lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc

# JSON output for apps
lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json

# ASS karaoke subtitles
lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass

# CPU-only processing (slower but no GPU needed)
lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
```

### Python API

```python
from lyric_sync import LyricSyncPipeline

# Initialize
pipeline = LyricSyncPipeline(
    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
    device="cuda",                     # or "cpu"
)

# Full auto
result = pipeline.sync("song.mp3")

# With known metadata
result = pipeline.sync(
    "song.mp3",
    artist="Radiohead",
    title="Creep",
)

# Access results
print(result.song)           # SongIdentification(title=..., artist=...)
print(result.quality_score)  # 0.85 (0-1 quality estimate)

# Export
print(result.to_lrc())   # Enhanced LRC with word-level timestamps
print(result.to_json())  # JSON array of {word, start, end, confidence}
print(result.to_srt())   # SRT subtitles
print(result.to_ass())   # ASS karaoke with \k tags
```

### Step-by-Step (Advanced)

```python
from lyric_sync.separate import VocalSeparator
from lyric_sync.transcribe import transcribe_vocals
from lyric_sync.lyrics import fetch_lyrics
from lyric_sync.align import align_words
from lyric_sync.refine import refine_timings

# 1. Separate vocals
separator = VocalSeparator(device="cuda")
vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")

# 2. Transcribe
transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")

# 3. Fetch lyrics
lyrics = fetch_lyrics(artist="Radiohead", title="Creep")

# 4. Align
aligned_words, stats = align_words(
    asr_words=transcript.words,
    ref_words=lyrics.words,
)

# 5. Refine
refined_words = refine_timings(vocals_full, sr_full, aligned_words)
```

## Output Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `lrc` (enhanced) | `[MM:SS.cc] word ...` | Music players with word-level sync |
| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
| `srt` | Standard SRT subtitles | Video players |
| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |

## Configuration

### Environment Variables

| Variable | Description |
|----------|-------------|
| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |

### Hardware Requirements

| Component | GPU (CUDA) | CPU |
|-----------|-----------|-----|
| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
| **Total** | **~10-12 GB VRAM** | **~16 GB RAM** |
| Processing time (4 min song) | ~30-60 s | ~5-10 min |

### Transcription Backends

| Backend | Quality (singing) | Speed | Dependencies |
|---------|----------|-------|--------------|
| **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
| Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
| Granite Speech | Unknown (speech-trained) | Medium | `transformers` |

## How It Works (Technical)

### Alignment Algorithm

The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps on the *correct* lyrics. We solve this with sequence alignment:

1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions)
2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents
3. **SequenceMatcher**: Compute a global alignment from the longest matching blocks (O(n²) worst case; `difflib` does not guarantee a minimal edit sequence)
4. **Transfer**: For `equal` blocks → direct timestamp copy. For `replace` → linear interpolation
5. **Gap-fill**: Interpolate from surrounding anchors for missed words

### Onset Detection for Refinement

After alignment gives ~20-50 ms accuracy, we refine to ~5-10 ms using:

1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point)
3. **RMS decay**: Word ends are found where energy drops below a threshold
4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors

## References

- **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747), forced phoneme alignment
- **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553), Hybrid Transformer source separation
- **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514), Demucs+Whisper for lyrics
- **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817), In-Sync timestamp training
- **LRCLIB**: [lrclib.net](https://lrclib.net), community synced lyrics database
- **AcoustID**: [acoustid.org](https://acoustid.org), open audio fingerprint database

## License

MIT

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
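
## Appendix: Alignment Sketch

The transfer and gap-fill steps listed under "Alignment Algorithm" can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in `lyric_sync.align`: the function names `normalize`, `transfer_timings`, and `fill_gaps` are hypothetical, ASR words are assumed to be `(text, start, end)` tuples, and the contraction expansion and fuzzy pre-pass are left out.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and keep only letters, digits, and apostrophes."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def fill_gaps(timed):
    """Interpolate untimed words between the nearest timed neighbours."""
    result = list(timed)
    n = len(result)
    j = 0
    while j < n:
        if result[j][1] is not None:
            j += 1
            continue
        run_end = j
        while run_end < n and result[run_end][1] is None:
            run_end += 1
        # A run is only filled when it has an anchor on both sides.
        if j > 0 and run_end < n:
            left, right = result[j - 1][2], result[run_end][1]
            step = (right - left) / (run_end - j)
            for k in range(j, run_end):
                offset = k - j
                result[k] = (result[k][0],
                             left + offset * step,
                             left + (offset + 1) * step)
        j = run_end
    return result


def transfer_timings(asr_words, ref_words):
    """Map (text, start, end) ASR tuples onto the reference word list.

    Returns one (ref_word, start, end) tuple per reference word; start and
    end stay None when no timing could be assigned.
    """
    asr_norm = [normalize(text) for text, _, _ in asr_words]
    ref_norm = [normalize(word) for word in ref_words]
    matcher = SequenceMatcher(a=asr_norm, b=ref_norm, autojunk=False)

    timed = [(word, None, None) for word in ref_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words: copy the ASR timestamps directly.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                timed[j] = (ref_words[j], asr_words[i][1], asr_words[i][2])
        elif tag == "replace":
            # Mismatched span: spread the reference words evenly over the
            # time covered by the ASR words they replace.
            span_start, span_end = asr_words[i1][1], asr_words[i2 - 1][2]
            step = (span_end - span_start) / (j2 - j1)
            for k, j in enumerate(range(j1, j2)):
                timed[j] = (ref_words[j],
                            span_start + k * step,
                            span_start + (k + 1) * step)
        # "delete": extra ASR words are dropped.
        # "insert": reference words the ASR missed stay None for fill_gaps.
    return fill_gaps(timed)


# Made-up example: the ASR mishears "two" and misses "four" entirely.
asr = [("one", 0.0, 0.4), ("to", 0.5, 0.9), ("three", 1.0, 1.4), ("five", 2.0, 2.4)]
ref = ["One,", "two,", "three,", "four,", "five!"]
for word, start, end in transfer_timings(asr, ref):
    print(word, start, end)
```

`autojunk=False` is passed because `difflib`'s default heuristic can treat very frequent items as junk once a sequence reaches 200 elements, which in a full song could exclude common words from matching.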