| --- |
| tags: |
| - ml-intern |
| --- |
| # 🎵 lyric-sync |
|
|
| **Automatic song lyric acquisition and word-level synchronization.** |
|
|
| Produces word-level synchronized lyrics from an audio file, with timestamps refined on a sub-10 ms frame grid (about 5.8 ms at 44.1 kHz, hop=256). |
|
|
| ## Pipeline Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────────────────┐ |
| │ lyric-sync Pipeline │ |
| ├─────────────────────────────────────────────────────────────────────────┤ |
| │ │ |
| │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ |
| │ │ Input │ │ Demucs │ │ WhisperX │ │ Output │ │ |
| │ │ Audio │───▶│ Vocals │───▶│Transcribe│───▶│ Synced │ │ |
| │ │ (mix) │ │Separation│ │ + Timing │ │ Lyrics │ │ |
| │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ |
| │ │ ▲ ▲ │ |
| │ │ │ │ │ |
| │ ▼ │ │ │ |
| │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ |
| │ │AcoustID │───▶│ Fetch │ │Align ASR │ │ Refine │ │ |
| │ │ Identify │ │Reference │───▶│to Lyrics │───▶│ Onsets/ │ │ |
| │ │ Song │ │ Lyrics │ │(transfer │ │ Offsets │ │ |
| │ └──────────┘ └──────────┘ │ timings) │ └──────────┘ │ |
| │ │ └──────────┘ │ |
| │ ▼ (fallback) │ |
| │ ┌──────────┐ │ |
| │ │Transcript│ │ |
| │ │ Search │ │ |
| │ └──────────┘ │ |
| │ │ |
| └─────────────────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| ## Steps in Detail |
|
|
| ### 1. Song Identification |
| - **Primary**: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata |
| - **Fallback**: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text |
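
The transcript-search fallback can be thought of as ranking candidate lyrics by text similarity. The snippet below is a minimal sketch, not the package's implementation: `best_candidate`, the Jaccard score, and the `min_score` threshold are assumptions made for illustration, and in practice the candidates would come from a lyrics database search.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, in the range 0..1."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_candidate(transcript: str, candidates, min_score: float = 0.3):
    """candidates: list of (title, lyrics). Returns (title, score) or None."""
    scored = [(jaccard(transcript, lyrics), title) for title, lyrics in candidates]
    if not scored:
        return None
    score, title = max(scored)
    return (title, score) if score >= min_score else None


candidates = [
    ("Creep", "But I'm a creep, I'm a weirdo. What the hell am I doing here? I don't belong here"),
    ("Other", "Is this the real life is this just fantasy"),
]
match = best_candidate("but i'm a creep i'm a weirdo what the hell am i doing here", candidates)
```

A word-set score ignores word order, so it is only suitable for picking a candidate; the ordered alignment in step 5 is still needed for timing.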
|
|
| ### 2. Vocal Stem Separation (Demucs) |
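| - Uses `htdemucs_ft`, the fine-tuned Hybrid Transformer Demucs model (the HTDemucs paper reports up to ~9.2 dB SDR on MUSDB18-HQ for its best fine-tuned configuration) |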
| - Produces clean vocal track, dramatically improving downstream ASR accuracy |
| - Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing |
|
|
| ### 3. Word-Level Transcription (WhisperX) |
| - **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment |
| - Decoupled approach is robust to timing drift on stretched/sung syllables |
| - Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1 |
|
|
| ### 4. Reference Lyrics Acquisition |
| - **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps |
| - **syncedlyrics** (multi-source): Aggregates LRCLIB + NetEase + Musixmatch + Megalobiz |
| - **Genius** (fallback): Plain text lyrics, requires API key |
|
|
| ### 5. Sequence Alignment (ASR → Reference) |
| - Maps imperfect ASR output onto correct reference lyrics text |
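| - Uses `difflib.SequenceMatcher` (recursively matches the longest contiguous blocks; a practical heuristic alignment, not a guaranteed-minimal edit script) |
| - Handles exact matches (direct timestamp transfer), substitutions (interpolation), and gaps (filled from neighboring anchors) |
| - Optional fuzzy pre-pass for near-miss ASR words and colloquial forms ("gonna" → "going to") |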
| - Handles repeated sections (chorus/verse) via sectional alignment |
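
To see what the matcher produces, here is a small sketch using only the standard library. The `normalize` helper and the sample word lists are hypothetical; only `difflib.SequenceMatcher` and its `get_opcodes()` method come from the description above.

```python
import re
from difflib import SequenceMatcher


def normalize(word: str) -> str:
    """Lowercase and drop punctuation so 'creep,' and 'Creep' compare equal."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


# Hypothetical ASR output with one misheard word, and the reference lyrics.
asr = ["but", "I'm", "a", "creek", "I'm", "a", "weirdo"]
ref = ["But", "I'm", "a", "creep,", "I'm", "a", "weirdo."]

matcher = SequenceMatcher(
    a=[normalize(w) for w in asr],
    b=[normalize(w) for w in ref],
    autojunk=False,  # disable the "popular element" heuristic for repeated words
)
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(tag, asr[i1:i2], ref[j1:j2])
```

Each opcode says how a slice of the ASR words maps onto a slice of the reference words: `equal` slices can take timestamps directly, while `replace`, `insert`, and `delete` slices need interpolation or are dropped.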
|
|
| ### 6. Timing Refinement (Audio Analysis) |
| - **Onset detection**: Spectral flux + librosa onset strength (onset detection function, ODF) → snap word starts to actual sound onsets |
| - **Energy envelope**: RMS decay → find precise word endings |
| - **Silence gaps**: Detect inter-word pauses → refine boundaries |
| - **Backtracking**: Snaps to the energy trough preceding each onset (true word start) |
| - Result: sub-10 ms timing resolution (256 samples / 44100 Hz ≈ 5.8 ms per frame at hop=256) |
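
The frame-resolution figure follows directly from the hop size and the sample rate. A quick check, with helper names that are illustrative only:

```python
def frame_ms(sample_rate: int, hop: int) -> float:
    """Duration of one analysis hop in milliseconds."""
    return 1000.0 * hop / sample_rate


def quantize(t: float, sample_rate: int, hop: int) -> float:
    """Snap a time in seconds to the nearest frame boundary."""
    return round(t * sample_rate / hop) * hop / sample_rate


resolution = frame_ms(44100, 256)      # about 5.8 ms per frame
snapped = quantize(1.2345, 44100, 256)
error_ms = abs(snapped - 1.2345) * 1000.0
```

Snapping to the nearest frame moves a timestamp by at most half a frame, roughly 2.9 ms at these settings. This bounds the quantization error only; how close the detected onset is to the true word start depends on the audio.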
|
|
| ## Installation |
|
|
| ```bash |
| # Core (separation + refinement) |
| pip install lyric-sync |
| |
| # With WhisperX transcription (recommended) |
| pip install "lyric-sync[whisperx]" |
| |
| # With song identification |
| pip install "lyric-sync[identify]" |
| |
| # Everything |
| pip install "lyric-sync[all]" |
| |
| # System dependencies: Chromaprint (provides fpcalc for AcoustID fingerprinting) and FFmpeg |
| # Ubuntu/Debian: |
| sudo apt-get install libchromaprint-tools ffmpeg |
| # macOS: |
| brew install chromaprint ffmpeg |
| ``` |
|
|
| ## Usage |
|
|
| ### CLI |
|
|
| ```bash |
| # Full automatic (identify + fetch lyrics + sync) |
| lyric-sync song.mp3 --acoustid-key YOUR_KEY -v |
| |
| # With known metadata (faster, skips fingerprinting) |
| lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc |
| |
| # JSON output for apps |
| lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json |
| |
| # ASS karaoke subtitles |
| lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass |
| |
| # CPU-only processing (slower but no GPU needed) |
| lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song" |
| ``` |
|
|
| ### Python API |
|
|
| ```python |
| from lyric_sync import LyricSyncPipeline |
| |
| # Initialize |
| pipeline = LyricSyncPipeline( |
| acoustid_key="YOUR_ACOUSTID_KEY", # optional |
| device="cuda", # or "cpu" |
| ) |
| |
| # Full auto |
| result = pipeline.sync("song.mp3") |
| |
| # With known metadata |
| result = pipeline.sync( |
| "song.mp3", |
| artist="Radiohead", |
| title="Creep", |
| ) |
| |
| # Access results |
| print(result.song) # SongIdentification(title=..., artist=...) |
| print(result.quality_score) # 0.85 (0-1 quality estimate) |
| |
| # Export |
| print(result.to_lrc()) # Enhanced LRC with word-level timestamps |
| print(result.to_json()) # JSON array of {word, start, end, confidence} |
| print(result.to_srt()) # SRT subtitles |
| print(result.to_ass()) # ASS karaoke with \k tags |
| ``` |
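
Consumers of the JSON export usually need to find the word being sung at a given playback time. A minimal sketch, assuming the `{word, start, end, confidence}` shape noted in the comment above; the sample payload and the `word_at` helper are made up for illustration:

```python
import json
from bisect import bisect_right

# Hypothetical payload; in real use this would be result.to_json().
payload = """[
  {"word": "But", "start": 0.50, "end": 0.72, "confidence": 0.98},
  {"word": "I'm", "start": 0.80, "end": 1.05, "confidence": 0.95},
  {"word": "a", "start": 1.05, "end": 1.20, "confidence": 0.91}
]"""
words = json.loads(payload)
starts = [w["start"] for w in words]  # assumed sorted by start time


def word_at(t: float):
    """Return the word active at time t (seconds), or None during a gap."""
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i]["end"]:
        return words[i]["word"]
    return None
```

Binary search keeps the lookup cheap enough to call on every UI refresh, even for long songs.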
|
|
| ### Step-by-Step (Advanced) |
|
|
| ```python |
| from lyric_sync.separate import VocalSeparator |
| from lyric_sync.transcribe import transcribe_vocals |
| from lyric_sync.lyrics import fetch_lyrics |
| from lyric_sync.align import align_words |
| from lyric_sync.refine import refine_timings |
| |
| # 1. Separate vocals |
| separator = VocalSeparator(device="cuda") |
| vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000) |
| vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3") |
| |
| # 2. Transcribe |
| transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx") |
| |
| # 3. Fetch lyrics |
| lyrics = fetch_lyrics(artist="Radiohead", title="Creep") |
| |
| # 4. Align |
| aligned_words, stats = align_words( |
| asr_words=transcript.words, |
| ref_words=lyrics.words, |
| ) |
| |
| # 5. Refine |
| refined_words = refine_timings(vocals_full, sr_full, aligned_words) |
| ``` |
|
|
| ## Output Formats |
|
|
| | Format | Description | Use Case | |
| |--------|-------------|----------| |
| | `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync | |
| | `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players | |
| | `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use | |
| | `srt` | Standard SRT subtitles | Video players | |
| | `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing | |
|
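
As an illustration of the enhanced `lrc` layout in the table above, the following sketch formats one line by hand. `lrc_time` and `enhanced_lrc_line` are hypothetical helpers, not part of the package API, and the exact spacing the package emits may differ.

```python
def lrc_time(seconds: float) -> str:
    """Format seconds as MM:SS.cc (centisecond resolution)."""
    total_cs = round(seconds * 100)
    minutes, rest = divmod(total_cs, 6000)
    secs, cs = divmod(rest, 100)
    return f"{minutes:02d}:{secs:02d}.{cs:02d}"


def enhanced_lrc_line(words) -> str:
    """words: non-empty list of (text, start_seconds) for one lyric line."""
    line_start = words[0][1]
    parts = [f"<{lrc_time(start)}> {text}" for text, start in words]
    return f"[{lrc_time(line_start)}] " + " ".join(parts)


line = enhanced_lrc_line([("But", 12.5), ("I'm", 12.84)])
```

Note that LRC timestamps carry centiseconds only, so this format rounds word times to 10 ms; the `json` export is the better choice when the full timing resolution matters.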
|
| ## Configuration |
|
|
| ### Environment Variables |
|
|
| | Variable | Description | |
| |----------|-------------| |
| | `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) | |
| | `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) | |
|
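
Whether the package reads these variables automatically is not stated here, so a cautious approach is to read them yourself and pass them in explicitly. A small sketch (the `load_keys` helper is hypothetical):

```python
import os


def load_keys(env=None) -> dict:
    """Collect optional API credentials; missing or empty values become None."""
    env = os.environ if env is None else env
    return {
        "acoustid_key": env.get("ACOUSTID_API_KEY") or None,
        "genius_token": env.get("GENIUS_TOKEN") or None,
    }


keys = load_keys()
```

The AcoustID value can then be handed to the `acoustid_key` argument shown in the Python API example, for instance `LyricSyncPipeline(acoustid_key=keys["acoustid_key"])`.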
|
| ### Hardware Requirements |
|
|
| | Component | GPU (CUDA) | CPU | |
| |-----------|-----------|-----| |
| | Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower | |
| | WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower | |
| | **Total** | **~10-12 GB VRAM** | **~16 GB RAM** | |
| | Processing time (4min song) | ~30-60s | ~5-10 min | |
| |
| ### Transcription Backends |
| |
| | Backend | Quality (singing) | Speed | Dependencies | |
| |---------|----------|-------|--------------| |
| | **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` | |
| | Whisper (pipeline) | Good (attention-based) | Fast | `transformers` | |
| | Granite Speech | Unknown (speech-trained) | Medium | `transformers` | |
| |
| ## How It Works (Technical) |
| |
| ### Alignment Algorithm |
| |
| The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps |
| on the *correct* lyrics. We solve this with sequence alignment: |
| |
| 1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions) |
| 2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents |
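| 3. **SequenceMatcher**: Compute matching blocks between the two sequences (`difflib` recursively takes the longest contiguous match; quadratic worst case, and the result is not guaranteed to be a minimal edit script)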
| 4. **Transfer**: For `equal` blocks → direct timestamp copy. For `replace` → linear interpolation |
| 5. **Gap-fill**: Interpolate from surrounding anchors for missed words |
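
The transfer and gap-fill steps can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the package's code: `transfer_timings` is a hypothetical name, normalization is reduced to lowercasing, and "linear interpolation" is taken to mean spreading words evenly over the available time span.

```python
from difflib import SequenceMatcher


def transfer_timings(asr, ref):
    """asr: list of (word, start, end); ref: list of reference words.

    Returns a list of (ref_word, start, end), one entry per reference word.
    """
    a = [w.lower() for w, _, _ in asr]
    b = [w.lower() for w in ref]
    out = [None] * len(ref)

    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same words on both sides: copy timestamps directly.
            for k in range(j2 - j1):
                _, start, end = asr[i1 + k]
                out[j1 + k] = (ref[j1 + k], start, end)
        elif tag == "replace":
            # Spread the reference words evenly over the replaced ASR span.
            t0, t1 = asr[i1][1], asr[i2 - 1][2]
            step = (t1 - t0) / (j2 - j1)
            for k in range(j2 - j1):
                out[j1 + k] = (ref[j1 + k], t0 + k * step, t0 + (k + 1) * step)
        # "insert" leaves None entries for gap-fill; "delete" drops extra ASR words.

    # Gap-fill: interpolate runs of missing words between surrounding anchors.
    j = 0
    while j < len(out):
        if out[j] is not None:
            j += 1
            continue
        k = j
        while k < len(out) and out[k] is None:
            k += 1
        left = out[j - 1][2] if j > 0 else None
        right = out[k][1] if k < len(out) else None
        if left is None and right is None:
            raise ValueError("no timed anchor words to interpolate from")
        left = right if left is None else left
        right = left if right is None else right
        step = (right - left) / (k - j)
        for m in range(j, k):
            out[m] = (ref[m], left + (m - j) * step, left + (m - j + 1) * step)
        j = k
    return out


timed = transfer_timings(
    [("but", 0.0, 0.2), ("creek", 0.6, 1.0), ("weirdo", 1.4, 2.0)],
    ["But", "creep", "a", "weirdo"],
)
```

Here the misheard "creek" spans 0.6 to 1.0 s, so the two reference words "creep" and "a" each receive half of that span. Interpolated words are the natural candidates for the onset-based refinement described next.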
| |
| ### Onset Detection for Refinement |
| |
| After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using: |
| |
| 1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels) |
| 2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point) |
| 3. **RMS decay**: Word ends are found where energy drops below threshold |
| 4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors |
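
The backtracking and RMS-decay steps can be illustrated on a synthetic signal with plain Python. This is a simplified sketch, not the package's refinement code: the function names, the left-aligned framing, and the 10% relative threshold are assumptions, and a real implementation would more likely use NumPy and librosa.

```python
import math


def frame_rms(samples, frame=512, hop=256):
    """RMS energy of left-aligned frames taken every `hop` samples."""
    out = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = samples[start:start + frame]
        out.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return out


def backtrack(energy, onset):
    """Nearest local energy minimum at or before `onset` (the trough)."""
    for i in range(min(onset, len(energy) - 2), 0, -1):
        if energy[i] <= energy[i - 1] and energy[i] < energy[i + 1]:
            return i
    return 0


def word_end(energy, start, rel_threshold=0.1):
    """First frame after the peak whose energy falls below a fraction of it."""
    peak = max(energy[start:])
    peak_idx = energy.index(peak, start)
    for i in range(peak_idx, len(energy)):
        if energy[i] < rel_threshold * peak:
            return i
    return len(energy)


# Synthetic "word": silence, a steady tone, then silence again.
period = [math.sin(math.pi * k / 8) for k in range(16)]
signal = [0.0] * 2048 + period * 256 + [0.0] * 2048

rms = frame_rms(signal)
start_frame = backtrack(rms, onset=9)  # coarse onset lands inside the tone
end_frame = word_end(rms, start_frame)
```

With these settings the trough is the last fully silent frame before the tone and the end is the first fully silent frame after it. Real vocals decay gradually, so the threshold would need tuning per recording.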
|
|
| ## References |
|
|
| - **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) — Forced phoneme alignment |
| - **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) — Hybrid Transformer source separation |
| - **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) — Demucs+Whisper for lyrics |
| - **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) — In-Sync timestamp training |
| - **LRCLIB**: [lrclib.net](https://lrclib.net) — Community synced lyrics database |
| - **AcoustID**: [acoustid.org](https://acoustid.org) — Open audio fingerprint database |
|
|
| ## License |
|
|
| MIT |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|