---
tags:
- ml-intern
---
# 🎵 lyric-sync
**Automatic perfect song lyric acquisition and synchronization.**
Produces word-level synchronized lyrics from an audio file, with word boundaries refined to sub-10 ms timing resolution.
## Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────────────┐
│ lyric-sync Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Input │ │ Demucs │ │ WhisperX │ │ Output │ │
│ │ Audio │───▶│ Vocals │───▶│Transcribe│───▶│ Synced │ │
│ │ (mix) │ │Separation│ │ + Timing │ │ Lyrics │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ ▲ ▲ │
│ │ │ │ │
│ ▼ │ │ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │AcoustID │───▶│ Fetch │ │Align ASR │ │ Refine │ │
│ │ Identify │ │Reference │───▶│to Lyrics │───▶│ Onsets/ │ │
│ │ Song │ │ Lyrics │ │(transfer │ │ Offsets │ │
│ └──────────┘ └──────────┘ │ timings) │ └──────────┘ │
│ │ └──────────┘ │
│ ▼ (fallback) │
│ ┌──────────┐ │
│ │Transcript│ │
│ │ Search │ │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
## Steps in Detail
### 1. Song Identification
- **Primary**: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata
- **Fallback**: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text
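
A minimal sketch of the primary path, assuming the fingerprint comes from `fpcalc -json song.mp3` and is sent to the public AcoustID v2 lookup endpoint. The helper names `build_lookup_url` and `best_recording` and the `min_score` cutoff are illustrative assumptions, not part of lyric-sync:

```python
import json
from urllib.parse import urlencode

ACOUSTID_LOOKUP_URL = "https://api.acoustid.org/v2/lookup"


def build_lookup_url(fpcalc_json: str, api_key: str) -> str:
    """Turn the JSON printed by `fpcalc -json <file>` into a lookup URL."""
    data = json.loads(fpcalc_json)
    params = {
        "client": api_key,
        "duration": int(round(data["duration"])),  # AcoustID expects whole seconds
        "fingerprint": data["fingerprint"],
        "meta": "recordings",  # ask for title/artist metadata
    }
    return f"{ACOUSTID_LOOKUP_URL}?{urlencode(params)}"


def best_recording(response: dict, min_score: float = 0.5):
    """Pick the highest-scoring result that carries recording metadata.

    Returns None when nothing usable is found, so the caller can fall
    back to transcript search. `min_score` is a hypothetical cutoff.
    """
    candidates = [
        r for r in response.get("results", [])
        if r.get("score", 0.0) >= min_score and r.get("recordings")
    ]
    if not candidates:
        return None
    top = max(candidates, key=lambda r: r["score"])
    rec = top["recordings"][0]
    artists = rec.get("artists") or []
    return {
        "title": rec.get("title"),
        "artist": artists[0]["name"] if artists else None,
        "score": top["score"],
    }
```

A `None` return is the point at which the transcript-search fallback would take over.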
### 2. Vocal Stem Separation (Demucs)
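- Uses `htdemucs_ft`, the fine-tuned Hybrid Transformer Demucs model (the Demucs authors report roughly 9.0 to 9.2 dB average SDR on MUSDB18-HQ for the fine-tuned HTDemucs variants)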
- Produces clean vocal track, dramatically improving downstream ASR accuracy
- Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing
### 3. Word-Level Transcription (WhisperX)
- **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
- Decoupled approach is robust to timing drift on stretched/sung syllables
- Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1
### 4. Reference Lyrics Acquisition
- **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps
- **syncedlyrics** (multi-source): Aggregates LRCLIB + NetEase + Musixmatch + Megalobiz
- **Genius** (fallback): Plain text lyrics, requires API key
### 5. Sequence Alignment (ASR → Reference)
- Maps imperfect ASR output onto correct reference lyrics text
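- Uses `difflib.SequenceMatcher`, which recursively finds the longest contiguous matching blocks (Ratcliff/Obershelp style) rather than computing a true LCS or a guaranteed-minimal alignment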
- Handles: exact matches (direct transfer), substitutions (interpolation), gaps
- Optional fuzzy pre-pass for near-miss ASR words and colloquial variants ("gonna" → "going to")
- Handles repeated sections (chorus/verse) via sectional alignment
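
The fuzzy pre-pass could look roughly like the sketch below. It is an assumption-laden illustration: plain string similarity from `difflib` stands in for phonetic similarity, and the function name `fuzzy_prepass` and the `threshold` and `window` parameters are hypothetical rather than taken from the lyric-sync source.

```python
from difflib import SequenceMatcher


def fuzzy_prepass(asr_words, ref_words, threshold=0.75, window=3):
    """Replace each ASR word with a similar reference word found nearby.

    For ASR word i, only reference words within `window` positions of
    the proportionally projected index are considered, so a misheard
    word is not mapped to a look-alike from a distant verse.
    """
    out = []
    n_asr, n_ref = len(asr_words), len(ref_words)
    for i, word in enumerate(asr_words):
        center = round(i * n_ref / n_asr)
        lo, hi = max(0, center - window), min(n_ref, center + window + 1)
        best, best_ratio = word, threshold
        for cand in ref_words[lo:hi]:
            ratio = SequenceMatcher(None, word, cand).ratio()
            if ratio > best_ratio:  # strictly better than the current best
                best, best_ratio = cand, ratio
        out.append(best)
    return out


asr = ["im", "a", "creap", "im", "a", "weirdo"]
ref = ["im", "a", "creep", "im", "a", "weirdo"]
print(fuzzy_prepass(asr, ref))
```

After this pass, the misheard "creap" matches the reference token exactly, so the subsequent sequence matching sees one more anchor word.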
### 6. Timing Refinement (Audio Analysis)
- **Onset detection**: Spectral flux + librosa ODF → snap word starts to actual sound onsets
- **Energy envelope**: RMS decay → find precise word endings
- **Silence gaps**: Detect inter-word pauses → refine boundaries
- **Backtracking**: Snaps to the energy trough preceding each onset (true word start)
- Result: sub-10 ms timing resolution (one hop of 256 samples at 44100 Hz is about 5.8 ms)
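
The frame-resolution figure and the snapping step can be illustrated with a small sketch. The function names and the 50 ms `max_shift` tolerance are assumptions for illustration, not values taken from lyric-sync:

```python
def frame_ms(sr: int, hop: int) -> float:
    """Duration of one analysis hop in milliseconds."""
    return 1000.0 * hop / sr


def snap_to_onset(word_start: float, onsets, max_shift: float = 0.05) -> float:
    """Move a word start to the nearest detected onset (times in seconds).

    The start is left unchanged when no onset lies within `max_shift`,
    so a missed detection cannot drag a word far from its aligned time.
    """
    if not onsets:
        return word_start
    nearest = min(onsets, key=lambda t: abs(t - word_start))
    return nearest if abs(nearest - word_start) <= max_shift else word_start


print(round(frame_ms(44100, 256), 2))  # hop of 256 samples at 44.1 kHz
print(snap_to_onset(1.23, [1.0, 1.21, 1.5]))
```

Because onsets are located on the hop grid, the hop duration bounds how finely a snapped boundary can be placed.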
## Installation
```bash
# Core (separation + refinement)
pip install lyric-sync
# With WhisperX transcription (recommended)
pip install "lyric-sync[whisperx]"
# With song identification
pip install "lyric-sync[identify]"
# Everything
pip install "lyric-sync[all]"
# System dependency: chromaprint (for AcoustID fingerprinting)
# Ubuntu/Debian:
sudo apt-get install libchromaprint-tools ffmpeg
# macOS:
brew install chromaprint ffmpeg
```
## Usage
### CLI
```bash
# Full automatic (identify + fetch lyrics + sync)
lyric-sync song.mp3 --acoustid-key YOUR_KEY -v
# With known metadata (faster, skips fingerprinting)
lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc
# JSON output for apps
lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json
# ASS karaoke subtitles
lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass
# CPU-only processing (slower but no GPU needed)
lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
```
### Python API
```python
from lyric_sync import LyricSyncPipeline
# Initialize
pipeline = LyricSyncPipeline(
    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
    device="cuda",                     # or "cpu"
)
# Full auto
result = pipeline.sync("song.mp3")
# With known metadata
result = pipeline.sync(
    "song.mp3",
    artist="Radiohead",
    title="Creep",
)
# Access results
print(result.song) # SongIdentification(title=..., artist=...)
print(result.quality_score) # 0.85 (0-1 quality estimate)
# Export
print(result.to_lrc()) # Enhanced LRC with word-level timestamps
print(result.to_json()) # JSON array of {word, start, end, confidence}
print(result.to_srt()) # SRT subtitles
print(result.to_ass()) # ASS karaoke with \kf tags
```
### Step-by-Step (Advanced)
```python
from lyric_sync.separate import VocalSeparator
from lyric_sync.transcribe import transcribe_vocals
from lyric_sync.lyrics import fetch_lyrics
from lyric_sync.align import align_words
from lyric_sync.refine import refine_timings
# 1. Separate vocals
separator = VocalSeparator(device="cuda")
vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")
# 2. Transcribe
transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")
# 3. Fetch lyrics
lyrics = fetch_lyrics(artist="Radiohead", title="Creep")
# 4. Align
aligned_words, stats = align_words(
    asr_words=transcript.words,
    ref_words=lyrics.words,
)
# 5. Refine
refined_words = refine_timings(vocals_full, sr_full, aligned_words)
```
## Output Formats
| Format | Description | Use Case |
|--------|-------------|----------|
| `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
| `srt` | Standard SRT subtitles | Video players |
| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |
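
To make the enhanced LRC layout in the table concrete, here is a small formatting sketch. It assumes each word is a `(text, start_seconds)` pair and that the line tag reuses the first word's start time. The helper names are illustrative and are not the library's exporter:

```python
def lrc_time(seconds: float) -> str:
    """Format a time in seconds as MM:SS.cc (centiseconds)."""
    total_cs = int(round(seconds * 100))
    minutes, rem_cs = divmod(total_cs, 6000)
    secs, cs = divmod(rem_cs, 100)
    return f"{minutes:02d}:{secs:02d}.{cs:02d}"


def enhanced_lrc_line(words) -> str:
    """Build one enhanced LRC line from a list of (text, start_seconds)."""
    line_tag = f"[{lrc_time(words[0][1])}]"
    body = " ".join(f"<{lrc_time(start)}> {text}" for text, start in words)
    return f"{line_tag} {body}"


print(enhanced_lrc_line([("When", 12.34), ("you", 12.80)]))
```

Rounding to centiseconds happens once on the total, which avoids producing an invalid value such as `SS.100` near second boundaries.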
## Configuration
### Environment Variables
| Variable | Description |
|----------|-------------|
| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |
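
One way to wire these variables into your own script is sketched below. Whether lyric-sync reads them automatically is not stated here, so the sketch reads them explicitly; `load_api_keys` is a hypothetical helper:

```python
import os


def load_api_keys(env=None) -> dict:
    """Collect optional API keys, mapping unset or empty values to None."""
    env = os.environ if env is None else env
    return {
        "acoustid_key": env.get("ACOUSTID_API_KEY") or None,
        "genius_token": env.get("GENIUS_TOKEN") or None,
    }


keys = load_api_keys()
# The AcoustID key could then be passed on, for example:
#   LyricSyncPipeline(acoustid_key=keys["acoustid_key"])
```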
### Hardware Requirements
| Component | GPU (CUDA) | CPU |
|-----------|-----------|-----|
| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
| **Total** | **~10-12 GB VRAM** | **~16 GB RAM** |
| Processing time (4min song) | ~30-60s | ~5-10 min |
### Transcription Backends
| Backend | Quality (singing) | Speed | Dependencies |
|---------|----------|-------|--------------|
| **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
| Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
| Granite Speech | Unknown (speech-trained) | Medium | `transformers` |
## How It Works (Technical)
### Alignment Algorithm
The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps
on the *correct* lyrics. We solve this with sequence alignment:
1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions)
2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents
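3. **SequenceMatcher**: Find matching blocks between the two sequences (longest contiguous matches applied recursively, Ratcliff/Obershelp style; worst case O(n²), not guaranteed to be a minimal edit script)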
4. **Transfer**: For `equal` blocks → direct timestamp copy. For `replace` → linear interpolation
5. **Gap-fill**: Interpolate from surrounding anchors for missed words
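
Steps 3 to 5 can be sketched with `difflib` opcodes as follows. This is a simplified illustration under stated assumptions, not the library's `align_words`: ASR words are `(word, start, end)` tuples, both sequences are already normalized, and `transfer_timings` and `fill_gaps` are hypothetical names.

```python
from difflib import SequenceMatcher


def transfer_timings(asr, ref_words):
    """Map ASR timings onto reference words.

    asr: list of (word, start, end); ref_words: list of str.
    Returns a list of (ref_word, start, end).
    """
    asr_tokens = [w for w, _, _ in asr]
    matcher = SequenceMatcher(None, asr_tokens, ref_words, autojunk=False)
    out = [(w, None, None) for w in ref_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Direct copy: word i in the ASR is word j in the reference
            for i, j in zip(range(i1, i2), range(j1, j2)):
                out[j] = (ref_words[j], asr[i][1], asr[i][2])
        elif tag == "replace":
            # Spread the ASR time span evenly over the reference words
            t0, t1 = asr[i1][1], asr[i2 - 1][2]
            step = (t1 - t0) / (j2 - j1)
            for k, j in enumerate(range(j1, j2)):
                out[j] = (ref_words[j], t0 + k * step, t0 + (k + 1) * step)
        # "delete": extra ASR words are dropped
        # "insert": reference words stay untimed until gap-fill
    return fill_gaps(out)


def fill_gaps(words):
    """Interpolate runs of untimed words between neighbouring anchors."""
    out = list(words)
    n, j = len(out), 0
    while j < n:
        if out[j][1] is not None:
            j += 1
            continue
        k = j
        while k < n and out[k][1] is None:
            k += 1
        left = out[j - 1][2] if j > 0 else None   # end of previous timed word
        right = out[k][1] if k < n else None      # start of next timed word
        if left is None and right is None:
            break  # nothing was aligned at all
        # With a single anchor, words collapse onto it (zero duration)
        left = right if left is None else left
        right = left if right is None else right
        step = (right - left) / (k - j)
        for m in range(j, k):
            out[m] = (out[m][0], left + (m - j) * step, left + (m - j + 1) * step)
        j = k
    return out


asr = [("so", 0.0, 0.4), ("run", 1.0, 1.5)]
print(transfer_timings(asr, ["so", "you", "run"]))
```

In the usage example, "you" was missed by the ASR, so it is placed in the gap between the end of "so" and the start of "run".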
### Onset Detection for Refinement
After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using:
1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point)
3. **RMS decay**: Word ends are found where energy drops below threshold
4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors
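
Steps 2 and 3 can be sketched in pure Python on an RMS energy envelope. This is a simplified stand-in for the audio analysis (librosa offers a comparable `librosa.onset.onset_backtrack`); the function names, the `ratio` threshold, and the `limit_frame` argument are assumptions for illustration:

```python
import math


def rms_envelope(samples, frame=512, hop=256):
    """Frame-wise RMS energy of a mono signal given as a list of floats."""
    if not samples:
        return []
    env = []
    for start in range(0, max(1, len(samples) - frame + 1), hop):
        chunk = samples[start:start + frame]
        env.append(math.sqrt(sum(x * x for x in chunk) / len(chunk)))
    return env


def backtrack(env, onset_frame):
    """Walk left from an onset while energy keeps falling (local trough)."""
    i = onset_frame
    while i > 0 and env[i - 1] <= env[i]:
        i -= 1
    return i


def find_word_end(env, start_frame, limit_frame, ratio=0.1):
    """First frame after the word's peak where RMS falls below ratio * peak.

    `limit_frame` is the next word's start, so the search cannot run
    into the following word. Returns `limit_frame` if energy never drops.
    """
    segment = env[start_frame:limit_frame]
    if not segment:
        return start_frame
    peak_idx = max(range(len(segment)), key=segment.__getitem__)
    threshold = ratio * segment[peak_idx]
    for offset in range(peak_idx, len(segment)):
        if segment[offset] < threshold:
            return start_frame + offset
    return limit_frame


env = [0.5, 0.2, 0.05, 0.3, 0.9, 1.0, 0.6, 0.05, 0.02, 0.4]
start = backtrack(env, 5)            # onset detected at the energy peak
end = find_word_end(env, start, 9)   # next word starts at frame 9
print(start, end)
```

Frame indices convert to seconds as `frame * hop / sr`, which is where the per-hop timing resolution comes from.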
## References
- **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) — Forced phoneme alignment
- **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) — Hybrid Transformer source separation
- **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) — Demucs+Whisper for lyrics
- **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) — In-Sync timestamp training
- **LRCLIB**: [lrclib.net](https://lrclib.net) — Community synced lyrics database
- **AcoustID**: [acoustid.org](https://acoustid.org) — Open audio fingerprint database
## License
MIT
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern