File size: 11,227 Bytes

---
tags:
- ml-intern
---
# 🎵 lyric-sync

**Automatic perfect song lyric acquisition and synchronization.**

Produces word-level synchronized lyrics with sub-10ms precision from any audio file.

## Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        lyric-sync Pipeline                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐         │
│  │  Input   │    │  Demucs  │    │ WhisperX │    │  Output  │         │
│  │  Audio   │───▶│  Vocals  │───▶│Transcribe│───▶│  Synced  │         │
│  │  (mix)   │    │Separation│    │ + Timing │    │  Lyrics  │         │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘         │
│       │                                ▲               ▲               │
│       │                                │               │               │
│       ▼                                │               │               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐         │
│  │AcoustID  │───▶│  Fetch   │    │Align ASR │    │  Refine  │         │
│  │ Identify │    │Reference │───▶│to Lyrics │───▶│ Onsets/  │         │
│  │  Song    │    │  Lyrics  │    │(transfer │    │ Offsets  │         │
│  └──────────┘    └──────────┘    │ timings) │    └──────────┘         │
│       │                          └──────────┘                          │
│       ▼ (fallback)                                                     │
│  ┌──────────┐                                                          │
│  │Transcript│                                                          │
│  │ Search   │                                                          │
│  └──────────┘                                                          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

## Steps in Detail

### 1. Song Identification
- **Primary**: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata
- **Fallback**: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text

### 2. Vocal Stem Separation (Demucs)
- Uses `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
- Produces clean vocal track, dramatically improving downstream ASR accuracy
- Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing

### 3. Word-Level Transcription (WhisperX)
- **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
- Decoupled approach is robust to timing drift on stretched/sung syllables
- Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1

### 4. Reference Lyrics Acquisition
- **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps
- **syncedlyrics** (multi-source): Aggregates Lrclib + NetEase + Musixmatch + Megalobiz
- **Genius** (fallback): Plain text lyrics, requires API key

### 5. Sequence Alignment (ASR → Reference)
- Maps imperfect ASR output onto correct reference lyrics text
- Uses `difflib.SequenceMatcher` (LCS-based global alignment)
- Handles: exact matches (direct transfer), substitutions (interpolation), gaps
- Optional fuzzy pre-pass for phonetic ASR errors ("gonna" → "going to")
- Handles repeated sections (chorus/verse) via sectional alignment

### 6. Timing Refinement (Audio Analysis)
- **Onset detection**: Spectral flux + librosa ODF → snap word starts to actual sound onsets
- **Energy envelope**: RMS decay → find precise word endings
- **Silence gaps**: Detect inter-word pauses → refine boundaries
- **Backtracking**: Snaps to the energy trough preceding each onset (true word start)
- Result: sub-10ms precision (5.8ms frame resolution at 44100Hz, hop=256)

## Installation

```bash
# Core (separation + refinement)
pip install lyric-sync

# With WhisperX transcription (recommended)
pip install lyric-sync[whisperx]

# With song identification
pip install lyric-sync[identify]

# Everything
pip install lyric-sync[all]

# System dependency: chromaprint (for AcoustID fingerprinting)
# Ubuntu/Debian:
sudo apt-get install chromaprint-tools ffmpeg
# macOS:
brew install chromaprint ffmpeg
```

## Usage

### CLI

```bash
# Full automatic (identify + fetch lyrics + sync)
lyric-sync song.mp3 --acoustid-key YOUR_KEY -v

# With known metadata (faster, skips fingerprinting)
lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc

# JSON output for apps
lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json

# ASS karaoke subtitles
lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass

# CPU-only processing (slower but no GPU needed)
lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
```

### Python API

```python
from lyric_sync import LyricSyncPipeline

# Initialize
pipeline = LyricSyncPipeline(
    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
    device="cuda",                      # or "cpu"
)

# Full auto
result = pipeline.sync("song.mp3")

# With known metadata
result = pipeline.sync(
    "song.mp3",
    artist="Radiohead",
    title="Creep",
)

# Access results
print(result.song)           # SongIdentification(title=..., artist=...)
print(result.quality_score)  # 0.85 (0-1 quality estimate)

# Export
print(result.to_lrc())      # Enhanced LRC with word-level timestamps
print(result.to_json())     # JSON array of {word, start, end, confidence}
print(result.to_srt())      # SRT subtitles
print(result.to_ass())      # ASS karaoke with \k tags
```

### Step-by-Step (Advanced)

```python
from lyric_sync.separate import VocalSeparator
from lyric_sync.transcribe import transcribe_vocals
from lyric_sync.lyrics import fetch_lyrics
from lyric_sync.align import align_words
from lyric_sync.refine import refine_timings

# 1. Separate vocals
separator = VocalSeparator(device="cuda")
vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")

# 2. Transcribe
transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")

# 3. Fetch lyrics
lyrics = fetch_lyrics(artist="Radiohead", title="Creep")

# 4. Align
aligned_words, stats = align_words(
    asr_words=transcript.words,
    ref_words=lyrics.words,
)

# 5. Refine
refined_words = refine_timings(vocals_full, sr_full, aligned_words)
```

## Output Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
| `srt` | Standard SRT subtitles | Video players |
| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |

## Configuration

### Environment Variables

| Variable | Description |
|----------|-------------|
| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |

### Hardware Requirements

| Component | GPU (CUDA) | CPU |
|-----------|-----------|-----|
| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
| **Total** | **~10-12 GB VRAM** | **~16 GB RAM** |
| Processing time (4min song) | ~30-60s | ~5-10 min |

### Transcription Backends

| Backend | Quality (singing) | Speed | Dependencies |
|---------|----------|-------|--------------|
| **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
| Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
| Granite Speech | Unknown (speech-trained) | Medium | `transformers` |

## How It Works (Technical)

### Alignment Algorithm

The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps
on the *correct* lyrics. We solve this with sequence alignment:

1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions)
2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents
3. **SequenceMatcher**: Compute optimal global alignment (LCS-based, O(n²))
4. **Transfer**: For `equal` blocks → direct timestamp copy. For `replace` → linear interpolation
5. **Gap-fill**: Interpolate from surrounding anchors for missed words

### Onset Detection for Refinement

After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using:

1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point)
3. **RMS decay**: Word ends are found where energy drops below threshold
4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors

## References

- **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) — Forced phoneme alignment
- **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) — Hybrid Transformer source separation
- **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) — Demucs+Whisper for lyrics
- **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) — In-Sync timestamp training
- **LRCLIB**: [lrclib.net](https://lrclib.net) — Community synced lyrics database
- **AcoustID**: [acoustid.org](https://acoustid.org) — Open audio fingerprint database

## License

MIT

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern