Upload README.md
README.md
CHANGED
@@ -1,26 +1,246 @@
-tags:
-- ml-intern
----
-## Generated by ML Intern
-from

# 🎵 lyric-sync

**Automatic perfect song lyric acquisition and synchronization.**

Produces word-level synchronized lyrics with sub-10ms precision from any audio file.

## Pipeline Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                       lyric-sync Pipeline                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│  │  Input   │    │  Demucs  │    │ WhisperX │    │  Output  │    │
│  │  Audio   │───▶│  Vocals  │───▶│Transcribe│───▶│  Synced  │    │
│  │  (mix)   │    │Separation│    │ + Timing │    │  Lyrics  │    │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘    │
│       │                               ▲               ▲          │
│       │                               │               │          │
│       ▼                               │               │          │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│  │ AcoustID │───▶│  Fetch   │    │Align ASR │    │  Refine  │    │
│  │ Identify │    │Reference │───▶│to Lyrics │───▶│ Onsets/  │    │
│  │   Song   │    │  Lyrics  │    │(transfer │    │ Offsets  │    │
│  └──────────┘    └──────────┘    │ timings) │    └──────────┘    │
│       │                          └──────────┘                    │
│       ▼ (fallback)                                               │
│  ┌──────────┐                                                    │
│  │Transcript│                                                    │
│  │  Search  │                                                    │
│  └──────────┘                                                    │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

## Steps in Detail

### 1. Song Identification
- **Primary**: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata
- **Fallback**: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text

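The primary identification path can be sketched as follows. This is a minimal illustration, assuming the JSON printed by `fpcalc -json` (an object with `duration` and `fingerprint`) and the public AcoustID v2 lookup endpoint. The helper name `build_acoustid_lookup_url` is hypothetical and is not part of lyric-sync; it only builds the request URL and makes no network call.

```python
import json
from urllib.parse import urlencode

ACOUSTID_LOOKUP_URL = "https://api.acoustid.org/v2/lookup"


def build_acoustid_lookup_url(fpcalc_json: str, api_key: str) -> str:
    """Build an AcoustID lookup URL from the JSON printed by `fpcalc -json`.

    Hypothetical helper for illustration only.
    """
    data = json.loads(fpcalc_json)
    params = {
        "client": api_key,                          # AcoustID application key
        "duration": int(round(data["duration"])),   # whole seconds
        "fingerprint": data["fingerprint"],         # Chromaprint fingerprint
        "meta": "recordings",                       # ask for recording metadata
    }
    return f"{ACOUSTID_LOOKUP_URL}?{urlencode(params)}"


# Made-up fingerprint string, used only to show the shape of the request.
sample = '{"duration": 238.6, "fingerprint": "AQADtEmS"}'
url = build_acoustid_lookup_url(sample, api_key="YOUR_KEY")
print(url)
```

The response would then be matched against MusicBrainz metadata. That step is omitted here because it depends on the live service.
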
### 2. Vocal Stem Separation (Demucs)
- Uses `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
- Produces a clean vocal track, which substantially improves downstream ASR accuracy
- Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing

### 3. Word-Level Transcription (WhisperX)
- **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
- Decoupling transcription from alignment keeps timings robust to drift on stretched or sung syllables
- Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1

### 4. Reference Lyrics Acquisition
- **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps
- **syncedlyrics** (multi-source): Aggregates LRCLIB + NetEase + Musixmatch + Megalobiz
- **Genius** (fallback): Plain text lyrics, requires API key

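Whichever source supplies them, synced lyrics arrive as LRC text, so a parsing step is needed before alignment. The sketch below assumes the standard `[MM:SS.cc] text` line form. `parse_lrc` is a hypothetical helper, not the lyric-sync API, and the sample lines are placeholders.

```python
import re

# One timestamped LRC line: [MM:SS] or [MM:SS.cc] followed by the lyric text.
LRC_LINE = re.compile(r"^\[(\d+):(\d{2})(?:\.(\d{1,3}))?\](.*)$")


def parse_lrc(text: str) -> list[tuple[float, str]]:
    """Return (start_seconds, line_text) pairs from standard LRC text."""
    entries = []
    for raw in text.splitlines():
        match = LRC_LINE.match(raw.strip())
        if not match:
            continue  # skips metadata tags such as [ar:...] and blank lines
        minutes, seconds, fraction, lyric = match.groups()
        start = int(minutes) * 60 + int(seconds)
        if fraction:
            start += int(fraction) / (10 ** len(fraction))
        entries.append((start, lyric.strip()))
    return entries


sample_lrc = "[ar:Some Artist]\n[00:12.50] first line\n[01:02.03] second line\n"
parsed = parse_lrc(sample_lrc)
print(parsed)
```

Line-level timestamps like these are useful as coarse anchors. Word-level timing still comes from the ASR alignment step.
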
### 5. Sequence Alignment (ASR → Reference)
- Maps imperfect ASR output onto the correct reference lyrics text
- Uses `difflib.SequenceMatcher` (global alignment built from longest matching blocks)
- Handles: exact matches (direct transfer), substitutions (interpolation), gaps
- Optional fuzzy pre-pass for phonetic ASR errors ("gonna" → "going to")
- Handles repeated sections (chorus/verse) via sectional alignment

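A fuzzy pre-pass along these lines could be sketched with the standard library. Note the substitution: this uses `difflib.get_close_matches`, which compares spelling, not phonetics, and it only handles one-to-one word mappings (not "gonna" → "going to"). The function name `fuzzy_prepass` and the `cutoff` default are assumptions for illustration.

```python
from difflib import get_close_matches


def fuzzy_prepass(asr_words, ref_words, cutoff=0.8):
    """Replace each ASR word with its closest reference word, if close enough.

    Spelling-similarity stand-in for a phonetic matcher (illustrative only).
    """
    vocab = sorted(set(ref_words))
    mapped = []
    for word in asr_words:
        if word in vocab:
            mapped.append(word)  # already a reference word
            continue
        candidates = get_close_matches(word, vocab, n=1, cutoff=cutoff)
        mapped.append(candidates[0] if candidates else word)
    return mapped


mapped = fuzzy_prepass(["specal", "the"], ["special", "angel"])
print(mapped)
```

Running the pre-pass before alignment increases the number of exact matches, which in turn gives the aligner more anchors for timestamp transfer.
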
### 6. Timing Refinement (Audio Analysis)
- **Onset detection**: Spectral flux + librosa onset detection function (ODF) → snap word starts to actual sound onsets
- **Energy envelope**: RMS decay → find precise word endings
- **Silence gaps**: Detect inter-word pauses → refine boundaries
- **Backtracking**: Snaps to the energy trough preceding each onset (true word start)
- Result: sub-10 ms precision (about 5.8 ms frame resolution at 44,100 Hz, hop=256)

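The frame-resolution figure follows directly from the hop length and sample rate. A quick check (the helper name is illustrative):

```python
def frame_resolution_ms(sample_rate: int, hop_length: int) -> float:
    """Duration of one analysis hop in milliseconds."""
    return hop_length / sample_rate * 1000.0


resolution = frame_resolution_ms(sample_rate=44100, hop_length=256)
print(f"{resolution:.1f} ms")  # 5.8 ms
```

This is the spacing of the analysis frames, so it bounds how finely an onset can be placed. It is not by itself a measurement of end-to-end timing error.
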
## Installation

```bash
# Core (separation + refinement)
pip install lyric-sync

# With WhisperX transcription (recommended)
pip install "lyric-sync[whisperx]"

# With song identification
pip install "lyric-sync[identify]"

# Everything
pip install "lyric-sync[all]"

# System dependencies: Chromaprint (fpcalc, for AcoustID fingerprinting) and ffmpeg
# Ubuntu/Debian:
sudo apt-get install libchromaprint-tools ffmpeg
# macOS:
brew install chromaprint ffmpeg
```

## Usage

### CLI

```bash
# Full automatic (identify + fetch lyrics + sync)
lyric-sync song.mp3 --acoustid-key YOUR_KEY -v

# With known metadata (faster, skips fingerprinting)
lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc

# JSON output for apps
lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json

# ASS karaoke subtitles
lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass

# CPU-only processing (slower but no GPU needed)
lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
```

### Python API

```python
from lyric_sync import LyricSyncPipeline

# Initialize
pipeline = LyricSyncPipeline(
    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
    device="cuda",  # or "cpu"
)

# Full auto
result = pipeline.sync("song.mp3")

# With known metadata
result = pipeline.sync(
    "song.mp3",
    artist="Radiohead",
    title="Creep",
)

# Access results
print(result.song)           # SongIdentification(title=..., artist=...)
print(result.quality_score)  # e.g. 0.85 (quality estimate between 0 and 1)

# Export
print(result.to_lrc())   # Enhanced LRC with word-level timestamps
print(result.to_json())  # JSON array of {word, start, end, confidence}
print(result.to_srt())   # SRT subtitles
print(result.to_ass())   # ASS karaoke with \k tags
```

### Step-by-Step (Advanced)

```python
from lyric_sync.separate import VocalSeparator
from lyric_sync.transcribe import transcribe_vocals
from lyric_sync.lyrics import fetch_lyrics
from lyric_sync.align import align_words
from lyric_sync.refine import refine_timings

# 1. Separate vocals
separator = VocalSeparator(device="cuda")
vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")

# 2. Transcribe
transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")

# 3. Fetch lyrics
lyrics = fetch_lyrics(artist="Radiohead", title="Creep")

# 4. Align
aligned_words, stats = align_words(
    asr_words=transcript.words,
    ref_words=lyrics.words,
)

# 5. Refine
refined_words = refine_timings(vocals_full, sr_full, aligned_words)
```

## Output Formats

| Format | Description | Use Case |
|--------|-------------|----------|
| `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
| `srt` | Standard SRT subtitles | Video players |
| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |

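To make the enhanced LRC layout in the table concrete, here is a small sketch that formats word timings into one `[MM:SS.cc] <MM:SS.cc> word ...` line. The helper names and the word-dict shape are assumptions for illustration; the library's own exporter (`result.to_lrc()`) may differ in detail.

```python
def lrc_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.cc (centisecond resolution)."""
    centis = int(round(seconds * 100))
    minutes, rem = divmod(centis, 6000)
    secs, cc = divmod(rem, 100)
    return f"{minutes:02d}:{secs:02d}.{cc:02d}"


def enhanced_lrc_line(words) -> str:
    """Build one enhanced LRC line from dicts with 'word' and 'start' keys."""
    if not words:
        return ""
    line_tag = f"[{lrc_timestamp(words[0]['start'])}]"
    parts = [f"<{lrc_timestamp(w['start'])}> {w['word']}" for w in words]
    return f"{line_tag} " + " ".join(parts)


line = enhanced_lrc_line([
    {"word": "hello", "start": 12.5, "end": 12.9},
    {"word": "world", "start": 63.04, "end": 63.5},
])
print(line)  # [00:12.50] <00:12.50> hello <01:03.04> world
```
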
## Configuration

### Environment Variables

| Variable | Description |
|----------|-------------|
| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |

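A minimal sketch of reading these variables in your own script. `load_api_keys` is a hypothetical helper, and whether lyric-sync picks the variables up automatically is not confirmed here, so the sketch returns them for explicit passing.

```python
import os


def load_api_keys(env=None):
    """Collect optional API keys; missing or empty values become None."""
    env = os.environ if env is None else env
    return {
        "acoustid_key": env.get("ACOUSTID_API_KEY") or None,
        "genius_token": env.get("GENIUS_TOKEN") or None,
    }


# A plain dict stands in for os.environ in this example.
keys = load_api_keys({"ACOUSTID_API_KEY": "abc123"})
print(keys)
```

The `acoustid_key` value could then be passed as `LyricSyncPipeline(acoustid_key=...)`, as in the Python API example.
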
### Hardware Requirements

| Component | GPU (CUDA) | CPU |
|-----------|-----------|-----|
| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
| **Total** | **~10-12 GB VRAM** | **~16 GB RAM** |
| Processing time (4min song) | ~30-60s | ~5-10 min |

### Transcription Backends

| Backend | Quality (singing) | Speed | Dependencies |
|---------|-------------------|-------|--------------|
| **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
| Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
| Granite Speech | Unknown (speech-trained) | Medium | `transformers` |

## How It Works (Technical)

### Alignment Algorithm

The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps
on the *correct* lyrics. We solve this with sequence alignment:

1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions)
2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents
3. **SequenceMatcher**: Compute a global alignment from longest matching blocks (O(n²) in the worst case)
4. **Transfer**: For `equal` blocks, copy timestamps directly. For `replace` blocks, interpolate linearly across the span
5. **Gap-fill**: Interpolate from surrounding anchors for words the ASR missed

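The steps above can be sketched with `difflib.SequenceMatcher` opcodes. This is an illustrative reimplementation under stated assumptions, not the code in `lyric_sync.align`: the function names, the `(word, start, end)` tuple shape, and the `total_duration` parameter are hypothetical, and contraction expansion and the fuzzy pre-pass are left out.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase and strip punctuation (apostrophes are kept)."""
    return re.sub(r"[^\w']", "", word.lower())


def transfer_timings(asr, ref):
    """asr: list of (word, start, end); ref: list of reference words.

    Returns one (ref_word, start, end) per reference word, or None where
    the ASR produced nothing to anchor on.
    """
    a = [normalize(w) for w, _, _ in asr]
    b = [normalize(w) for w in ref]
    out = [None] * len(ref)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":  # direct timestamp copy
            for k in range(j2 - j1):
                _, start, end = asr[i1 + k]
                out[j1 + k] = (ref[j1 + k], start, end)
        elif tag == "replace":  # spread ref words evenly over the ASR span
            span_start, span_end = asr[i1][1], asr[i2 - 1][2]
            step = (span_end - span_start) / (j2 - j1)
            for k in range(j2 - j1):
                out[j1 + k] = (ref[j1 + k], span_start + k * step,
                               span_start + (k + 1) * step)
        # "insert": words the ASR missed stay None until gap-fill
        # "delete": extra ASR words are dropped
    return out


def fill_gaps(timed, ref, total_duration):
    """Interpolate runs of None between the surrounding anchors."""
    result = list(timed)
    i = 0
    while i < len(result):
        if result[i] is not None:
            i += 1
            continue
        j = i
        while j < len(result) and result[j] is None:
            j += 1
        left = result[i - 1][2] if i > 0 else 0.0
        right = result[j][1] if j < len(result) else total_duration
        step = (right - left) / (j - i)
        for k in range(i, j):
            result[k] = (ref[k], left + (k - i) * step,
                         left + (k - i + 1) * step)
        i = j
    return result


asr = [("i", 0.0, 0.2), ("wanna", 0.2, 0.6), ("go", 1.0, 1.4)]
ref = ["I", "want", "to", "go", "home"]
aligned = fill_gaps(transfer_timings(asr, ref), ref, total_duration=2.0)
for word, start, end in aligned:
    print(f"{word:5s} {start:.2f} {end:.2f}")
```

In this toy input, "I" and "go" are exact matches, "wanna" is replaced by "want to" and its span is split in two, and "home" is filled from the last anchor to the assumed end of the audio.
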
### Onset Detection for Refinement

After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using:

1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point)
3. **RMS decay**: Word ends are found where energy drops below threshold
4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors

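The snapping and backtracking ideas can be shown in pure Python, independent of how the onsets were detected. This is a sketch under assumptions: the function names and the 50 ms `max_shift` default are hypothetical, and the onset times and energy values below are made up. librosa offers a comparable backtracking utility, `librosa.onset.onset_backtrack`.

```python
from bisect import bisect_left


def snap_to_onset(word_start, onset_times, max_shift=0.05):
    """Move word_start to the nearest onset if it lies within max_shift seconds.

    onset_times must be sorted in ascending order.
    """
    idx = bisect_left(onset_times, word_start)
    candidates = onset_times[max(idx - 1, 0): idx + 1]  # neighbors on each side
    if not candidates:
        return word_start
    nearest = min(candidates, key=lambda t: abs(t - word_start))
    return nearest if abs(nearest - word_start) <= max_shift else word_start


def backtrack_to_trough(onset_frame, energy):
    """Walk backwards while energy does not increase; return the trough index."""
    frame = onset_frame
    while frame > 0 and energy[frame - 1] <= energy[frame]:
        frame -= 1
    return frame


onsets = [0.50, 1.02, 1.60]
snapped = snap_to_onset(1.00, onsets)    # close to 1.02, so it snaps
unchanged = snap_to_onset(1.30, onsets)  # no onset within 50 ms, left as-is
trough = backtrack_to_trough(4, [0.9, 0.4, 0.1, 0.3, 0.8])
print(snapped, unchanged, trough)  # 1.02 1.3 2
```

Capping the shift keeps a spurious onset (for example a breath or residual instrument bleed) from dragging a word boundary far from its aligned position.
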
## References

- **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) - Forced phoneme alignment
- **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) - Hybrid Transformer source separation
- **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) - Demucs+Whisper for lyrics
- **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) - In-Sync timestamp training
- **LRCLIB**: [lrclib.net](https://lrclib.net) - Community synced lyrics database
- **AcoustID**: [acoustid.org](https://acoustid.org) - Open audio fingerprint database

## License

MIT