rikhoffbauer2 commited on
Commit
f623d99
Β·
verified Β·
1 Parent(s): 05d6e98

Upload README.md

Browse files
Files changed (1) hide show
  1. README.md +235 -15
README.md CHANGED
@@ -1,26 +1,246 @@
1
- ---
2
- tags:
3
- - ml-intern
4
- ---
5
 
6
- # rikhoffbauer2/lyric-sync
7
 
8
- <!-- ml-intern-provenance -->
9
- ## Generated by ML Intern
10
 
11
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
12
 
13
- - Try ML Intern: https://smolagents-ml-intern.hf.space
14
- - Source code: https://github.com/huggingface/ml-intern
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
 
16
  ## Usage
17
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18
  ```python
19
- from transformers import AutoModelForCausalLM, AutoTokenizer
 
 
 
 
 
 
 
 
 
 
 
 
20
 
21
- model_id = "rikhoffbauer2/lyric-sync"
22
- tokenizer = AutoTokenizer.from_pretrained(model_id)
23
- model = AutoModelForCausalLM.from_pretrained(model_id)
 
 
 
 
 
 
 
 
24
  ```
25
 
26
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🎡 lyric-sync
 
 
 
2
 
3
+ **Automatic perfect song lyric acquisition and synchronization.**
4
 
5
+ Produces word-level synchronized lyrics with sub-10ms precision from any audio file.
 
6
 
7
+ ## Pipeline Architecture
8
 
9
+ ```
10
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
11
+ β”‚ lyric-sync Pipeline β”‚
12
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
13
+ β”‚ β”‚
14
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
15
+ β”‚ β”‚ Input β”‚ β”‚ Demucs β”‚ β”‚ WhisperX β”‚ β”‚ Output β”‚ β”‚
16
+ β”‚ β”‚ Audio │───▢│ Vocals │───▢│Transcribe│───▢│ Synced β”‚ β”‚
17
+ β”‚ β”‚ (mix) β”‚ β”‚Separationβ”‚ β”‚ + Timing β”‚ β”‚ Lyrics β”‚ β”‚
18
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
19
+ β”‚ β”‚ β–² β–² β”‚
20
+ β”‚ β”‚ β”‚ β”‚ β”‚
21
+ β”‚ β–Ό β”‚ β”‚ β”‚
22
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
23
+ β”‚ β”‚AcoustID │───▢│ Fetch β”‚ β”‚Align ASR β”‚ β”‚ Refine β”‚ β”‚
24
+ β”‚ β”‚ Identify β”‚ β”‚Reference │───▢│to Lyrics │───▢│ Onsets/ β”‚ β”‚
25
+ β”‚ β”‚ Song β”‚ β”‚ Lyrics β”‚ β”‚(transfer β”‚ β”‚ Offsets β”‚ β”‚
26
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ timings) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
27
+ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
28
+ β”‚ β–Ό (fallback) β”‚
29
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
30
+ β”‚ β”‚Transcriptβ”‚ β”‚
31
+ β”‚ β”‚ Search β”‚ β”‚
32
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
33
+ β”‚ β”‚
34
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
35
+ ```
36
+
37
+ ## Steps in Detail
38
+
39
+ ### 1. Song Identification
40
+ - **Primary**: Audio fingerprinting via Chromaprint/fpcalc β†’ AcoustID lookup β†’ MusicBrainz metadata
41
+ - **Fallback**: Transcribe vocals β†’ search lyrics databases (LRCLIB, Genius) for matching text
42
+
43
+ ### 2. Vocal Stem Separation (Demucs)
44
+ - Uses `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
45
+ - Produces clean vocal track, dramatically improving downstream ASR accuracy
46
+ - Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing
47
+
48
+ ### 3. Word-Level Transcription (WhisperX)
49
+ - **WhisperX** (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
50
+ - Decoupled approach is robust to timing drift on stretched/sung syllables
51
+ - Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1
52
+
53
+ ### 4. Reference Lyrics Acquisition
54
+ - **LRCLIB** (free, no auth): Community-maintained LRC database with synced timestamps
55
+ - **syncedlyrics** (multi-source): Aggregates Lrclib + NetEase + Musixmatch + Megalobiz
56
+ - **Genius** (fallback): Plain text lyrics, requires API key
57
+
58
+ ### 5. Sequence Alignment (ASR β†’ Reference)
59
+ - Maps imperfect ASR output onto correct reference lyrics text
60
+ - Uses `difflib.SequenceMatcher` (LCS-based global alignment)
61
+ - Handles: exact matches (direct transfer), substitutions (interpolation), gaps
62
+ - Optional fuzzy pre-pass for phonetic ASR errors ("gonna" β†’ "going to")
63
+ - Handles repeated sections (chorus/verse) via sectional alignment
64
+
65
+ ### 6. Timing Refinement (Audio Analysis)
66
+ - **Onset detection**: Spectral flux + librosa ODF β†’ snap word starts to actual sound onsets
67
+ - **Energy envelope**: RMS decay β†’ find precise word endings
68
+ - **Silence gaps**: Detect inter-word pauses β†’ refine boundaries
69
+ - **Backtracking**: Snaps to the energy trough preceding each onset (true word start)
70
+ - Result: sub-10ms precision (5.8ms frame resolution at 44100Hz, hop=256)
71
+
72
+ ## Installation
73
+
74
+ ```bash
75
+ # Core (separation + refinement)
76
+ pip install lyric-sync
77
+
78
+ # With WhisperX transcription (recommended)
79
+ pip install lyric-sync[whisperx]
80
+
81
+ # With song identification
82
+ pip install lyric-sync[identify]
83
+
84
+ # Everything
85
+ pip install lyric-sync[all]
86
+
87
+ # System dependency: chromaprint (for AcoustID fingerprinting)
88
+ # Ubuntu/Debian:
89
+ sudo apt-get install chromaprint-tools ffmpeg
90
+ # macOS:
91
+ brew install chromaprint ffmpeg
92
+ ```
93
 
94
  ## Usage
95
 
96
+ ### CLI
97
+
98
+ ```bash
99
+ # Full automatic (identify + fetch lyrics + sync)
100
+ lyric-sync song.mp3 --acoustid-key YOUR_KEY -v
101
+
102
+ # With known metadata (faster, skips fingerprinting)
103
+ lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc
104
+
105
+ # JSON output for apps
106
+ lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json
107
+
108
+ # ASS karaoke subtitles
109
+ lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass
110
+
111
+ # CPU-only processing (slower but no GPU needed)
112
+ lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
113
+ ```
114
+
115
+ ### Python API
116
+
117
+ ```python
118
+ from lyric_sync import LyricSyncPipeline
119
+
120
+ # Initialize
121
+ pipeline = LyricSyncPipeline(
122
+ acoustid_key="YOUR_ACOUSTID_KEY", # optional
123
+ device="cuda", # or "cpu"
124
+ )
125
+
126
+ # Full auto
127
+ result = pipeline.sync("song.mp3")
128
+
129
+ # With known metadata
130
+ result = pipeline.sync(
131
+ "song.mp3",
132
+ artist="Radiohead",
133
+ title="Creep",
134
+ )
135
+
136
+ # Access results
137
+ print(result.song) # SongIdentification(title=..., artist=...)
138
+ print(result.quality_score) # 0.85 (0-1 quality estimate)
139
+
140
+ # Export
141
+ print(result.to_lrc()) # Enhanced LRC with word-level timestamps
142
+ print(result.to_json()) # JSON array of {word, start, end, confidence}
143
+ print(result.to_srt()) # SRT subtitles
144
+ print(result.to_ass()) # ASS karaoke with \k tags
145
+ ```
146
+
147
+ ### Step-by-Step (Advanced)
148
+
149
  ```python
150
+ from lyric_sync.separate import VocalSeparator
151
+ from lyric_sync.transcribe import transcribe_vocals
152
+ from lyric_sync.lyrics import fetch_lyrics
153
+ from lyric_sync.align import align_words
154
+ from lyric_sync.refine import refine_timings
155
+
156
+ # 1. Separate vocals
157
+ separator = VocalSeparator(device="cuda")
158
+ vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
159
+ vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")
160
+
161
+ # 2. Transcribe
162
+ transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")
163
 
164
+ # 3. Fetch lyrics
165
+ lyrics = fetch_lyrics(artist="Radiohead", title="Creep")
166
+
167
+ # 4. Align
168
+ aligned_words, stats = align_words(
169
+ asr_words=transcript.words,
170
+ ref_words=lyrics.words,
171
+ )
172
+
173
+ # 5. Refine
174
+ refined_words = refine_timings(vocals_full, sr_full, aligned_words)
175
  ```
176
 
177
+ ## Output Formats
178
+
179
+ | Format | Description | Use Case |
180
+ |--------|-------------|----------|
181
+ | `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
182
+ | `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
183
+ | `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
184
+ | `srt` | Standard SRT subtitles | Video players |
185
+ | `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |
186
+
187
+ ## Configuration
188
+
189
+ ### Environment Variables
190
+
191
+ | Variable | Description |
192
+ |----------|-------------|
193
+ | `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
194
+ | `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |
195
+
196
+ ### Hardware Requirements
197
+
198
+ | Component | GPU (CUDA) | CPU |
199
+ |-----------|-----------|-----|
200
+ | Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
201
+ | WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
202
+ | **Total** | **~10-12 GB VRAM** | **~16 GB RAM** |
203
+ | Processing time (4min song) | ~30-60s | ~5-10 min |
204
+
205
+ ### Transcription Backends
206
+
207
+ | Backend | Quality (singing) | Speed | Dependencies |
208
+ |---------|----------|-------|--------------|
209
+ | **WhisperX** ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
210
+ | Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
211
+ | Granite Speech | Unknown (speech-trained) | Medium | `transformers` |
212
+
213
+ ## How It Works (Technical)
214
+
215
+ ### Alignment Algorithm
216
+
217
+ The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps
218
+ on the *correct* lyrics. We solve this with sequence alignment:
219
+
220
+ 1. **Normalize** both word sequences (lowercase, strip punctuation, expand contractions)
221
+ 2. **Fuzzy pre-pass**: Map phonetically similar ASR words to their reference equivalents
222
+ 3. **SequenceMatcher**: Compute optimal global alignment (LCS-based, O(nΒ²))
223
+ 4. **Transfer**: For `equal` blocks β†’ direct timestamp copy. For `replace` β†’ linear interpolation
224
+ 5. **Gap-fill**: Interpolate from surrounding anchors for missed words
225
+
226
+ ### Onset Detection for Refinement
227
+
228
+ After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using:
229
+
230
+ 1. **Fused ODF**: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
231
+ 2. **Backtrack**: Each onset is snapped to the preceding energy trough (true attack point)
232
+ 3. **RMS decay**: Word ends are found where energy drops below threshold
233
+ 4. **Silence gaps**: Inter-word pauses provide definitive boundary anchors
234
+
235
+ ## References
236
+
237
+ - **WhisperX**: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) β€” Forced phoneme alignment
238
+ - **HTDemucs**: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) β€” Hybrid Transformer source separation
239
+ - **ALT Benchmark**: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) β€” Demucs+Whisper for lyrics
240
+ - **Granite Speech**: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) β€” In-Sync timestamp training
241
+ - **LRCLIB**: [lrclib.net](https://lrclib.net) β€” Community synced lyrics database
242
+ - **AcoustID**: [acoustid.org](https://acoustid.org) β€” Open audio fingerprint database
243
+
244
+ ## License
245
+
246
+ MIT