	---
	tags:
	- ml-intern
	---
	# 🎵 lyric-sync

	Automatic song lyric acquisition and synchronization.

	Produces word-level synchronized lyrics with sub-10ms precision from any audio file.

	## Pipeline Architecture

	```
	┌─────────────────────────────────────────────────────────────────────────┐
	│                           lyric-sync Pipeline                           │
	├─────────────────────────────────────────────────────────────────────────┤
	│                                                                         │
	│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐           │
	│  │  Input   │    │  Demucs  │    │ WhisperX │    │  Output  │           │
	│  │  Audio   │───▶│  Vocals  │───▶│Transcribe│───▶│  Synced  │           │
	│  │  (mix)   │    │Separation│    │ + Timing │    │  Lyrics  │           │
	│  └──────────┘    └──────────┘    └──────────┘    └──────────┘           │
	│       │                               ▲               ▲                 │
	│       │                               │               │                 │
	│       ▼                               │               │                 │
	│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐           │
	│  │ AcoustID │───▶│  Fetch   │    │Align ASR │    │  Refine  │           │
	│  │ Identify │    │Reference │───▶│to Lyrics │───▶│ Onsets/  │           │
	│  │   Song   │    │  Lyrics  │    │(transfer │    │ Offsets  │           │
	│  └──────────┘    └──────────┘    │ timings) │    └──────────┘           │
	│       │                          └──────────┘                           │
	│       ▼ (fallback)                                                      │
	│  ┌──────────┐                                                           │
	│  │Transcript│                                                           │
	│  │  Search  │                                                           │
	│  └──────────┘                                                           │
	│                                                                         │
	└─────────────────────────────────────────────────────────────────────────┘
	```

	## Steps in Detail

	### 1. Song Identification
	- Primary: Audio fingerprinting via Chromaprint/fpcalc → AcoustID lookup → MusicBrainz metadata
	- Fallback: Transcribe vocals → search lyrics databases (LRCLIB, Genius) for matching text
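
The fallback path amounts to ranking candidate lyrics by how closely they match the transcript. The sketch below is an illustration under assumptions, not lyric-sync's actual code: `best_lyrics_match`, the song ids, and the lyric lines are all made up, and real candidates would come from LRCLIB or Genius search results.

```python
import re
from difflib import SequenceMatcher


def tokens(text: str) -> list[str]:
    """Lowercase a text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def best_lyrics_match(transcript: str, candidates: dict[str, str]) -> tuple[str, float]:
    """Return (song_id, similarity) for the candidate closest to the transcript.

    `candidates` maps a hypothetical song id to its plain lyrics text.
    """
    asr = tokens(transcript)
    scored = {
        song_id: SequenceMatcher(None, asr, tokens(lyrics), autojunk=False).ratio()
        for song_id, lyrics in candidates.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# Invented lyric lines, used only to exercise the function.
candidates = {
    "song-a": "Walking down the river, couldn't see the morning light",
    "song-b": "Dancing on the rooftop till the city falls asleep",
}
match_id, score = best_lyrics_match(
    "walking down the river could not see the morning life", candidates
)
```

A word-level similarity ratio tolerates the scattered ASR errors expected on singing, which an exact substring search would not.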

	### 2. Vocal Stem Separation (Demucs)
	- Uses `htdemucs_ft` (best available: ~9.2 dB SDR on MUSDB18-HQ)
	- Produces clean vocal track, dramatically improving downstream ASR accuracy
	- Per [arxiv:2506.15514](https://arxiv.org/abs/2506.15514): Demucs + Whisper achieves ~20% WER on singing

	### 3. Word-Level Transcription (WhisperX)
	- WhisperX (recommended): Whisper large-v2 transcription + wav2vec2 forced phoneme alignment
	- Decoupled approach is robust to timing drift on stretched/sung syllables
	- Alternative backends: Whisper (transformers pipeline), Granite Speech 4.1

	### 4. Reference Lyrics Acquisition
	- LRCLIB (free, no auth): Community-maintained LRC database with synced timestamps
	- syncedlyrics (multi-source): Aggregates LRCLIB + NetEase + Musixmatch + Megalobiz
	- Genius (fallback): Plain text lyrics, requires API key
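
Synced lyrics from these sources arrive as LRC text with `[MM:SS.cc]` line timestamps. As a rough sketch of what consuming that text involves, the hypothetical `parse_lrc` helper below (not part of lyric-sync's documented API) turns standard LRC into `(start_seconds, text)` pairs and skips metadata tags.

```python
import re

# Matches a leading [MM:SS.cc] timestamp; the fraction may have 1 to 3 digits.
LRC_LINE = re.compile(r"^\[(\d+):(\d{2})(?:\.(\d{1,3}))?\](.*)$")


def parse_lrc(lrc_text: str) -> list[tuple[float, str]]:
    """Parse standard LRC text into (start_seconds, line_text) pairs."""
    lines = []
    for raw in lrc_text.splitlines():
        match = LRC_LINE.match(raw.strip())
        if not match:
            continue  # skip metadata tags such as [ar:Artist] and blank lines
        minutes, seconds, fraction, text = match.groups()
        start = int(minutes) * 60 + int(seconds)
        if fraction:
            start += int(fraction) / (10 ** len(fraction))
        lines.append((start, text.strip()))
    return lines


sample = "[ar:Example Artist]\n[00:12.50] First line\n[01:02.05] Second line"
parsed = parse_lrc(sample)
```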

	### 5. Sequence Alignment (ASR → Reference)
	- Maps imperfect ASR output onto correct reference lyrics text
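	- Uses `difflib.SequenceMatcher` (recursive longest-contiguous-block matching in the Ratcliff/Obershelp style; a heuristic global alignment, not a guaranteed-minimal edit script)
	- Handles: exact matches (direct transfer), substitutions (interpolation), gaps (interpolation from surrounding anchors)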
	- Optional fuzzy pre-pass for phonetic ASR errors ("gonna" → "going to")
	- Handles repeated sections (chorus/verse) via sectional alignment
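
A minimal sketch of the timing transfer, assuming ASR words arrive as `(word, start, end)` tuples. The `transfer_timings` function and the sample data are illustrative, not the library's `align_words` implementation; only `difflib.SequenceMatcher` and its opcode tags are real.

```python
from difflib import SequenceMatcher

# Hypothetical ASR output: (word, start_seconds, end_seconds)
asr = [("i", 0.0, 0.2), ("wanna", 0.2, 0.6), ("hold", 0.6, 1.0),
       ("your", 1.0, 1.2), ("hand", 1.2, 1.8)]
ref = ["i", "want", "to", "hold", "your", "hand"]


def transfer_timings(asr, ref):
    """Copy timings for matching words; spread a replaced span evenly
    over the reference words that replace it."""
    asr_words = [w for w, _, _ in asr]
    matcher = SequenceMatcher(None, asr_words, ref, autojunk=False)
    timed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for (_, start, end), word in zip(asr[i1:i2], ref[j1:j2]):
                timed.append((word, start, end))
        elif tag == "replace":
            span_start, span_end = asr[i1][1], asr[i2 - 1][2]
            step = (span_end - span_start) / (j2 - j1)
            for k, word in enumerate(ref[j1:j2]):
                timed.append((word, span_start + k * step, span_start + (k + 1) * step))
        elif tag == "insert":
            # Reference words the ASR missed: left untimed for a later gap-fill pass.
            timed.extend((word, None, None) for word in ref[j1:j2])
        # "delete" (extra ASR words with no reference counterpart) is dropped.
    return timed


aligned = transfer_timings(asr, ref)
```

Here "wanna" is replaced by "want to", so its 0.2 to 0.6 s span is split evenly between the two reference words.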

	### 6. Timing Refinement (Audio Analysis)
	- Onset detection: Spectral flux + librosa ODF → snap word starts to actual sound onsets
	- Energy envelope: RMS decay → find precise word endings
	- Silence gaps: Detect inter-word pauses → refine boundaries
	- Backtracking: Snaps to the energy trough preceding each onset (true word start)
	- Result: sub-10ms precision (5.8ms frame resolution at 44100Hz, hop=256)
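
The frame resolution quoted above is simply `hop / sample_rate`. The sketch below shows that arithmetic and one plausible form of the snapping step, assuming onset times have already been detected elsewhere. `snap_to_onsets` and its 50 ms tolerance are illustrative assumptions, not the library's `refine_timings`.

```python
from bisect import bisect_left

SAMPLE_RATE = 44100
HOP_LENGTH = 256
FRAME_SECONDS = HOP_LENGTH / SAMPLE_RATE  # about 0.0058 s per analysis frame


def snap_to_onsets(word_starts, onset_times, tolerance=0.05):
    """Move each word start to the nearest onset within `tolerance` seconds.

    `onset_times` must be sorted ascending. Starts with no nearby onset are kept.
    """
    snapped = []
    for start in word_starts:
        idx = bisect_left(onset_times, start)
        # Only the onsets just before and just after `start` can be nearest.
        neighbours = onset_times[max(idx - 1, 0): idx + 1]
        nearest = min(neighbours, key=lambda t: abs(t - start), default=None)
        if nearest is not None and abs(nearest - start) <= tolerance:
            snapped.append(nearest)
        else:
            snapped.append(start)
    return snapped


onsets = [0.512, 1.030, 1.498]
refined = snap_to_onsets([0.50, 1.00, 2.00], onsets)
```

Bounding the snap distance keeps a word from jumping to an unrelated onset when the detector misses its true attack.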

	## Installation

	```bash
	# Core (separation + refinement)
	pip install lyric-sync

	# With WhisperX transcription (recommended)
	pip install "lyric-sync[whisperx]"

	# With song identification
	pip install "lyric-sync[identify]"

	# Everything (quotes keep shells such as zsh from globbing the brackets)
	pip install "lyric-sync[all]"

	# System dependency: chromaprint (for AcoustID fingerprinting)
	# Ubuntu/Debian:
	sudo apt-get install libchromaprint-tools ffmpeg
	# macOS:
	brew install chromaprint ffmpeg
	```

	## Usage

	### CLI

	```bash
	# Full automatic (identify + fetch lyrics + sync)
	lyric-sync song.mp3 --acoustid-key YOUR_KEY -v

	# With known metadata (faster, skips fingerprinting)
	lyric-sync song.mp3 --artist "Radiohead" --title "Creep" -o synced.lrc

	# JSON output for apps
	lyric-sync song.mp3 --artist "Queen" --title "Bohemian Rhapsody" --format json

	# ASS karaoke subtitles
	lyric-sync song.mp3 --artist "Artist" --title "Song" --format ass -o karaoke.ass

	# CPU-only processing (slower but no GPU needed)
	lyric-sync song.mp3 --device cpu --artist "Artist" --title "Song"
	```

	### Python API

	```python
	from lyric_sync import LyricSyncPipeline

	# Initialize
	pipeline = LyricSyncPipeline(
	    acoustid_key="YOUR_ACOUSTID_KEY",  # optional
	    device="cuda",  # or "cpu"
	)

	# Full auto
	result = pipeline.sync("song.mp3")

	# With known metadata
	result = pipeline.sync(
	    "song.mp3",
	    artist="Radiohead",
	    title="Creep",
	)

	# Access results
	print(result.song) # SongIdentification(title=..., artist=...)
	print(result.quality_score) # 0.85 (0-1 quality estimate)

	# Export
	print(result.to_lrc()) # Enhanced LRC with word-level timestamps
	print(result.to_json()) # JSON array of {word, start, end, confidence}
	print(result.to_srt()) # SRT subtitles
	print(result.to_ass()) # ASS karaoke with \k tags
	```

	### Step-by-Step (Advanced)

	```python
	from lyric_sync.separate import VocalSeparator
	from lyric_sync.transcribe import transcribe_vocals
	from lyric_sync.lyrics import fetch_lyrics
	from lyric_sync.align import align_words
	from lyric_sync.refine import refine_timings

	# 1. Separate vocals
	separator = VocalSeparator(device="cuda")
	vocals_16k, sr = separator.extract_vocals("song.mp3", target_sr=16000)
	vocals_full, sr_full = separator.extract_vocals_full_rate("song.mp3")

	# 2. Transcribe
	transcript = transcribe_vocals(vocals_16k, sr=sr, backend="whisperx")

	# 3. Fetch lyrics
	lyrics = fetch_lyrics(artist="Radiohead", title="Creep")

	# 4. Align
	aligned_words, stats = align_words(
	    asr_words=transcript.words,
	    ref_words=lyrics.words,
	)

	# 5. Refine
	refined_words = refine_timings(vocals_full, sr_full, aligned_words)
	```

	## Output Formats

	| Format | Description | Use Case |
	|--------|-------------|----------|
	| `lrc` (enhanced) | `[MM:SS.cc] <MM:SS.cc> word ...` | Music players with word-level sync |
	| `lrc_standard` | `[MM:SS.cc] Line of text` | Standard music players |
	| `json` | `[{"word": ..., "start": ..., "end": ...}]` | Apps, programmatic use |
	| `srt` | Standard SRT subtitles | Video players |
	| `ass` | ASS with `\kf` karaoke tags | Karaoke / video editing |
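
To make the enhanced LRC layout concrete, here is a small sketch that renders one line from word start times. The helper names and sample words are hypothetical, and the output of `result.to_lrc()` may differ in detail.

```python
def lrc_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS.cc (centiseconds), the LRC timestamp style."""
    centis = round(seconds * 100)
    minutes, centis = divmod(centis, 6000)
    secs, centis = divmod(centis, 100)
    return f"{minutes:02d}:{secs:02d}.{centis:02d}"


def enhanced_lrc_line(words) -> str:
    """Build one enhanced-LRC line from (word, start_seconds) pairs.

    The line tag reuses the first word's start time.
    """
    line_tag = f"[{lrc_timestamp(words[0][1])}]"
    body = " ".join(f"<{lrc_timestamp(start)}> {word}" for word, start in words)
    return f"{line_tag} {body}"


line = enhanced_lrc_line([("one", 62.05), ("two", 62.41), ("three", 62.9), ("four", 63.13)])
```

Converting to integer centiseconds before splitting into minutes and seconds avoids rounding artifacts such as a seconds field of `60`.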

	## Configuration

	### Environment Variables

	| Variable | Description |
	|----------|-------------|
	| `ACOUSTID_API_KEY` | AcoustID API key (free, register at acoustid.org) |
	| `GENIUS_TOKEN` | Genius API token (free, for plain lyrics fallback) |

	### Hardware Requirements

	| Component | GPU (CUDA) | CPU |
	|-----------|-----------|-----|
	| Demucs (htdemucs_ft) | ~4-6 GB VRAM | ~8 GB RAM, slower |
	| WhisperX (large-v2) | ~5-6 GB VRAM | ~8 GB RAM, much slower |
	| Total | ~10-12 GB VRAM | ~16 GB RAM |
	| Processing time (4min song) | ~30-60s | ~5-10 min |

	### Transcription Backends

	| Backend | Quality (singing) | Speed | Dependencies |
	|---------|----------|-------|--------------|
	| WhisperX ⭐ | Best (phoneme alignment) | Fast (batched) | `whisperx` |
	| Whisper (pipeline) | Good (attention-based) | Fast | `transformers` |
	| Granite Speech | Unknown (speech-trained) | Medium | `transformers` |
	## How It Works (Technical)

	### Alignment Algorithm

	The core challenge: ASR makes errors on singing (WER ~15-25%), but we need timestamps
	on the correct lyrics. We solve this with sequence alignment:

	1. Normalize both word sequences (lowercase, strip punctuation, expand contractions)
	2. Fuzzy pre-pass: Map phonetically similar ASR words to their reference equivalents
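	3. SequenceMatcher: Compute a global alignment by recursively matching the longest contiguous blocks (Ratcliff/Obershelp style; worst case O(n²), not guaranteed to be a minimal edit script)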
	4. Transfer: For `equal` blocks → direct timestamp copy. For `replace` → linear interpolation
	5. Gap-fill: Interpolate from surrounding anchors for missed words
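
Steps 1 and 5 can be sketched as follows. This is a simplified illustration with hypothetical helpers (`normalize`, `fill_gaps`) and a deliberately tiny contraction table, assuming aligned words are `(word, start, end)` tuples where missed words carry `None` timings.

```python
import re

# Small illustrative contraction table; a real one would be much larger.
CONTRACTIONS = {"gonna": "going to", "wanna": "want to", "can't": "cannot", "i'm": "i am"}


def normalize(words):
    """Lowercase, strip punctuation (keeping apostrophes), expand contractions."""
    out = []
    for word in words:
        cleaned = re.sub(r"[^\w']", "", word.lower())
        if cleaned:
            out.extend(CONTRACTIONS.get(cleaned, cleaned).split())
    return out


def fill_gaps(timed):
    """Give untimed words evenly spaced slots between the nearest timed neighbours.

    Untimed words at the very beginning or end have no anchor on one side
    and are left unchanged.
    """
    result = list(timed)
    i = 0
    while i < len(result):
        if result[i][1] is not None:
            i += 1
            continue
        j = i
        while j < len(result) and result[j][1] is None:
            j += 1  # j ends at the first timed word after the gap
        if i > 0 and j < len(result):
            left_end, right_start = result[i - 1][2], result[j][1]
            step = (right_start - left_end) / (j - i)
            for k in range(i, j):
                start = left_end + (k - i) * step
                result[k] = (result[k][0], start, start + step)
        i = j
    return result


normalized = normalize(["I'm", "gonna", "Fly!"])
filled = fill_gaps([("so", 1.0, 1.2), ("very", None, None),
                    ("far", None, None), ("away", 2.0, 2.5)])
```

Because contraction expansion changes the word count, a real pipeline also needs to remember which original word each normalized token came from.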

	### Onset Detection for Refinement

	After alignment gives ~20-50ms accuracy, we refine to ~5-10ms using:

	1. Fused ODF: Spectral flux (catches plosives: p/b/t/k) + librosa onset_strength (catches vowels)
	2. Backtrack: Each onset is snapped to the preceding energy trough (true attack point)
	3. RMS decay: Word ends are found where energy drops below threshold
	4. Silence gaps: Inter-word pauses provide definitive boundary anchors
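
The RMS-decay idea in step 3 can be demonstrated without any audio library. The sketch below is a pure-Python stand-in for the librosa-based analysis, run on a synthetic signal. The function names, frame length, and the 10% relative threshold are assumptions chosen for illustration.

```python
import math


def frame_rms(samples, frame_length=1024, hop_length=256):
    """Return the RMS energy of each hop-spaced frame of a mono signal."""
    rms = []
    for start in range(0, max(len(samples) - frame_length, 0) + 1, hop_length):
        frame = samples[start:start + frame_length]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return rms


def find_word_end(rms, start_frame, max_frame, rel_threshold=0.1):
    """Return the first frame from `start_frame` on whose RMS falls below
    `rel_threshold` times the word's peak RMS; fall back to `max_frame`."""
    peak = max(rms[start_frame:max_frame], default=0.0)
    for frame in range(start_frame, max_frame):
        if rms[frame] < rel_threshold * peak:
            return frame
    return max_frame


SR, HOP = 44100, 256
# Synthetic "word": 0.2 s of a 220 Hz tone followed by 0.2 s of silence.
tone = [math.sin(2 * math.pi * 220 * n / SR) for n in range(int(0.2 * SR))]
signal = tone + [0.0] * int(0.2 * SR)
rms = frame_rms(signal, hop_length=HOP)
end_frame = find_word_end(rms, start_frame=0, max_frame=len(rms))
end_seconds = end_frame * HOP / SR
```

A threshold relative to the word's own peak, rather than an absolute level, keeps the detection stable across quiet and loud passages.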

	## References

	- WhisperX: [arxiv:2303.00747](https://arxiv.org/abs/2303.00747) — Forced phoneme alignment
	- HTDemucs: [arxiv:2211.08553](https://arxiv.org/abs/2211.08553) — Hybrid Transformer source separation
	- ALT Benchmark: [arxiv:2506.15514](https://arxiv.org/abs/2506.15514) — Demucs+Whisper for lyrics
	- Granite Speech: [arxiv:2604.22817](https://arxiv.org/abs/2604.22817) — In-Sync timestamp training
	- LRCLIB: [lrclib.net](https://lrclib.net) — Community synced lyrics database
	- AcoustID: [acoustid.org](https://acoustid.org) — Open audio fingerprint database

	## License

	MIT

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern