---
language: de
license: apache-2.0
tags:
- whisper
- lora
- peft
- automatic-speech-recognition
- singing-voice
- lyrics-transcription
base_model: openai/whisper-small
library_name: peft
pipeline_tag: automatic-speech-recognition
---

# AUTOLYRICS: Whisper-small + LoRA for Singing Lyrics Transcription

LoRA adapter for `openai/whisper-small`, fine-tuned for **singing voice → lyrics** transcription. Built as a 4-day end-to-end ML project; see the full repo on [GitHub](https://github.com/ramduvvuri/autolyrics) and the live demo on the [HF Space](https://huggingface.co/spaces/Petercoder/autolyrics).

## Why this exists

Off-the-shelf ASR fails on singing because of pitch variation, sustained phonemes, rhythm irregularities, and (often) backing music. This adapter recovers part of that loss (WER 37.5% → 34.5%, CER 27.1% → 17.8% on the held-out test set) with ~0.5% extra trainable parameters.

## Results on held-out singing test set

| Metric | Whisper-small (baseline) | + LoRA (this adapter) | Δ |
|---|---|---|---|
| WER | 37.5% | **34.5%** | **-3.0 pts** |
| CER | 27.1% | **17.8%** | **-9.3 pts** |
| RTF on T4 | 0.03 | 0.03 | ~same |

Test set: 13 clips, song-disjoint from the training set.

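For readers who want to reproduce this kind of comparison, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The block below is a minimal, dependency-free sketch of that computation. The function names `edit_distance` and `error_rate` are hypothetical, and it assumes the transcripts have already been normalized (lowercased, punctuation stripped); this card does not state which normalization or scoring tool produced the numbers above, so treat it as an illustration rather than the evaluation script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference unit
                            curr[j - 1] + 1,      # insert a hypothesis unit
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits divided by total reference units.

    unit="word" gives WER, unit="char" gives CER.
    Assumes refs and hyps are already normalized (an assumption of this sketch).
    """
    edits = total = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(r, h)
        total += len(r)
    return edits / total if total else 0.0


if __name__ == "__main__":
    refs = ["ich liebe dich so sehr"]   # made-up reference lyric
    hyps = ["ich liebe dich sehr"]      # made-up model output, one word dropped
    print(f"WER: {error_rate(refs, hyps, 'word'):.3f}")
    print(f"CER: {error_rate(refs, hyps, 'char'):.3f}")
```

Pooling edits over the whole corpus, as done here, weights long clips more heavily than averaging per-clip rates; with only 13 test clips the two conventions can differ noticeably.
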
## How to use

```python
import torch
import torchaudio
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "Petercoder/autolyrics-whisper-small-lora")
proc = WhisperProcessor.from_pretrained("Petercoder/autolyrics-whisper-small-lora")
model.eval()

# Pin language and task so Whisper neither auto-detects nor translates.
model.generation_config.language = "de"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Whisper expects 16 kHz mono audio.
wav, sr = torchaudio.load("song_clip.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)  # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# The processor pads or truncates the audio to Whisper's 30-second window.
feats = proc(wav.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt").input_features
with torch.no_grad():
    ids = model.generate(input_features=feats, num_beams=5, max_new_tokens=225)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])
```

For best results, isolate the vocals first with [Demucs](https://github.com/facebookresearch/demucs) (`htdemucs_ft`), then pass the resulting `vocals.wav` to this model.

## Training details

- Base model: `openai/whisper-small` (244M params)
- PEFT: LoRA, r=32, alpha=64, dropout=0.05, target modules: `q_proj`, `v_proj`
- Trainable params: ~1.2M (~0.5% of total)
- Optimizer: AdamW, lr=1e-3, linear warmup over 50 steps
- Batch size: 8 × 2 gradient-accumulation steps = 16 effective; fp16
- Epochs: 5, with early stopping (patience=2) on eval WER
- Hardware: single NVIDIA T4 (Colab Pro)

## Dataset

DSing30 plus a curated subset of Jamendo Lyrics. All audio was vocal-isolated with Demucs (`htdemucs_ft`), and the train/val/test splits are song-disjoint.

## Limitations

- German only (training data was German).
- Heavy distortion and extreme growl vocals are still hard.
- Best results require vocal isolation as a preprocessing step.

## Citation

```
@misc{autolyrics2026,
  author = {ramduvvuri},
  title = {AUTOLYRICS: LoRA Fine-tuning of Whisper for Singing Lyrics},
  year = {2026},
  howpublished = {\url{https://github.com/ramduvvuri/autolyrics}}
}
```
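
As a note on the song-disjoint splits mentioned under Dataset: the idea is to assign whole songs, not individual clips, to train/val/test, so that no melody or lyric seen in training leaks into evaluation. The block below is a minimal sketch of one way to do that. The function name `song_disjoint_split`, the `(clip_id, song_id)` input format, and the `val_frac`/`test_frac`/`seed` parameters are all hypothetical; the card does not describe how the actual splits were built.

```python
import random
from collections import defaultdict


def song_disjoint_split(clips, val_frac=0.1, test_frac=0.1, seed=0):
    """Split clips into train/val/test so that no song spans two splits.

    clips: iterable of (clip_id, song_id) pairs (hypothetical input format).
    Returns a dict mapping split name to a list of clip ids.
    """
    # Group clips by the song they were cut from.
    by_song = defaultdict(list)
    for clip_id, song_id in clips:
        by_song[song_id].append(clip_id)

    # Shuffle songs (not clips) deterministically, then carve off val and test.
    songs = sorted(by_song)
    random.Random(seed).shuffle(songs)
    n_test = max(1, round(len(songs) * test_frac))
    n_val = max(1, round(len(songs) * val_frac))
    groups = {
        "test": songs[:n_test],
        "val": songs[n_test:n_test + n_val],
        "train": songs[n_test + n_val:],
    }
    return {name: [c for s in group for c in by_song[s]]
            for name, group in groups.items()}


if __name__ == "__main__":
    # Toy corpus: 10 songs with 3 clips each.
    clips = [(f"song{s}_clip{c}", f"song{s}") for s in range(10) for c in range(3)]
    split = song_disjoint_split(clips)
    print({name: len(ids) for name, ids in split.items()})
```

Splitting at the song level means the split fractions apply to songs, so clip counts per split can be uneven when songs differ in length.
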