---
language: de
license: apache-2.0
tags:
  - whisper
  - lora
  - peft
  - automatic-speech-recognition
  - singing-voice
  - lyrics-transcription
base_model: openai/whisper-small
library_name: peft
pipeline_tag: automatic-speech-recognition
---

# AUTOLYRICS — Whisper-small + LoRA for Singing Lyrics Transcription

LoRA adapter for `openai/whisper-small`, fine-tuned for **singing voice → lyrics**
transcription. Built as a 4-day end-to-end ML project. The full code is on
[GitHub](https://github.com/ramduvvuri/autolyrics) and a live demo runs as an
[HF Space](https://huggingface.co/spaces/Petercoder/autolyrics).

## Why this exists

Off-the-shelf ASR degrades on singing because of pitch variation, sustained
phonemes, irregular rhythm, and (often) backing music. This adapter
recovers part of that loss (WER 37.5% → 34.5%, CER 27.1% → 17.8% on the
held-out test set) while training only ~0.5% additional parameters.

## Results on held-out singing test set

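| Metric | Whisper-small (baseline) | + LoRA (this adapter) | Δ |
|---|---|---|---|
| WER | 37.5% | **34.5%** | **-3.0 pts** |
| CER | 27.1% | **17.8%** | **-9.3 pts** |
| RTF on T4 | 0.03 | 0.03 | ~same |

WER = word error rate, CER = character error rate, RTF = real-time factor
(processing time divided by audio duration). Lower is better for all three.

Test set: 13 clips, song-disjoint from the training set.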
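
For readers unfamiliar with the metrics, the sketch below shows how WER and CER are commonly computed from an edit distance. It is a minimal illustration, not the project's evaluation script: the function names are hypothetical, and it assumes no text normalisation (casing, punctuation), which the actual evaluation may have applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative strings only, not taken from the test set.
ref = "ich liebe dich so sehr"
hyp = "ich liebe dich zu sehr"
print(wer(ref, hyp))  # one of five words is wrong
print(cer(ref, hyp))  # two of 22 characters are wrong
```

A single misheard word costs a full word error but only a few character errors, which is one plausible reason the CER gain in the table is larger than the WER gain.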

## How to use

```python
import torchaudio
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the base model, then attach the LoRA adapter weights on top of it.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "Petercoder/autolyrics-whisper-small-lora")
proc = WhisperProcessor.from_pretrained("Petercoder/autolyrics-whisper-small-lora")

# Decode as German transcription; clear any legacy forced_decoder_ids so the
# language/task settings below take effect.
model.generation_config.language = "de"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None

# Whisper expects 16 kHz mono audio.
wav, sr = torchaudio.load("song_clip.wav")
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)  # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

feats = proc(wav.squeeze(0).numpy(), sampling_rate=16000,
             return_tensors="pt").input_features
# Pass input_features by keyword so the PEFT wrapper forwards it unambiguously.
ids = model.generate(input_features=feats, num_beams=5, max_new_tokens=225)
print(proc.batch_decode(ids, skip_special_tokens=True)[0])
```

For best results, isolate the vocals first with [Demucs](https://github.com/facebookresearch/demucs)
(`htdemucs_ft`), then pass the resulting `vocals.wav` to this model.
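
One way to script the isolation step is to call the Demucs command-line tool from Python, as sketched below. The helper names are hypothetical. The sketch assumes `demucs` is installed and on `PATH`, and that it writes stems to `<out_dir>/<model>/<track name>/vocals.wav`, which is its default layout at the time of writing. Check this against your installed version.

```python
import subprocess
from pathlib import Path


def demucs_command(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Build the CLI call that splits a track into vocals / no_vocals."""
    return ["demucs", "-n", model, "--two-stems", "vocals",
            "-o", str(out_dir), str(audio_path)]


def expected_vocals_path(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Where Demucs is assumed to write the isolated vocal stem."""
    return Path(out_dir) / model / Path(audio_path).stem / "vocals.wav"


def isolate_vocals(audio_path, out_dir="separated", model="htdemucs_ft"):
    """Run Demucs and return the path of the vocal stem."""
    subprocess.run(demucs_command(audio_path, out_dir, model), check=True)
    return expected_vocals_path(audio_path, out_dir, model)
```

Calling `isolate_vocals("song_clip.mp3")` would then return a path that can be given to `torchaudio.load` in the snippet above.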

## Training details

- Base model: `openai/whisper-small` (244M params)
- PEFT: LoRA, r=32, alpha=64, dropout=0.05, target=`q_proj,v_proj`
- Trainable params: ~1.2M (~0.5% of total)
- Optimizer: AdamW, lr=1e-3, linear warmup over 50 steps
- Batch size: 8 × 2 gradient-accumulation steps = 16 effective; fp16
- Epochs: up to 5, with early stopping (patience=2) on eval WER
- Hardware: single NVIDIA T4 (Colab Pro)
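
The early-stopping rule listed above can be sketched in a few lines. This is an illustration of the general technique rather than the project's training loop: it assumes one evaluation per epoch and that any strict decrease in WER counts as an improvement. The function name and the WER values are hypothetical.

```python
def early_stopping_epoch(eval_wers, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive evaluations fail to
    improve on the best (lowest) WER seen so far.
    """
    best_wer = float("inf")
    best_epoch = 0
    bad_evals = 0
    for epoch, value in enumerate(eval_wers, start=1):
        if value < best_wer:
            best_wer, best_epoch, bad_evals = value, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return len(eval_wers), best_epoch


# Made-up WER curve: improves for three epochs, then degrades twice.
print(early_stopping_epoch([0.40, 0.36, 0.35, 0.36, 0.37]))
```

With patience=2 and a 5-epoch budget, a run stops early only if its best checkpoint comes at epoch 2 or earlier; in either case the checkpoint with the lowest eval WER is the one to keep.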

## Dataset

DSing30 plus a curated subset of Jamendo Lyrics. All audio was vocal-isolated with Demucs (`htdemucs_ft`) and divided into song-disjoint train/val/test splits, so no song appears in more than one split.
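
A song-disjoint split can be produced by grouping clips by song and assigning whole songs to a split, as in the sketch below. It is not the project's data pipeline: the function name, the `(clip_id, song_id)` input format, the split fractions, and the seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def song_disjoint_split(clips, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (clip_id, song_id) pairs so no song spans two splits."""
    by_song = defaultdict(list)
    for clip_id, song_id in clips:
        by_song[song_id].append(clip_id)

    # Shuffle songs, not clips, so every clip of a song stays together.
    songs = sorted(by_song)
    random.Random(seed).shuffle(songs)

    n_test = max(1, round(len(songs) * test_frac))
    n_val = max(1, round(len(songs) * val_frac))
    groups = {
        "test": songs[:n_test],
        "val": songs[n_test:n_test + n_val],
        "train": songs[n_test + n_val:],
    }
    return {name: [clip for song in group for clip in by_song[song]]
            for name, group in groups.items()}


# Hypothetical clip list: 10 songs with 3 clips each.
clips = [(f"song{s}_clip{c}", f"song{s}") for s in range(10) for c in range(3)]
splits = song_disjoint_split(clips)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting at the song level matters because clips from the same song share melody, singer, and lyrics; a clip-level random split would leak that information into the test set and overstate accuracy.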

## Limitations

- German only (training data was German).
- Heavily distorted or extreme growl vocals remain hard to transcribe.
- Best results require vocal isolation as a preprocessing step.

## Citation

```
@misc{autolyrics2026,
  author = {ramduvvuri},
  title  = {AUTOLYRICS: LoRA Fine-tuning of Whisper for Singing Lyrics},
  year   = {2026},
  howpublished = {\url{https://github.com/ramduvvuri/autolyrics}}
}
```