Surt Small v1 Kirtan - Gurbani Kirtan ASR

Fine-tuned from surt-small-v1 (the Sehaj Path model) on kirtan audio for Gurbani kirtan transcription and forced alignment.

Model Details

| Parameter | Value |
|---|---|
| Base model | surindersinghssj/surt-small-v1-training (step 3400) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcription |
| WER | 32.65% |
| CER | 24.62% |
| Model size | 0.2B params (F32, Safetensors) |
| Training data | 260 kirtan samples from 11 artists |

Training

  • Dataset: surindersinghssj/gurbani-asr-whisper-aligned (260 train / 31 eval)
  • Artists: 11 kirtan artists (Bhai Manpreet Singh Kanpuri, Bhai Anantvir Singh, etc.)
  • Hardware: NVIDIA A40, single GPU
  • Training time: ~21 minutes (500 steps)
  • Effective batch size: 16 (batch size 8 × gradient accumulation 2)
  • Learning rate: 2e-5 (lower than in the base training run, since training continues from an already fine-tuned model)
  • Scheduler: cosine decay with 50 warmup steps
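The schedule above can be sketched in plain Python. This is a minimal illustration (the `lr_at` helper is hypothetical, not part of any training script), mirroring the 2e-5 peak learning rate, 50 warmup steps, and 500 total steps listed:

```python
import math

def lr_at(step, peak_lr=2e-5, warmup_steps=50, total_steps=500):
    """Cosine decay with linear warmup.

    Linear ramp from 0 to peak_lr over the warmup steps, then cosine
    decay from peak_lr down to ~0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(50))   # peak: 2e-05
print(lr_at(500))  # decayed to 0.0
```

The short warmup avoids a destabilizing jump in learning rate at the start, which matters when continuing from an already fine-tuned checkpoint.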

WER Progression

| Step | Epoch | WER | CER |
|---|---|---|---|
| 0 (pre-training) | - | 118.19% | 92.82% |
| 100 | 3.0 | 61.63% | 48.85% |
| 200 | 8.9 | 41.63% | 30.34% |
| 300 | 14.7 | 31.84% | 23.19% |
| 350 | 17.7 | 31.43% | 23.19% |
| 500 (final) | 29.4 | 32.65% | 24.62% |
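WER and CER are edit distances (at the word and character level respectively) divided by the reference length. A minimal sketch of how they are computed (these helpers are illustrative, not the exact evaluation code; libraries such as `jiwer` are normally used):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# One substitution plus one deletion over a 4-word reference:
print(wer("a b c d", "a x c"))  # 0.5
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the pre-fine-tuning baseline at step 0 shows 118.19%.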

Related Models

| Model | Use case | WER |
|---|---|---|
| surt-small-v1 | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan (this) | Kirtan transcription/alignment | 32.65% |

Usage

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# The processor (feature extractor + tokenizer) comes from the base Whisper
# checkpoint; the fine-tuned weights come from this repository.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1-kirtan")

# Whisper expects 16 kHz audio.
audio, sr = librosa.load("kirtan_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Note: Use WhisperProcessor.from_pretrained("openai/whisper-small") for the processor.

Limitations

  • Trained on only 260 samples; more data would likely improve performance substantially
  • Best WER was reached at step 350 (31.43%); the model overfits slightly after that
  • Audio-text alignment in training data is imperfect (kirtan involves repetition and musical phrasing)

License

Apache 2.0
