# Surt Small v1 Kirtan – Gurbani Kirtan ASR
Fine-tuned from surt-small-v1 (the Sehaj Path model) on kirtan audio for Gurbani kirtan transcription and forced alignment.
## Model Details
| Parameter | Value |
|---|---|
| Base model | surindersinghssj/surt-small-v1-training (step 3400) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcribe |
| WER | 32.65% |
| CER | 24.62% |
| Training data | 260 kirtan samples from 11 artists |
## Training
- Dataset: `surindersinghssj/gurbani-asr-whisper-aligned` (260 train / 31 eval)
- Artists: 11 kirtan artists (Bhai Manpreet Singh Kanpuri, Bhai Anantvir Singh, etc.)
- Hardware: NVIDIA A40, single GPU
- Training time: ~21 minutes (500 steps)
- Effective batch size: 16 (batch 8 x gradient accumulation 2)
- Learning rate: 2e-5 (lower than in base training, since this run continues from an already fine-tuned model)
- Scheduler: Cosine decay with 50 warmup steps
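
The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows. This is a minimal sketch for illustration; the output path, `fp16` flag, and dataset loading are assumptions, not taken from the actual training script.

```python
from datasets import load_dataset
from transformers import Seq2SeqTrainingArguments

# Dataset from the card; split sizes are 260 train / 31 eval
dataset = load_dataset("surindersinghssj/gurbani-asr-whisper-aligned")

# Sketch of the training hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="./surt-small-v1-kirtan",  # hypothetical output path
    per_device_train_batch_size=8,        # batch 8 ...
    gradient_accumulation_steps=2,        # ... x accumulation 2 = effective 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",           # cosine decay
    warmup_steps=50,
    max_steps=500,
    predict_with_generate=True,
    fp16=True,                            # assumed for the A40 GPU
)
```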
## WER Progression
| Step | Epoch | WER | CER |
|---|---|---|---|
| 0 (before fine-tuning) | – | 118.19% | 92.82% |
| 100 | 3.0 | 61.63% | 48.85% |
| 200 | 8.9 | 41.63% | 30.34% |
| 300 | 14.7 | 31.84% | 23.19% |
| 350 | 17.7 | 31.43% | 23.19% |
| 500 (final) | 29.4 | 32.65% | 24.62% |
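
The WER/CER numbers above can in principle be reproduced with the `evaluate` library. This is a generic sketch with placeholder strings, not the evaluation script actually used for this card.

```python
import evaluate

# Standard WER/CER metrics from the evaluate library
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder examples; the real evaluation uses the 31-sample eval split
predictions = ["model output transcription"]
references = ["ground truth transcription"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```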
## Related Models
| Model | Use case | WER |
|---|---|---|
| surt-small-v1 | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan (this) | Kirtan transcription/alignment | 32.65% |
## Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Processor from the base Whisper model; weights from the fine-tuned checkpoint
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1-kirtan")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("kirtan_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
**Note:** Use `WhisperProcessor.from_pretrained("openai/whisper-small")` for the processor.
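
For the forced-alignment use case, one option is the `automatic-speech-recognition` pipeline with word-level timestamps. This is an untested sketch: timestamp quality on sung kirtan audio (with repetition and long held notes) is not evaluated here, and the tokenizer/feature extractor again come from the base model.

```python
from transformers import pipeline

# Word-level timestamps via Whisper's timestamp decoding (sketch, untested here)
pipe = pipeline(
    "automatic-speech-recognition",
    model="surindersinghssj/surt-small-v1-kirtan",
    tokenizer="openai/whisper-small",          # processor files live in the base repo
    feature_extractor="openai/whisper-small",
    chunk_length_s=30,                         # chunk long kirtan recordings
)
result = pipe("kirtan_audio.wav", return_timestamps="word")
print(result["chunks"])  # [{"text": ..., "timestamp": (start, end)}, ...]
```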
## Limitations
- Trained on only 260 samples; more data would significantly improve performance
- Best WER was at step 350 (31.43%); slight overfitting set in after that
- Audio-text alignment in training data is imperfect (kirtan involves repetition and musical phrasing)
## License
Apache 2.0