Surt Small v1 — Gurbani ASR

Surt (meaning "awareness/attention" in Punjabi) is a Whisper-based automatic speech recognition model fine-tuned for Gurbani (Sikh scriptures) in Gurmukhi script.

Model Details

| Parameter | Value |
|---|---|
| Base model | openai/whisper-small (244M params) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcribe |
| Best WER | 14.88% |
| Best CER | 4.30% |
| Best checkpoint | Step 3400 / 5000 |
| Training data | ~63,700 examples from Sehaj Path recordings |

Evaluation Results

Cross-dataset evaluation (whisper-aligned kirtan, 100 samples)

| Model | WER | CER |
|---|---|---|
| Surt v1 (this model) | 118.19% | 92.82% |
| Whisper Small (baseline) | 607.87% | 529.85% |

Kirtan WER is high in absolute terms because the audio style differs markedly from the training data (WER can exceed 100% when the model inserts more words than the reference contains). Even so, this model is roughly 5x better than stock Whisper Small on Gurbani content. For kirtan-specific use, see surt-small-v1-kirtan.

Training

  • Dataset: surindersinghssj/gurbani-asr — ~63,700 training + 300 validation examples
  • Speaker: Giani Mehnga Singh (Sehaj Path)
  • Audio: 16 kHz, 0.5–17.6 s durations
  • Hardware: NVIDIA A40 (48 GB), single GPU
  • Training time: ~3.3 hours (5,000 steps)
  • Effective batch size: 64 (batch size 32 × gradient accumulation 2)
  • Optimizer: AdamW with discriminative learning rates (encoder 5e-5, decoder 1e-4)
  • Scheduler: cosine decay with 416 warmup steps
  • Precision: bf16
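Discriminative learning rates are typically implemented in PyTorch by passing parameter groups to the optimizer, with membership decided by parameter name. A minimal sketch of the grouping logic under that assumption (the `model.encoder.*` prefix matches Whisper's module naming in transformers; the exact grouping used for this run is not published):

```python
ENCODER_LR = 5e-5  # lower LR for the (mostly pretrained) encoder
DECODER_LR = 1e-4  # higher LR for the decoder and remaining parameters

def build_param_groups(named_params):
    """Split (name, param) pairs into encoder/decoder groups with their own LRs.

    The returned list has the shape expected by torch.optim.AdamW.
    """
    encoder, decoder = [], []
    for name, param in named_params:
        if ".encoder." in name or name.startswith("model.encoder"):
            encoder.append(param)
        else:
            decoder.append(param)  # decoder plus everything else (e.g. proj_out)
    return [
        {"params": encoder, "lr": ENCODER_LR},
        {"params": decoder, "lr": DECODER_LR},
    ]
```

The groups would then be consumed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.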

WER Progression

| Step | Epoch | WER | CER |
|---|---|---|---|
| 200 | 0.2 | 41.52% | 13.92% |
| 600 | 0.6 | 27.88% | 9.32% |
| 1000 | 1.0 | 23.19% | 7.02% |
| 1800 | 1.8 | 19.22% | 5.90% |
| 2600 | 2.6 | 17.86% | 5.49% |
| 3400 | 3.4 | 14.88% | 4.30% |
| 5000 | 5.0 | 15.39% | 4.59% |
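WER is the word-level edit distance between hypothesis and reference, divided by the reference word count, which is also why it can exceed 100% on the kirtan evaluation above. Libraries like jiwer compute it directly; a self-contained pure-Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

CER is the same computation over characters instead of words. A hypothesis full of spurious insertions against a short reference yields WER above 1.0 (i.e. above 100%).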

Related Models

| Model | Use case | WER |
|---|---|---|
| surt-small-v1 (this) | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan | Kirtan transcription/alignment | 32.65% |
| surt-small-v1-training | Training checkpoint (same as this) | 14.88% |

Usage

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the stock processor and the fine-tuned model weights
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("your_gurbani_audio.wav", sr=16000)

input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

Note: Use WhisperProcessor.from_pretrained("openai/whisper-small") for the processor to avoid tokenizer compatibility issues with newer transformers versions.
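Whisper's encoder consumes fixed 30-second windows, so recordings longer than that (a full Sehaj Path session, for example) need to be split before feeding them through the snippet above. A minimal sketch of the index math, assuming 16 kHz audio; the 2-second overlap is an illustrative choice to avoid cutting words at chunk boundaries, not a value from this model's training setup:

```python
SAMPLE_RATE = 16_000
CHUNK_S = 30    # Whisper's window length in seconds
OVERLAP_S = 2   # illustrative overlap between consecutive chunks

def chunk_bounds(n_samples, sample_rate=SAMPLE_RATE,
                 chunk_s=CHUNK_S, overlap_s=OVERLAP_S):
    """Return (start, end) sample indices covering the whole recording."""
    step = (chunk_s - overlap_s) * sample_rate
    size = chunk_s * sample_rate
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + size, n_samples)))
        if start + size >= n_samples:
            break  # this chunk already reaches the end of the audio
        start += step
    return bounds
```

Each `audio[start:end]` slice can then go through the processor and `model.generate` exactly as in the Usage example, with the overlapping words deduplicated when joining the transcripts.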

Limitations

  • Trained on a single speaker (Giani Mehnga Singh) — may not generalize well to other Gurbani reciters
  • Optimized for Sehaj Path style recitation — for kirtan, use the kirtan model
  • Gurmukhi script only — does not produce romanized output

Future Directions

  • Mixed-dataset training (Sehaj Path + kirtan) for a unified model
  • Fine-tune on more speakers for better generalization
  • Convert to faster-whisper (CTranslate2) for production inference
  • Test on Bhai Pishora Singh and other Sehaj Path speakers

License

Apache 2.0
