---
language:
  - pa
license: apache-2.0
base_model: openai/whisper-small
tags:
  - whisper
  - automatic-speech-recognition
  - gurbani
  - punjabi
  - gurmukhi
  - sehaj-path
datasets:
  - surindersinghssj/gurbani-asr
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: surt-small-v1
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Gurbani ASR
          type: surindersinghssj/gurbani-asr
          split: validation
        metrics:
          - type: wer
            value: 14.88
            name: WER
          - type: cer
            value: 4.3
            name: CER
---

# Surt Small v1: Gurbani ASR

Surt (meaning "awareness/attention" in Punjabi) is a Whisper-based automatic speech recognition model fine-tuned for Gurbani (Sikh scriptures) in Gurmukhi script.

## Model Details

| Parameter | Value |
| --- | --- |
| Base model | openai/whisper-small (244M params) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcribe |
| Best WER | 14.88% |
| Best CER | 4.30% |
| Best checkpoint | Step 3400 / 5000 |
| Training data | ~63,700 examples from Sehaj Path recordings |

## Evaluation Results

Cross-dataset evaluation (whisper-aligned kirtan, 100 samples):

| Model | WER | CER |
| --- | --- | --- |
| Surt v1 (this model) | 118.19% | 92.82% |
| Whisper Small (baseline) | 607.87% | 529.85% |

While WER on kirtan is high (the musical audio style differs sharply from the Sehaj Path training data), this model still scores roughly 5x better than stock Whisper on Gurbani content. For kirtan-specific use, see surt-small-v1-kirtan.
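The WER numbers above can exceed 100% because WER divides the word-level edit distance by the reference length; a hypothesis full of spurious insertions (as stock Whisper produces on kirtan) inflates the numerator without bound. A minimal, illustrative implementation, not the evaluation script behind this card:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """WER in percent; can exceed 100 when the hypothesis is much longer."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(wer("one two three", "one two four"))       # one substitution over three words
print(wer("one two", "one two three four five"))  # insertions push WER past 100%
```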

## Training

- **Dataset:** surindersinghssj/gurbani-asr (~63,700 training + 300 validation examples)
- **Speaker:** Giani Mehnga Singh (Sehaj Path)
- **Audio:** 16 kHz, 0.5–17.6 s clip duration
- **Hardware:** single NVIDIA A40 (48 GB)
- **Training time:** ~3.3 hours (5,000 steps)
- **Effective batch size:** 64 (per-device batch 32 × gradient accumulation 2)
- **Optimizer:** AdamW with discriminative learning rates (encoder 5e-5, decoder 1e-4)
- **Scheduler:** cosine decay with 416 warmup steps
- **Precision:** bf16

## WER Progression

| Step | Epoch | WER | CER |
| --- | --- | --- | --- |
| 200 | 0.2 | 41.52% | 13.92% |
| 600 | 0.6 | 27.88% | 9.32% |
| 1000 | 1.0 | 23.19% | 7.02% |
| 1800 | 1.8 | 19.22% | 5.90% |
| 2600 | 2.6 | 17.86% | 5.49% |
| 3400 | 3.4 | 14.88% | 4.30% |
| 5000 | 5.0 | 15.39% | 4.59% |

## Related Models

| Model | Use case | WER |
| --- | --- | --- |
| surt-small-v1 (this) | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan | Kirtan transcription/alignment | 32.65% |
| surt-small-v1-training | Training checkpoint (same as this) | 14.88% |

## Usage

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the stock Whisper processor (see note below) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("your_gurbani_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

> **Note:** Use `WhisperProcessor.from_pretrained("openai/whisper-small")` for the processor to avoid tokenizer compatibility issues with newer `transformers` versions.
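Training clips were at most ~17.6 s and Whisper operates on 30 s windows, so long recordings should be split into windows and transcribed piece by piece before concatenating the text. A minimal sketch of the boundary computation; the helper name and the simple non-overlapping windowing are illustrative assumptions, not part of this repo:

```python
def chunk_bounds(n_samples, sr=16000, window_s=30.0):
    """Return (start, end) sample-index pairs covering n_samples
    in consecutive, non-overlapping windows of window_s seconds."""
    win = int(sr * window_s)
    return [(s, min(s + win, n_samples)) for s in range(0, n_samples, win)]

# Example: a 65-second recording at 16 kHz splits into three windows,
# the last of which holds the remaining 5 seconds.
for start, end in chunk_bounds(65 * 16000):
    print(start, end)
    # ...feed audio[start:end] to the processor/model as above...
```

For production use, an overlap between windows (with de-duplication of the overlapping text) avoids cutting words at boundaries.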

## Limitations

- Trained on a single speaker (Giani Mehnga Singh); may not generalize well to other Gurbani reciters
- Optimized for Sehaj Path style recitation; for kirtan, use the kirtan model
- Gurmukhi script only; does not produce romanized output

## Future Directions

- Mixed-dataset training (Sehaj Path + kirtan) for a unified model
- Fine-tune on more speakers for better generalization
- Convert to faster-whisper (CTranslate2) for production inference
- Test on Bhai Pishora Singh and other Sehaj Path speakers
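The faster-whisper direction goes through CTranslate2's converter. A hedged sketch of what that invocation might look like; the output directory name and quantization choice are illustrative, so check the converter's current flags before running:

```shell
pip install ctranslate2 transformers

ct2-transformers-converter \
  --model surindersinghssj/surt-small-v1 \
  --output_dir surt-small-v1-ct2 \
  --quantization float16
```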

## License

Apache 2.0

## Acknowledgments