---
language:
  - pa
license: apache-2.0
base_model: openai/whisper-small
tags:
  - whisper
  - automatic-speech-recognition
  - gurbani
  - punjabi
  - gurmukhi
  - sehaj-path
datasets:
  - surindersinghssj/gurbani-asr
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition
model-index:
  - name: surt-small-v1
    results:
      - task:
          type: automatic-speech-recognition
          name: Speech Recognition
        dataset:
          name: Gurbani ASR
          type: surindersinghssj/gurbani-asr
          split: validation
        metrics:
          - type: wer
            value: 14.88
            name: WER
          - type: cer
            value: 4.3
            name: CER
---

# Surt Small v1: Gurbani ASR

Surt (meaning "awareness/attention" in Punjabi) is a Whisper-based automatic speech recognition model fine-tuned for Gurbani (Sikh scriptures) in Gurmukhi script.

## Model Details

| Parameter | Value |
| --- | --- |
| Base model | openai/whisper-small (244M params) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcribe |
| Best WER | 14.88% |
| Best CER | 4.30% |
| Best checkpoint | Step 3400 / 5000 |
| Training data | ~63,700 examples from Sehaj Path recordings |

## Evaluation Results

Cross-dataset evaluation (whisper-aligned kirtan, 100 samples):

| Model | WER | CER |
| --- | --- | --- |
| Surt v1 (this model) | 118.19% | 92.82% |
| Whisper Small (baseline) | 607.87% | 529.85% |

While WER on kirtan is high (the musical audio style differs sharply from the Sehaj Path training data), this model still scores roughly 5x better than stock Whisper on Gurbani content. For kirtan-specific use, see surt-small-v1-kirtan.
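The WER numbers above can exceed 100% because WER divides the word-level edit distance by the reference length; a hypothesis full of spurious insertions (as stock Whisper produces on kirtan) inflates the numerator without bound. A minimal, illustrative implementation, not the evaluation script behind this card:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """WER in percent; can exceed 100 when the hypothesis is much longer."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(wer("one two three", "one two four"))       # one substitution over three words
print(wer("one two", "one two three four five"))  # insertions push WER past 100%
```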

## Training

- **Dataset:** surindersinghssj/gurbani-asr (~63,700 training + 300 validation examples)
- **Speaker:** Giani Mehnga Singh (Sehaj Path)
- **Audio:** 16 kHz, 0.5–17.6 s clip duration
- **Hardware:** single NVIDIA A40 (48 GB)
- **Training time:** ~3.3 hours (5,000 steps)
- **Effective batch size:** 64 (per-device batch 32 × gradient accumulation 2)
- **Optimizer:** AdamW with discriminative learning rates (encoder 5e-5, decoder 1e-4)
- **Scheduler:** cosine decay with 416 warmup steps
- **Precision:** bf16

## WER Progression

| Step | Epoch | WER | CER |
| --- | --- | --- | --- |
| 200 | 0.2 | 41.52% | 13.92% |
| 600 | 0.6 | 27.88% | 9.32% |
| 1000 | 1.0 | 23.19% | 7.02% |
| 1800 | 1.8 | 19.22% | 5.90% |
| 2600 | 2.6 | 17.86% | 5.49% |
| 3400 | 3.4 | 14.88% | 4.30% |
| 5000 | 5.0 | 15.39% | 4.59% |

## Related Models

| Model | Use case | WER |
| --- | --- | --- |
| surt-small-v1 (this) | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan | Kirtan transcription/alignment | 32.65% |
| surt-small-v1-training | Training checkpoint (same as this) | 14.88% |

## Usage

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the stock Whisper processor (see note below) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("your_gurbani_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

> **Note:** Use `WhisperProcessor.from_pretrained("openai/whisper-small")` for the processor to avoid tokenizer compatibility issues with newer `transformers` versions.
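Training clips were at most ~17.6 s and Whisper operates on 30 s windows, so long recordings should be split into windows and transcribed piece by piece before concatenating the text. A minimal sketch of the boundary computation; the helper name and the simple non-overlapping windowing are illustrative assumptions, not part of this repo:

```python
def chunk_bounds(n_samples, sr=16000, window_s=30.0):
    """Return (start, end) sample-index pairs covering n_samples
    in consecutive, non-overlapping windows of window_s seconds."""
    win = int(sr * window_s)
    return [(s, min(s + win, n_samples)) for s in range(0, n_samples, win)]

# Example: a 65-second recording at 16 kHz splits into three windows,
# the last of which holds the remaining 5 seconds.
for start, end in chunk_bounds(65 * 16000):
    print(start, end)
    # ...feed audio[start:end] to the processor/model as above...
```

For production use, an overlap between windows (with de-duplication of the overlapping text) avoids cutting words at boundaries.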

## Limitations

- Trained on a single speaker (Giani Mehnga Singh); may not generalize well to other Gurbani reciters
- Optimized for Sehaj Path style recitation; for kirtan, use the kirtan model
- Gurmukhi script only; does not produce romanized output

## Future Directions

- Mixed-dataset training (Sehaj Path + kirtan) for a unified model
- Fine-tune on more speakers for better generalization
- Convert to faster-whisper (CTranslate2) for production inference
- Test on Bhai Pishora Singh and other Sehaj Path speakers
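The faster-whisper direction goes through CTranslate2's converter. A hedged sketch of what that invocation might look like; the output directory name and quantization choice are illustrative, so check the converter's current flags before running:

```shell
pip install ctranslate2 transformers

ct2-transformers-converter \
  --model surindersinghssj/surt-small-v1 \
  --output_dir surt-small-v1-ct2 \
  --quantization float16
```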

## License

Apache 2.0

## Acknowledgments