# Surt Small v1 – Gurbani ASR
Surt (meaning "awareness/attention" in Punjabi) is a Whisper-based automatic speech recognition model fine-tuned for Gurbani (Sikh scriptures) in Gurmukhi script.
## Model Details
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small (244M params) |
| Language | Punjabi (Gurmukhi script) |
| Task | Transcribe |
| Best WER | 14.88% |
| Best CER | 4.30% |
| Best checkpoint | Step 3400 / 5000 |
| Training data | ~63,700 examples from Sehaj Path recordings |
## Evaluation Results

Cross-dataset evaluation on Whisper-aligned kirtan (100 samples):
| Model | WER | CER |
|---|---|---|
| Surt v1 (this model) | 118.19% | 92.82% |
| Whisper Small (baseline) | 607.87% | 529.85% |
While kirtan WER is high (the audio style differs substantially from the Sehaj Path training data), the model is roughly 5x better than stock Whisper Small on Gurbani content. For kirtan-specific use, see surt-small-v1-kirtan.
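WER values above 100% are possible because WER counts substitutions, insertions, and deletions against the number of *reference* words, so heavy insertion (e.g. hallucinated text on out-of-domain audio) pushes the score past 100%. A minimal word-level implementation for illustration (a sketch, not necessarily the exact scorer used for the numbers above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

# Insertions alone can push WER past 100%:
print(wer("a b", "a x y z"))  # 3 edits / 2 reference words = 1.5
```

The same recurrence at character level gives CER, which is why CER stays lower than WER throughout the tables here.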
## Training

- Dataset: surindersinghssj/gurbani-asr – ~63,700 training + 300 validation examples
- Speaker: Giani Mehnga Singh (Sehaj Path)
- Audio: 16kHz, 0.5-17.6s duration
- Hardware: NVIDIA A40 (48 GB), single GPU
- Training time: ~3.3 hours (5000 steps)
- Effective batch size: 64 (batch 32 x gradient accumulation 2)
- Optimizer: AdamW with discriminative LR (encoder=5e-5, decoder=1e-4)
- Scheduler: Cosine decay with 416 warmup steps
- Precision: bf16
### WER Progression
| Step | Epoch | WER | CER |
|---|---|---|---|
| 200 | 0.2 | 41.52% | 13.92% |
| 600 | 0.6 | 27.88% | 9.32% |
| 1000 | 1.0 | 23.19% | 7.02% |
| 1800 | 1.8 | 19.22% | 5.90% |
| 2600 | 2.6 | 17.86% | 5.49% |
| 3400 | 3.4 | 14.88% | 4.30% |
| 5000 | 5.0 | 15.39% | 4.59% |
## Related Models
| Model | Use case | WER |
|---|---|---|
| surt-small-v1 (this model) | Sehaj Path transcription | 14.88% |
| surt-small-v1-kirtan | Kirtan transcription/alignment | 32.65% |
| surt-small-v1-training | Training checkpoint (same weights as this model) | 14.88% |
## Usage

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the stock processor (see note below) and the fine-tuned weights
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("surindersinghssj/surt-small-v1")

# Whisper expects 16 kHz mono audio
audio, sr = librosa.load("your_gurbani_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Note: use `WhisperProcessor.from_pretrained("openai/whisper-small")` for the processor to avoid tokenizer compatibility issues with newer `transformers` versions.
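If librosa is unavailable, a 16 kHz mono 16-bit PCM WAV can be loaded with only the standard library. This is a minimal sketch under those assumptions (no resampling, so the file must already be at 16 kHz):

```python
import struct
import wave

def load_wav_16k_mono(path: str) -> list[float]:
    """Read a 16-bit PCM mono WAV and return samples scaled to [-1, 1]."""
    with wave.open(path, "rb") as f:
        assert f.getframerate() == 16000, "model expects 16 kHz audio"
        assert f.getnchannels() == 1 and f.getsampwidth() == 2, "expects 16-bit mono PCM"
        raw = f.readframes(f.getnframes())
    # "<h" = little-endian signed 16-bit, the WAV PCM sample format
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples]
```

The returned list can be passed directly as `audio` to the `processor(...)` call above.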
## Limitations

- Trained on a single speaker (Giani Mehnga Singh), so it may not generalize well to other Gurbani reciters
- Optimized for Sehaj Path-style recitation; for kirtan, use the kirtan model
- Gurmukhi script only; does not produce romanized output
## Future Directions
- Mixed-dataset training (Sehaj Path + kirtan) for a unified model
- Fine-tune on more speakers for better generalization
- Convert to faster-whisper (CTranslate2) for production inference
- Test on Bhai Pishora Singh and other Sehaj Path speakers
## License
Apache 2.0
## Acknowledgments
- Base model by OpenAI
- Training tracked with Weights & Biases