Whisper Large-v2 — Sanskrit ASR (Fine-Tuned)

A fine-tuned version of openai/whisper-large-v2 on the pavanmantha/sanskrit_asr dataset for Sanskrit (sa) automatic speech recognition, trained with the Hugging Face Seq2SeqTrainer and DeepSpeed ZeRO-3 (via Accelerate) on 5× NVIDIA A10G GPUs.


Model Details

Property               Value
Base model             openai/whisper-large-v2
Language               Sanskrit (sa) — Devanagari script
Task                   Automatic speech recognition (transcribe)
Fine-tuning framework  Hugging Face Transformers + DeepSpeed ZeRO-3
Precision              bf16 (Ampere / sm_86 native)
Parameters             ~1.5B
License                Apache 2.0

Training Details

Dataset

Property          Value
Dataset           pavanmantha/sanskrit_asr
Train split       ~95% of full dataset (5% held out for validation, seed=42)
Validation split  ~5% of full dataset
Audio column      audio (resampled to 16 kHz)
Text column       sentence
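The card does not show the split code itself; assuming a seeded shuffle in the spirit of datasets' train_test_split(test_size=0.05, seed=42), the ~95/5 split can be sketched as:

```python
import random

def train_val_split(n_examples, val_fraction=0.05, seed=42):
    """Shuffle example indices with a fixed seed and carve off a validation set.

    Hypothetical helper mirroring the 95/5 split described above; the actual
    training run most likely used datasets.Dataset.train_test_split.
    """
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(n_examples * val_fraction))
    return idx[n_val:], idx[:n_val]

train_idx, val_idx = train_val_split(1000)
```

Fixing the seed makes the held-out 5% reproducible across runs, which matters when "best model" selection is based on validation WER.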

Hardware & Infrastructure

Property              Value
GPUs                  5× NVIDIA A10G (22.5 GB VRAM each)
Instance type         AWS G5 (or equivalent)
Distributed strategy  DeepSpeed ZeRO Stage 3 via Accelerate
ZeRO-3 flags          stage3_gather_16bit_weights_on_model_save: true, overlap_comm: true, contiguous_gradients: true
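The ZeRO-3 flags above correspond to a DeepSpeed config roughly like the following sketch. Only the three flags named in the table and the bf16 precision come from this card; the "auto" entries are common Hugging Face conventions and are assumptions here:

```python
# DeepSpeed ZeRO-3 config sketch; could be passed as
# Seq2SeqTrainingArguments(deepspeed=ds_config) or saved as ds_config.json.
ds_config = {
    "bf16": {"enabled": True},  # matches the bf16 precision row above
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,            # overlap gradient comms with compute
        "contiguous_gradients": True,    # reduce gradient memory fragmentation
        # Gather full 16-bit weights on rank 0 when saving, so checkpoints
        # are plain (non-sharded) state dicts.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets Transformers fill these in from TrainingArguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```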

Hyperparameters

Parameter                    Value
Epochs                       3
Per-device batch size        8
Gradient accumulation steps  4
Effective batch size         160 (8 × 4 × 5 GPUs)
Learning rate                5e-6
LR scheduler                 Linear with warmup
Warmup steps                 200
Weight decay                 0.01
Max grad norm                1.0
Eval beam size               5
Generation max length        225 tokens
Eval & save frequency        Every 500 steps
Best model metric            WER (lower is better)
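As a sketch, the table maps onto standard Seq2SeqTrainingArguments kwargs along these lines (argument names are the stock Hugging Face ones; output_dir is a placeholder, and the exact arguments used for this run are not published):

```python
# Hyperparameters from the table, expressed as Seq2SeqTrainingArguments kwargs.
training_kwargs = dict(
    output_dir="whisper-sa-finetune",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_steps=200,
    weight_decay=0.01,
    max_grad_norm=1.0,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    metric_for_best_model="wer",
    greater_is_better=False,            # lower WER is better
    generation_max_length=225,
    generation_num_beams=5,
    predict_with_generate=True,
    bf16=True,
)

# Effective batch size = per-device batch × grad accumulation × number of GPUs
n_gpus = 5
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"]
                   * n_gpus)  # 8 × 4 × 5 = 160
```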

Text Normalization (WER)

WER is computed after applying a custom Sanskrit/Devanagari normalizer:

  • NFC Unicode recomposition (fixes encoding variants)
  • Remove danda (।), double-danda (॥), and common ASCII punctuation
  • Remove nukta (U+093C): normalizes ज़→ज, फ़→फ
  • Chandrabindu → Anusvara (U+0901 → U+0902)
  • Collapse whitespace
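The steps above can be sketched as a small Python function. The exact normalizer used in training is not published, so this is a faithful reading of the listed rules rather than the original code. One subtlety: precomposed nukta letters such as U+095B (ज़) are Unicode composition exclusions, so NFC decomposes them into base letter + combining nukta (U+093C), which step 3 then strips:

```python
import re
import string
import unicodedata

def normalize_devanagari(text: str) -> str:
    """Sanskrit/Devanagari normalizer following the steps listed above."""
    # 1. NFC recomposition (also splits precomposed nukta letters,
    #    e.g. U+095B -> U+091C U+093C, since they are composition exclusions).
    text = unicodedata.normalize("NFC", text)
    # 2. Remove danda, double danda, and common ASCII punctuation.
    text = re.sub(r"[।॥]", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Remove combining nukta: ज़ -> ज, फ़ -> फ.
    text = text.replace("\u093c", "")
    # 4. Chandrabindu -> anusvara.
    text = text.replace("\u0901", "\u0902")
    # 5. Collapse whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same normalizer to both reference and hypothesis before scoring keeps WER from penalizing pure encoding or punctuation differences.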

Key Training Flags

  • gradient_checkpointing=True with use_reentrant=False (required for ZeRO-3 compatibility)
  • forced_decoder_ids=None and suppress_tokens=[] so that no tokens are suppressed during generation, allowing Devanagari character tokens to be emitted
  • predict_with_generate=True with beam search (num_beams=5) for evaluation

Usage

Quick Inference

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-sa",
    generate_kwargs={"language": "sanskrit", "task": "transcribe"}
)

result = asr("path/to/audio.mp3")
print(result["text"])

Manual Inference

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "pavanmantha/whisper-medium-sa"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load your audio (must be 16 kHz mono)
# audio_array: np.ndarray, shape (N,), dtype float32

inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt"
)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="sa",
        task="transcribe",
        num_beams=5
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Evaluation

Split       WER
Validation  TBD

WER values are computed using the custom Devanagari normalizer described above.
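For reference, WER over normalized text can be computed with a word-level Levenshtein distance. The training run presumably used a library such as evaluate or jiwer; this is a minimal pure-Python equivalent:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = cur
    return prev[-1] / max(1, len(ref))

# One substitution out of three reference words -> WER = 1/3.
print(wer("a b c", "a x c"))
```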


Files

File                    Description
model.safetensors       Fine-tuned model weights
config.json             Model architecture config
generation_config.json  Generation defaults (language=sa, task=transcribe)
tokenizer.json          Whisper tokenizer
tokenizer_config.json   Tokenizer configuration
processor_config.json   Processor configuration
vocab.json              Vocabulary file
merges.txt              BPE merge rules

Limitations

  • Optimized specifically for Sanskrit (sa) in Devanagari script; performance on other languages or scripts is not guaranteed.
  • Audio inputs longer than 30 seconds are truncated to the first 30 seconds by the Whisper feature extractor.
  • Best results on clean, single-speaker audio sampled at 16 kHz.

Citation

If you use this model, please cite the original Whisper paper:

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

Author

Fine-tuned by Pavan Kumar Mantha · GitHub
