Robust Speech Recognition via Large-Scale Weak Supervision
Paper: [arXiv:2212.04356](https://arxiv.org/abs/2212.04356)
A fine-tuned version of openai/whisper-large-v2 on the pavanmantha/sanskrit_asr dataset for Sanskrit (sa) automatic speech recognition, trained with the Hugging Face `Seq2SeqTrainer` and DeepSpeed ZeRO-3 via Accelerate on 5× NVIDIA A10G GPUs.
| Property | Value |
|---|---|
| Base model | openai/whisper-large-v2 |
| Language | Sanskrit (sa) — Devanagari script |
| Task | Automatic Speech Recognition (transcribe) |
| Fine-tuning framework | HuggingFace Transformers + DeepSpeed ZeRO-3 |
| Precision | bf16 (Ampere / sm_86 native) |
| Parameters | ~1.5B |
| License | Apache 2.0 |
| Property | Value |
|---|---|
| Dataset | pavanmantha/sanskrit_asr |
| Train split | ~95% of full dataset (5% held out for validation, seed=42) |
| Validation split | ~5% of full dataset |
| Audio column | audio (resampled to 16 kHz) |
| Text column | sentence |
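The split described above can be sketched with the `datasets` library (a sketch, assuming the dataset ships a single `train` split; column names are taken from the table):

```python
from datasets import Audio, load_dataset

# Load the ASR dataset and resample the audio column to 16 kHz.
ds = load_dataset("pavanmantha/sanskrit_asr", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# 95% / 5% train/validation split with the seed from the table.
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```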
| Property | Value |
|---|---|
| GPUs | 5× NVIDIA A10G (22.5 GB VRAM each) |
| Instance type | AWS G5 (or equivalent) |
| Distributed strategy | DeepSpeed ZeRO Stage 3 via Accelerate |
| ZeRO-3 flags | stage3_gather_16bit_weights_on_model_save: true, overlap_comm: true, contiguous_gradients: true |
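The ZeRO-3 flags above correspond to a DeepSpeed config along these lines (a sketch; `auto` values are filled in by the Trainer, and the surrounding keys are assumptions):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```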
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 160 (8 × 4 × 5 GPUs) |
| Learning rate | 5e-6 |
| LR scheduler | Linear with warmup |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Eval beam size | 5 |
| Generation max length | 225 tokens |
| Eval & save frequency | Every 500 steps |
| Best model metric | WER (lower is better) |
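These hyperparameters map onto `Seq2SeqTrainingArguments` roughly as follows (a sketch; `output_dir` and the DeepSpeed config path are placeholders, not the actual values used):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-sa",          # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # effective batch = 8 * 4 * 5 GPUs = 160
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_steps=200,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    generation_num_beams=5,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    # deepspeed="ds_zero3.json",        # path to the ZeRO-3 config (placeholder)
)
```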
WER is computed after applying a custom Sanskrit/Devanagari normalizer that:

- strips danda (।), double-danda (॥), and common ASCII punctuation
- removes the nukta (U+093C): normalizes ज़ → ज, फ़ → फ
- maps candrabindu to anusvara (U+0901 → U+0902)

Other training details:

- `gradient_checkpointing=True` with `use_reentrant=False` (required for ZeRO-3 compatibility)
- `forced_decoder_ids=None` and `suppress_tokens=[]` to unblock Devanagari character tokens during generation
- `predict_with_generate=True` with beam search (`num_beams=5`) for evaluation

Quick transcription with the `pipeline` API:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-sa",
    generate_kwargs={"language": "sanskrit", "task": "transcribe"},
)

result = asr("path/to/audio.mp3")
print(result["text"])
```
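The Devanagari normalizer described above can be sketched as a small Python function (a minimal sketch; the exact rule set and function name in the training script may differ):

```python
import re
import string

def normalize_devanagari(text: str) -> str:
    # Strip danda (U+0964), double danda (U+0965), and common ASCII punctuation.
    text = re.sub(f"[{re.escape(string.punctuation)}\u0964\u0965]", " ", text)
    # Remove the nukta (U+093C), e.g. ja+nukta -> ja, pha+nukta -> pha.
    text = text.replace("\u093c", "")
    # Map candrabindu (U+0901) to anusvara (U+0902).
    text = text.replace("\u0901", "\u0902")
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())

print(normalize_devanagari("\u091c\u093c\u0964"))  # "ज़।" -> "ज"
```

Both references and hypotheses are normalized this way before WER is computed, so punctuation and orthographic variants do not count as errors.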
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "pavanmantha/whisper-medium-sa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load your audio (must be 16 kHz mono)
# audio_array: np.ndarray, shape (N,), dtype float32
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="sa",
        task="transcribe",
        num_beams=5,
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
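For a quick smoke test of the snippet above, `audio_array` can be any 16 kHz mono float32 array; a synthetic one-second tone works as a stand-in (illustrative only — real speech is needed for meaningful output):

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper expects 16 kHz input
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of sample times
audio_array = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
print(audio_array.shape, audio_array.dtype)  # (16000,) float32
```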
| Split | WER |
|---|---|
| Validation | TBD |
WER values are computed using the custom Devanagari normalizer described above.
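For reference, WER is the word-level Levenshtein distance divided by the reference length. In practice a library such as `jiwer` or `evaluate` is used; a minimal self-contained version looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a b c", "a x c"))  # one substitution out of three words -> 0.333...
```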
| File | Description |
|---|---|
| `model.safetensors` | Fine-tuned model weights |
| `config.json` | Model architecture config |
| `generation_config.json` | Generation defaults (language=sa, task=transcribe) |
| `tokenizer.json` | Whisper tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `processor_config.json` | Processor configuration |
| `vocab.json` | Vocabulary file |
| `merges.txt` | BPE merge rules |
This model is fine-tuned for Sanskrit (sa) in Devanagari script; performance on other languages or scripts is not guaranteed.

If you use this model, please cite the original Whisper paper:
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}
```
Fine-tuned by Pavan Kumar Mantha · GitHub