Robust Speech Recognition via Large-Scale Weak Supervision
Paper: [arXiv:2212.04356](https://arxiv.org/abs/2212.04356)
A fine-tuned version of openai/whisper-large-v2 on the pavanmantha/sanskrit_asr dataset for Sanskrit (sa) automatic speech recognition, trained with the Hugging Face `Seq2SeqTrainer` and DeepSpeed ZeRO-3 via Accelerate on 5× NVIDIA A10G GPUs.
| Property | Value |
|---|---|
| Base model | openai/whisper-large-v2 |
| Language | Sanskrit (sa) — Devanagari script |
| Task | Automatic Speech Recognition (transcribe) |
| Fine-tuning framework | HuggingFace Transformers + DeepSpeed ZeRO-3 |
| Precision | bf16 (Ampere / sm_86 native) |
| Parameters | ~1.5B |
| License | Apache 2.0 |
| Property | Value |
|---|---|
| Dataset | pavanmantha/sanskrit_asr |
| Train split | ~95% of full dataset (5% held out for validation, seed=42) |
| Validation split | ~5% of full dataset |
| Audio column | audio (resampled to 16 kHz) |
| Text column | sentence |
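The split described above can be sketched with the `datasets` library (a sketch, assuming the dataset ships a single `train` split; column names are taken from the table):

```python
from datasets import Audio, load_dataset

# Load the ASR dataset and resample the audio column to 16 kHz.
ds = load_dataset("pavanmantha/sanskrit_asr", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# 95% / 5% train/validation split with the seed from the table.
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```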
| Property | Value |
|---|---|
| GPUs | 5× NVIDIA A10G (22.5 GB VRAM each) |
| Instance type | AWS G5 (or equivalent) |
| Distributed strategy | DeepSpeed ZeRO Stage 3 via Accelerate |
| ZeRO-3 flags | stage3_gather_16bit_weights_on_model_save: true, overlap_comm: true, contiguous_gradients: true |
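The ZeRO-3 flags above correspond to a DeepSpeed config along these lines (a sketch; `auto` values are filled in by the Trainer, and the surrounding keys are assumptions):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```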
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 160 (8 × 4 × 5 GPUs) |
| Learning rate | 5e-6 |
| LR scheduler | Linear with warmup |
| Warmup steps | 200 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Eval beam size | 5 |
| Generation max length | 225 tokens |
| Eval & save frequency | Every 500 steps |
| Best model metric | WER (lower is better) |
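These hyperparameters map onto `Seq2SeqTrainingArguments` roughly as follows (a sketch; `output_dir` and the DeepSpeed config path are placeholders, not the actual values used):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-sa",          # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # effective batch = 8 * 4 * 5 GPUs = 160
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_steps=200,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    generation_num_beams=5,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    # deepspeed="ds_zero3.json",        # path to the ZeRO-3 config (placeholder)
)
```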
WER is computed after applying a custom Sanskrit/Devanagari normalizer that:

- strips danda (।), double-danda (॥), and common ASCII punctuation
- removes the nukta (U+093C): normalizes ज़ → ज, फ़ → फ
- maps candrabindu to anusvara (U+0901 → U+0902)

Other training details:

- `gradient_checkpointing=True` with `use_reentrant=False` (required for ZeRO-3 compatibility)
- `forced_decoder_ids=None` and `suppress_tokens=[]` to unblock Devanagari character tokens during generation
- `predict_with_generate=True` with beam search (`num_beams=5`) for evaluation

Quick transcription with the `pipeline` API:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-sa",
    generate_kwargs={"language": "sanskrit", "task": "transcribe"},
)

result = asr("path/to/audio.mp3")
print(result["text"])
```
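The Devanagari normalizer described above can be sketched as a small Python function (a minimal sketch; the exact rule set and function name in the training script may differ):

```python
import re
import string

def normalize_devanagari(text: str) -> str:
    # Strip danda (U+0964), double danda (U+0965), and common ASCII punctuation.
    text = re.sub(f"[{re.escape(string.punctuation)}\u0964\u0965]", " ", text)
    # Remove the nukta (U+093C), e.g. ja+nukta -> ja, pha+nukta -> pha.
    text = text.replace("\u093c", "")
    # Map candrabindu (U+0901) to anusvara (U+0902).
    text = text.replace("\u0901", "\u0902")
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())

print(normalize_devanagari("\u091c\u093c\u0964"))  # "ज़।" -> "ज"
```

Both references and hypotheses are normalized this way before WER is computed, so punctuation and orthographic variants do not count as errors.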
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "pavanmantha/whisper-medium-sa"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load your audio (must be 16 kHz mono)
# audio_array: np.ndarray, shape (N,), dtype float32
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="sa",
        task="transcribe",
        num_beams=5,
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
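For a quick smoke test of the snippet above, `audio_array` can be any 16 kHz mono float32 array; a synthetic one-second tone works as a stand-in (illustrative only — real speech is needed for meaningful output):

```python
import numpy as np

SAMPLE_RATE = 16000  # Whisper expects 16 kHz input
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE  # one second of sample times
audio_array = (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
print(audio_array.shape, audio_array.dtype)  # (16000,) float32
```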
| Split | WER |
|---|---|
| Validation | TBD |
WER values are computed using the custom Devanagari normalizer described above.
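For reference, WER is the word-level Levenshtein distance divided by the reference length. In practice a library such as `jiwer` or `evaluate` is used; a minimal self-contained version looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("a b c", "a x c"))  # one substitution out of three words -> 0.333...
```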
| File | Description |
|---|---|
| `model.safetensors` | Fine-tuned model weights |
| `config.json` | Model architecture config |
| `generation_config.json` | Generation defaults (language=sa, task=transcribe) |
| `tokenizer.json` | Whisper tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `processor_config.json` | Processor configuration |
| `vocab.json` | Vocabulary file |
| `merges.txt` | BPE merge rules |
This model is fine-tuned for Sanskrit (sa) in Devanagari script; performance on other languages or scripts is not guaranteed.

If you use this model, please cite the original Whisper paper:
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}
```
Fine-tuned by Pavan Kumar Mantha · GitHub