# Whisper Medium — Hindi ASR Fine-Tuned
A fine-tuned version of openai/whisper-medium on the pavanmantha/indic_asr dataset for Hindi automatic speech recognition (transcription).
## Model Details

### Model Description
This model is fine-tuned from OpenAI's Whisper Medium checkpoint specifically for Hindi speech transcription. It was trained on the pavanmantha/indic_asr dataset using a supervised sequence-to-sequence approach. The model takes raw audio (16 kHz) as input and outputs the corresponding Hindi transcript.
- Developed by: Pavan Mantha
- Model type: Sequence-to-Sequence ASR (Encoder-Decoder Transformer)
- Language(s): Hindi (`hi`)
- License: Apache 2.0
- Finetuned from: openai/whisper-medium
- Task: Automatic Speech Recognition — Transcription
### Model Sources
- Base Model: https://huggingface.co/openai/whisper-medium
- Dataset: https://huggingface.co/datasets/pavanmantha/indic_asr
## Uses

### Direct Use
This model can be used out-of-the-box for Hindi speech-to-text transcription without any additional fine-tuning. It is suitable for:
- Transcribing Hindi audio files or streams
- Building voice-enabled applications for Hindi speakers
- Integrating into pipelines for downstream NLP tasks (e.g., translation, summarization)
### Downstream Use
The model can be plugged into larger systems such as:
- Hindi voice assistants
- Call center analytics for Hindi-language audio
- Subtitle generation for Hindi media content
- Multi-lingual ASR pipelines targeting Indic languages
### Out-of-Scope Use
- Non-Hindi audio: The model is specialized for Hindi and will perform poorly on other languages without further fine-tuning.
- Noisy or far-field audio: Performance may degrade significantly with low-quality microphone recordings, heavy background noise, or telephony-quality audio.
- Code-switching: Mixed Hindi-English speech (Hinglish) is not explicitly supported.
- Real-time low-latency applications: Whisper Medium may not meet strict latency requirements for edge deployment without quantization.
## Bias, Risks, and Limitations
- The model's performance is bounded by the quality and diversity of the `pavanmantha/indic_asr` training data. Underrepresented accents, dialects, or speaking styles may see higher WER.
- Whisper's tokenizer and generation config are fixed to Hindi; cross-lingual transfer is not guaranteed.
- Like all ASR systems, the model may misrecognize proper nouns, technical terms, and rare vocabulary.
- The model does not perform speaker diarization or punctuation restoration by default.
### Recommendations
Users should evaluate the model on their target domain (e.g., broadcast, conversational, telephony) before deployment. For production use, consider post-processing with a language model or vocabulary adaptation for domain-specific terms.
## How to Get Started with the Model

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)

result = asr("path/to/your/hindi_audio.wav")
print(result["text"])
```
Or using the processor/model directly:
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("pavanmantha/whisper-medium-hi")
model = WhisperForConditionalGeneration.from_pretrained("pavanmantha/whisper-medium-hi")
model.eval()

# Load and resample the audio to Whisper's expected 16 kHz
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="hindi",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
## Training Details

### Training Data
The model was trained on the pavanmantha/indic_asr dataset, which contains Hindi speech-text pairs. The dataset was split 80/20 into train and test sets (seed=42):
- Train split: ~80% of total samples
- Test split: ~20% of total samples
- Audio sampling rate: 16,000 Hz (resampled via the Hugging Face `Audio` column cast)
### Training Procedure

#### Preprocessing

- Audio resampled to 16 kHz using the `datasets` `Audio` feature.
- Log-mel spectrograms computed via `WhisperFeatureExtractor` (80-channel, 30-second context window).
- Text labels tokenized using `WhisperTokenizer` with `language="Hindi"` and `task="transcribe"`.
- Dynamic padding applied per batch via a custom `DataCollatorSpeechSeq2SeqWithPadding`.
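A collator of this kind is commonly written along the following lines (this follows the standard Hugging Face Whisper fine-tuning recipe; the repository's exact implementation may differ):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pads audio features and token labels independently per batch."""
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-mel input features to a uniform tensor
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad label token ids, then replace padding with -100 so it is
        # ignored by the cross-entropy loss
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )

        # Drop a leading BOS token if the tokenizer already prepended one;
        # the model re-adds it during training
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```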
#### Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-medium |
| Training regime | BF16 mixed precision |
| Per-device train batch size | 16 |
| Gradient accumulation steps | 4 (effective batch = 64) |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup steps | 500 |
| Epochs | 3 |
| Evaluation strategy | Every 500 steps |
| Save strategy | Every 500 steps (best model retained) |
| Generation max length | 225 tokens |
| Gradient checkpointing | Enabled |
| DataLoader workers | 4 |
| DataLoader pin memory | Enabled |
| Save total limit | 2 checkpoints |
| Metric for best model | WER (lower is better) |
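Under a standard `Seq2SeqTrainer` setup, the table above corresponds roughly to the following arguments. This is a hedged reconstruction, not the repository's actual code: the output directory is assumed, and some flag spellings (e.g., `eval_strategy` vs. `evaluation_strategy`) vary across Transformers versions.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # assumed path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch = 64
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
    bf16=True,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,           # lower WER is better
    generation_max_length=225,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    report_to=["tensorboard"],
)
```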
#### Speeds, Sizes, Times

- Hardware: GPU with BF16 support (e.g., A100 / H100 recommended)
- Framework: Hugging Face Transformers + `Seq2SeqTrainer`
- Logging: TensorBoard
## Evaluation

### Testing Data
Evaluation was performed on the held-out 20% test split of pavanmantha/indic_asr (seed=42, same as train/test split).
### Metrics
Word Error Rate (WER) — the primary metric, computed using the jiwer library. WER measures the edit distance (insertions, deletions, substitutions) between predicted and reference transcripts, expressed as a percentage. Lower is better.
### Results
| Metric | Value |
|---|---|
| WER (test set) | (add your final WER here after training) |
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- Hardware Type: GPU (BF16 capable — e.g., NVIDIA A100)
- Hours used: (fill in after training)
- Cloud Provider: (fill in)
- Compute Region: (fill in)
- Carbon Emitted: (fill in)
## Technical Specifications

### Model Architecture and Objective

- Architecture: Whisper Medium — 24-layer encoder, 24-layer decoder, 1024 hidden dim, ~769M parameters
- Objective: Cross-entropy loss on token prediction (teacher-forced during training, autoregressive at inference)
- Decoder config: `forced_decoder_ids=None`, `language="hindi"`, `task="transcribe"`
### Compute Infrastructure

- Framework: PyTorch + Hugging Face Transformers
- Mixed precision: BF16
- Distributed training: Single-node (multi-GPU supported via `Seq2SeqTrainer`)
## Citation

If you use this model, please cite the original Whisper paper:

```bibtex
@inproceedings{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={International Conference on Machine Learning},
  year={2023}
}
```
## Model Card Authors
Pavan Mantha
## Model Card Contact