Whisper Medium — Hindi ASR Fine-Tuned

A fine-tuned version of openai/whisper-medium on the pavanmantha/indic_asr dataset for Hindi automatic speech recognition (transcription).


Model Details

Model Description

This model is fine-tuned from OpenAI's Whisper Medium checkpoint specifically for Hindi speech transcription. It was trained on the pavanmantha/indic_asr dataset using a supervised sequence-to-sequence approach. The model takes raw audio (16 kHz) as input and outputs the corresponding Hindi transcript.

  • Developed by: Pavan Mantha
  • Model type: Sequence-to-Sequence ASR (Encoder-Decoder Transformer)
  • Language(s): Hindi (hi)
  • License: Apache 2.0
  • Finetuned from: openai/whisper-medium
  • Task: Automatic Speech Recognition — Transcription

Model Sources

  • Base checkpoint: https://huggingface.co/openai/whisper-medium
  • Repository: (add your Hub model link)

Uses

Direct Use

This model can be used out-of-the-box for Hindi speech-to-text transcription without any additional fine-tuning. It is suitable for:

  • Transcribing Hindi audio files or streams
  • Building voice-enabled applications for Hindi speakers
  • Integrating into pipelines for downstream NLP tasks (e.g., translation, summarization)

Downstream Use

The model can be plugged into larger systems such as:

  • Hindi voice assistants
  • Call center analytics for Hindi-language audio
  • Subtitle generation for Hindi media content
  • Multi-lingual ASR pipelines targeting Indic languages

Out-of-Scope Use

  • Non-Hindi audio: The model is specialized for Hindi and will perform poorly on other languages without further fine-tuning.
  • Noisy or far-field audio: Performance may degrade significantly with low-quality microphone recordings, heavy background noise, or telephony-quality audio.
  • Code-switching: Mixed Hindi-English speech (Hinglish) is not explicitly supported.
  • Real-time low-latency applications: Whisper Medium may not meet strict latency requirements for edge deployment without quantization.
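
Where latency is a concern, half-precision inference on GPU is a lighter first step than full quantization. A minimal sketch (assumes a CUDA-capable device; this is not part of the model's release):

import torch
from transformers import pipeline

# Half precision roughly halves memory use and speeds up decoding on GPU.
asr_fp16 = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    torch_dtype=torch.float16,
    device=0,  # first CUDA device
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)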

Bias, Risks, and Limitations

  • The model's performance is bounded by the quality and diversity of the pavanmantha/indic_asr training data. Underrepresented accents, dialects, or speaking styles may see higher WER.
  • Whisper's tokenizer and generation config are fixed to Hindi; cross-lingual transfer is not guaranteed.
  • Like all ASR systems, the model may misrecognize proper nouns, technical terms, and rare vocabulary.
  • The model does not perform speaker diarization or punctuation restoration by default.

Recommendations

Users should evaluate the model on their target domain (e.g., broadcast, conversational, telephony) before deployment. For production use, consider post-processing with a language model or vocabulary adaptation for domain-specific terms.
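
As a quick domain check, you can transcribe a handful of in-domain recordings and score them against reference transcripts with jiwer. A minimal sketch (file paths and reference texts are placeholders):

from jiwer import wer
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)

# Placeholder in-domain samples: (audio path, reference transcript) pairs.
samples = [
    ("clip1.wav", "संदर्भ प्रतिलेख एक"),
    ("clip2.wav", "संदर्भ प्रतिलेख दो"),
]

refs = [ref for _, ref in samples]
hyps = [asr(path)["text"] for path, _ in samples]
print(f"Domain WER: {wer(refs, hyps):.2%}")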


How to Get Started with the Model

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    generate_kwargs={"language": "hindi", "task": "transcribe"}
)

result = asr("path/to/your/hindi_audio.wav")
print(result["text"])
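
For recordings longer than Whisper's 30-second context window, the pipeline can transcribe in chunks. A minimal sketch (chunk_length_s=30 is a common choice, not a value fixed by this model):

asr_long = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    chunk_length_s=30,  # split long audio into 30 s windows
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)
print(asr_long("path/to/long_hindi_audio.wav")["text"])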

Or using the processor/model directly:

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("pavanmantha/whisper-medium-hi")
model = WhisperForConditionalGeneration.from_pretrained("pavanmantha/whisper-medium-hi")
model.eval()

# Run on GPU when available; CPU works but is slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features.to(device),
        language="hindi",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Training Details

Training Data

The model was trained on the pavanmantha/indic_asr dataset, which contains Hindi speech-text pairs. The dataset was split 80/20 into train and test sets (seed=42):

  • Train split: ~80% of total samples
  • Test split: ~20% of total samples
  • Audio sampling rate: 16,000 Hz (resampled using HuggingFace Audio column cast)
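
A sketch of how this split can be reproduced with the datasets library (the split name and audio column name are assumptions; check the dataset's actual schema):

from datasets import load_dataset, Audio

ds = load_dataset("pavanmantha/indic_asr", split="train")

# Cast the audio column so examples are decoded at 16 kHz on access.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# 80/20 train/test split with a fixed seed for reproducibility.
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]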

Training Procedure

Preprocessing

  • Audio resampled to 16 kHz using the datasets Audio feature.
  • Log-mel spectrograms computed via WhisperFeatureExtractor (80-channel, 30-second context window).
  • Text labels tokenized using WhisperTokenizer with language="Hindi" and task="transcribe".
  • Dynamic padding applied per batch via a custom DataCollatorSpeechSeq2SeqWithPadding (sketched below).
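
A minimal sketch of such a collator, following the pattern from the Hugging Face Whisper fine-tuning tutorial (the exact collator used for this model may differ):

from dataclasses import dataclass
from typing import Any, Dict, List

import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad log-mel input features to a uniform shape.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad label token ids, masking padding with -100 so the loss ignores it.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # Drop the leading BOS token if the tokenizer already prepended it;
        # the model re-adds it during teacher forcing.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch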

Training Hyperparameters

  • Base model: openai/whisper-medium
  • Training regime: BF16 mixed precision
  • Per-device train batch size: 16
  • Gradient accumulation steps: 4 (effective batch = 64)
  • Learning rate: 1e-5
  • LR scheduler: Linear
  • Warmup steps: 500
  • Epochs: 3
  • Evaluation strategy: every 500 steps
  • Save strategy: every 500 steps (best model retained)
  • Generation max length: 225 tokens
  • Gradient checkpointing: enabled
  • DataLoader workers: 4
  • DataLoader pin memory: enabled
  • Save total limit: 2 checkpoints
  • Metric for best model: WER (lower is better)
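
For reference, a Seq2SeqTrainingArguments sketch consistent with the list above (the output directory is a placeholder; on older transformers releases the eval_strategy argument is named evaluation_strategy):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch size 64
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=225,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    report_to=["tensorboard"],
)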

Speeds, Sizes, Times

  • Hardware: GPU with BF16 support (e.g., A100 / H100 recommended)
  • Framework: HuggingFace Transformers + Seq2SeqTrainer
  • Logging: TensorBoard
  • Training time: (fill in after training)

Evaluation

Testing Data

Evaluation was performed on the held-out 20% test split of pavanmantha/indic_asr (seed=42, same as train/test split).

Metrics

Word Error Rate (WER) — the primary metric, computed using the jiwer library. WER measures the edit distance (insertions, deletions, substitutions) between predicted and reference transcripts, expressed as a percentage. Lower is better.
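
As an illustration, a compute_metrics function consistent with this description, in the style used with Seq2SeqTrainer (a sketch; assumes a WhisperProcessor named processor is in scope):

import jiwer

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Restore the padding token id that was masked to -100 during collation.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # jiwer returns a fraction; report it as a percentage.
    return {"wer": 100 * jiwer.wer(label_str, pred_str)}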

Results

  • WER (test set): (add your final WER here after training)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: GPU (BF16 capable — e.g., NVIDIA A100)
  • Hours used: (fill in after training)
  • Cloud Provider: (fill in)
  • Compute Region: (fill in)
  • Carbon Emitted: (fill in)

Technical Specifications

Model Architecture and Objective

  • Architecture: Whisper Medium — 24-layer encoder, 24-layer decoder, 1024 hidden dim, ~769M parameters
  • Objective: Cross-entropy loss on token prediction (teacher-forced during training, autoregressive at inference)
  • Decoder config: forced_decoder_ids=None, language=hindi, task=transcribe

Compute Infrastructure

  • Framework: PyTorch + HuggingFace Transformers
  • Mixed precision: BF16
  • Distributed training: Single-node (multi-GPU supported via Seq2SeqTrainer)

Citation

If you use this model, please cite the original Whisper paper:

BibTeX:

@inproceedings{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={Proceedings of the 40th International Conference on Machine Learning},
  year={2023}
}

Model Card Authors

Pavan Mantha

Model Card Contact

pavanmantha on HuggingFace
