Whisper Medium — Hindi ASR Fine-Tuned

A fine-tuned version of openai/whisper-medium on the pavanmantha/indic_asr dataset for Hindi automatic speech recognition (transcription).


Model Details

Model Description

This model is fine-tuned from OpenAI's Whisper Medium checkpoint specifically for Hindi speech transcription. It was trained on the pavanmantha/indic_asr dataset using a supervised sequence-to-sequence approach. The model takes raw audio (16 kHz) as input and outputs the corresponding Hindi transcript.

  • Developed by: Pavan Mantha
  • Model type: Sequence-to-Sequence ASR (Encoder-Decoder Transformer)
  • Language(s): Hindi (hi)
  • License: Apache 2.0
  • Finetuned from: openai/whisper-medium
  • Task: Automatic Speech Recognition — Transcription

Model Sources

  • Base checkpoint: https://huggingface.co/openai/whisper-medium
  • Repository: (add your Hub model link)

Uses

Direct Use

This model can be used out-of-the-box for Hindi speech-to-text transcription without any additional fine-tuning. It is suitable for:

  • Transcribing Hindi audio files or streams
  • Building voice-enabled applications for Hindi speakers
  • Integrating into pipelines for downstream NLP tasks (e.g., translation, summarization)

Downstream Use

The model can be plugged into larger systems such as:

  • Hindi voice assistants
  • Call center analytics for Hindi-language audio
  • Subtitle generation for Hindi media content
  • Multi-lingual ASR pipelines targeting Indic languages

Out-of-Scope Use

  • Non-Hindi audio: The model is specialized for Hindi and will perform poorly on other languages without further fine-tuning.
  • Noisy or far-field audio: Performance may degrade significantly with low-quality microphone recordings, heavy background noise, or telephony-quality audio.
  • Code-switching: Mixed Hindi-English speech (Hinglish) is not explicitly supported.
  • Real-time low-latency applications: Whisper Medium may not meet strict latency requirements for edge deployment without quantization.
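
Where latency is a concern, half-precision inference on GPU is a lighter first step than full quantization. A minimal sketch (assumes a CUDA-capable device; this is not part of the model's release):

import torch
from transformers import pipeline

# Half precision roughly halves memory use and speeds up decoding on GPU.
asr_fp16 = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    torch_dtype=torch.float16,
    device=0,  # first CUDA device
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)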

Bias, Risks, and Limitations

  • The model's performance is bounded by the quality and diversity of the pavanmantha/indic_asr training data. Underrepresented accents, dialects, or speaking styles may see higher WER.
  • Whisper's tokenizer and generation config are fixed to Hindi; cross-lingual transfer is not guaranteed.
  • Like all ASR systems, the model may misrecognize proper nouns, technical terms, and rare vocabulary.
  • The model does not perform speaker diarization or punctuation restoration by default.

Recommendations

Users should evaluate the model on their target domain (e.g., broadcast, conversational, telephony) before deployment. For production use, consider post-processing with a language model or vocabulary adaptation for domain-specific terms.
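
As a quick domain check, you can transcribe a handful of in-domain recordings and score them against reference transcripts with jiwer. A minimal sketch (file paths and reference texts are placeholders):

from jiwer import wer
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)

# Placeholder in-domain samples: (audio path, reference transcript) pairs.
samples = [
    ("clip1.wav", "संदर्भ प्रतिलेख एक"),
    ("clip2.wav", "संदर्भ प्रतिलेख दो"),
]

refs = [ref for _, ref in samples]
hyps = [asr(path)["text"] for path, _ in samples]
print(f"Domain WER: {wer(refs, hyps):.2%}")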


How to Get Started with the Model

from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    generate_kwargs={"language": "hindi", "task": "transcribe"}
)

result = asr("path/to/your/hindi_audio.wav")
print(result["text"])
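
For recordings longer than Whisper's 30-second context window, the pipeline can transcribe in chunks. A minimal sketch (chunk_length_s=30 is a common choice, not a value fixed by this model):

asr_long = pipeline(
    "automatic-speech-recognition",
    model="pavanmantha/whisper-medium-hi",  # replace with your Hub model ID
    chunk_length_s=30,  # split long audio into 30 s windows
    generate_kwargs={"language": "hindi", "task": "transcribe"},
)
print(asr_long("path/to/long_hindi_audio.wav")["text"])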

Or using the processor/model directly:

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("pavanmantha/whisper-medium-hi")
model = WhisperForConditionalGeneration.from_pretrained("pavanmantha/whisper-medium-hi")
model.eval()

# Run on GPU when available; CPU works but is slower.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features.to(device),
        language="hindi",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])

Training Details

Training Data

The model was trained on the pavanmantha/indic_asr dataset, which contains Hindi speech-text pairs. The dataset was split 80/20 into train and test sets (seed=42):

  • Train split: ~80% of total samples
  • Test split: ~20% of total samples
  • Audio sampling rate: 16,000 Hz (resampled using HuggingFace Audio column cast)
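
A sketch of how this split can be reproduced with the datasets library (the split name and audio column name are assumptions; check the dataset's actual schema):

from datasets import load_dataset, Audio

ds = load_dataset("pavanmantha/indic_asr", split="train")

# Cast the audio column so examples are decoded at 16 kHz on access.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# 80/20 train/test split with a fixed seed for reproducibility.
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]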

Training Procedure

Preprocessing

  • Audio resampled to 16 kHz using the datasets Audio feature.
  • Log-mel spectrograms computed via WhisperFeatureExtractor (80-channel, 30-second context window).
  • Text labels tokenized using WhisperTokenizer with language="Hindi" and task="transcribe".
  • Dynamic padding applied per batch via a custom DataCollatorSpeechSeq2SeqWithPadding (sketched below).
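
A minimal sketch of such a collator, following the pattern from the Hugging Face Whisper fine-tuning tutorial (the exact collator used for this model may differ):

from dataclasses import dataclass
from typing import Any, Dict, List

import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad log-mel input features to a uniform shape.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad label token ids, masking padding with -100 so the loss ignores it.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # Drop the leading BOS token if the tokenizer already prepended it;
        # the model re-adds it during teacher forcing.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch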

Training Hyperparameters

  • Base model: openai/whisper-medium
  • Training regime: BF16 mixed precision
  • Per-device train batch size: 16
  • Gradient accumulation steps: 4 (effective batch = 64)
  • Learning rate: 1e-5
  • LR scheduler: Linear
  • Warmup steps: 500
  • Epochs: 3
  • Evaluation strategy: every 500 steps
  • Save strategy: every 500 steps (best model retained)
  • Generation max length: 225 tokens
  • Gradient checkpointing: enabled
  • DataLoader workers: 4
  • DataLoader pin memory: enabled
  • Save total limit: 2 checkpoints
  • Metric for best model: WER (lower is better)
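
For reference, a Seq2SeqTrainingArguments sketch consistent with the list above (the output directory is a placeholder; on older transformers releases the eval_strategy argument is named evaluation_strategy):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-hi",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch size 64
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,
    generation_max_length=225,
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    report_to=["tensorboard"],
)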

Speeds, Sizes, Times

  • Hardware: GPU with BF16 support (e.g., A100 / H100 recommended)
  • Framework: HuggingFace Transformers + Seq2SeqTrainer
  • Logging: TensorBoard
  • Training time: (fill in after training)

Evaluation

Testing Data

Evaluation was performed on the held-out 20% test split of pavanmantha/indic_asr (seed=42, same as train/test split).

Metrics

Word Error Rate (WER) — the primary metric, computed using the jiwer library. WER measures the edit distance (insertions, deletions, substitutions) between predicted and reference transcripts, expressed as a percentage. Lower is better.
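
As an illustration, a compute_metrics function consistent with this description, in the style used with Seq2SeqTrainer (a sketch; assumes a WhisperProcessor named processor is in scope):

import jiwer

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Restore the padding token id that was masked to -100 during collation.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # jiwer returns a fraction; report it as a percentage.
    return {"wer": 100 * jiwer.wer(label_str, pred_str)}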

Results

  • WER (test set): (add your final WER here after training)

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: GPU (BF16 capable — e.g., NVIDIA A100)
  • Hours used: (fill in after training)
  • Cloud Provider: (fill in)
  • Compute Region: (fill in)
  • Carbon Emitted: (fill in)

Technical Specifications

Model Architecture and Objective

  • Architecture: Whisper Medium — 24-layer encoder, 24-layer decoder, 1024 hidden dim, ~769M parameters
  • Objective: Cross-entropy loss on token prediction (teacher-forced during training, autoregressive at inference)
  • Decoder config: forced_decoder_ids=None, language=hindi, task=transcribe

Compute Infrastructure

  • Framework: PyTorch + HuggingFace Transformers
  • Mixed precision: BF16
  • Distributed training: Single-node (multi-GPU supported via Seq2SeqTrainer)

Citation

If you use this model, please cite the original Whisper paper:

BibTeX:

@inproceedings{radford2023robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  booktitle={Proceedings of the 40th International Conference on Machine Learning},
  year={2023}
}

Model Card Authors

Pavan Mantha

Model Card Contact

pavanmantha on HuggingFace
