whisper-small-hindi

Fine-tuned openai/whisper-small on approximately 10 hours of conversational Hindi speech, achieving a WER reduction from 48.26% → 31.51% on the in-domain evaluation set.


Model Description

This model is a domain-adapted version of Whisper-small, fine-tuned for conversational Hindi ASR. The base model was adapted using a carefully engineered utterance-level dataset derived from 104 long-form Hindi recordings. Key contributions include a custom data pipeline, disfluency-aware dataset construction, and a novel lattice-based multi-system consensus evaluation framework.

  • Model type: Whisper (encoder-decoder Transformer)
  • Language: Hindi (hi)
  • Task: Automatic Speech Recognition (ASR)
  • Base model: openai/whisper-small (244M parameters)
  • Fine-tuned by: Joshuva Vinith
  • License: MIT

Training Data

The training corpus was constructed from scratch, starting from 104 long-form conversational Hindi recordings (~10 hours total). The pipeline involved:

  • Programmatic URL reconstruction and audio retrieval
  • JSON-aligned utterance-level segmentation using pydub
  • Audio standardization to 16 kHz mono using librosa
  • Removal of 209 redacted segments for privacy
  • Final usable set: 5,732 utterance clips
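
The segmentation step above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the JSON field names (`start`, `end`, `text`, `redacted`) are assumptions, and the demo operates on a synthetic NumPy waveform rather than real audio.

```python
import json
import numpy as np

SR = 16_000  # target sample rate (16 kHz mono, as in the pipeline)

def segment_utterances(audio: np.ndarray, alignment_json: str):
    """Slice a mono waveform into utterance clips using JSON alignments.

    Each alignment entry carries start/end times in seconds plus the
    transcript; redacted segments are dropped, mirroring the privacy
    filtering described above.
    """
    clips = []
    for seg in json.loads(alignment_json):
        if seg.get("redacted"):
            continue  # skip redacted segments
        start = int(seg["start"] * SR)
        end = int(seg["end"] * SR)
        clips.append((audio[start:end], seg["text"]))
    return clips

# Tiny demo: 3 seconds of silence with two fake alignment entries
audio = np.zeros(3 * SR, dtype=np.float32)
alignment = json.dumps([
    {"start": 0.0, "end": 1.0, "text": "नमस्ते"},
    {"start": 1.0, "end": 2.5, "text": "[redacted]", "redacted": True},
])
clips = segment_utterances(audio, alignment)
print(len(clips), len(clips[0][0]))  # 1 clip of 16000 samples
```

In the real pipeline, pydub handles the slicing directly on the audio files and librosa performs the 16 kHz mono standardization; the slicing logic is the same idea.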

A spelling analysis of 7,457 unique transcript word types found that 65.5% were orthographically noisy, which directly informed the text normalization strategy.
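
One simple way such an analysis can work is sketched below. The noise heuristic (flagging tokens that mix Devanagari with Latin letters or digits) is an illustrative assumption, not the exact criterion used in this project.

```python
import re

# Hypothetical noise heuristic: a token is "orthographically noisy" if it
# mixes Devanagari characters with Latin letters or ASCII digits.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]")

def is_noisy(token: str) -> bool:
    return bool(DEVANAGARI.search(token) and LATIN_OR_DIGIT.search(token))

tokens = ["नमस्ते", "नमस्ते2", "hello", "क्या123"]
unique = sorted(set(tokens))
noisy = [t for t in unique if is_noisy(t)]
print(noisy)
```

A real normalization pass would then map each noisy type to a canonical spelling before computing WER.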


Training Configuration

Parameter                Value
Base model               whisper-small
Epochs                   4 (best at epoch 3)
Batch size               8
Learning rate            1e-5
Optimizer                AdamW
Weight decay             0.01
Warmup steps             500
Mixed precision          FP16
Gradient checkpointing   Enabled
Framework                HuggingFace Seq2SeqTrainer
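
Under these settings, the trainer configuration might look like the sketch below (not the exact training script; `output_dir` is a placeholder, and the per-epoch evaluation/save strategy is an assumption based on the per-epoch results reported further down):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the table above; AdamW is the Trainer's default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hindi",   # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=500,
    fp16=True,                            # mixed precision
    gradient_checkpointing=True,
    evaluation_strategy="epoch",          # assumption; newer transformers versions name this eval_strategy
    save_strategy="epoch",
    predict_with_generate=True,           # needed for WER computed from generated text
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, processor, and datasets.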

Evaluation Results

In-domain (Conversational Hindi)

Epoch   Validation Loss   WER (%)
1       0.3356            39.24
2       0.3030            33.01
3       0.3120            31.51
4       0.3606            32.48

Baseline (pretrained Whisper-small): 48.26% WER
Best fine-tuned (epoch 3): 31.51% WER
Improvement: 16.75 percentage points (~35% relative reduction)
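
WER here is the standard word-level edit distance normalized by reference length. A minimal pure-Python implementation (an illustrative sketch, not the project's evaluation script, which in practice would also apply text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # 0.2 (1 substitution / 5 words)
```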

Cross-domain Generalization

The fine-tuned model was also evaluated on an external clean read-speech Hindi benchmark. It consistently outperformed the pretrained Whisper-small baseline on this out-of-domain set, indicating that fine-tuning improved general Hindi ASR capability rather than overfitting to the conversational domain.

Lattice-Based Consensus Evaluation

A novel lattice-style multi-system consensus framework was designed to evaluate six ASR hypotheses jointly via Levenshtein dynamic programming with gap modeling. Most models showed a lower WER against the lattice reference than against a single human transcript, indicating that some apparent errors were reference disagreements rather than true recognition mistakes.
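
A pairwise building block of such a framework is Levenshtein alignment with explicit gap symbols. The sketch below is a deliberate simplification (two sequences rather than six, and the `<gap>` token name is an assumption), but it shows the gap modeling idea on which a multi-system lattice can be built:

```python
GAP = "<gap>"

def align(a, b):
    """Levenshtein DP alignment of two token sequences with explicit gaps."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Trace back to recover aligned pairs; gaps mark insertions/deletions.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], GAP)); i -= 1
        else:
            pairs.append((GAP, b[j - 1])); j -= 1
    return pairs[::-1]

print(align(["वह", "घर", "गया"], ["वह", "गया"]))
```

Aligning all six hypotheses column-by-column in this way yields the lattice against which consensus WER can be scored.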


How to Use

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

processor = WhisperProcessor.from_pretrained("joshuavinith/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("joshuavinith/whisper-small-hindi")
model.eval()

# Load audio (must be 16kHz mono)
audio, sr = librosa.load("your_hindi_audio.wav", sr=16000, mono=True)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="hi",
        task="transcribe"
    )

transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)

Live Demo

Try the model directly in your browser, no code needed:

🔗 huggingface.co/spaces/joshuavinith/hindi-asr-demo


REST API (Docker)

A production-ready FastAPI wrapper is available:

git clone https://github.com/joshuvavinith/hindi-asr-whisper
cd hindi-asr-whisper
docker build -t hindi-asr-api .
docker run -p 8000:8000 hindi-asr-api

Then POST audio to http://localhost:8000/transcribe; interactive docs are served at /docs.


Limitations

  • Trained on conversational Hindi; performance may degrade on highly formal or domain-specific speech (medical, legal, etc.)
  • Transcript quality is affected by orthographic noise in the training transcripts (~65.5% of unique word types had spelling issues)
  • Model is Whisper-small (244M params); larger Whisper variants would likely yield lower WER
  • No speaker diarization; single-speaker transcription only

Citation

If you use this model or the associated pipeline in your work, please cite:

@misc{joshuvavinith2024hindiasr,
  author = {Joshuva Vinith},
  title  = {Hindi ASR: Dataset Construction, Whisper Fine-Tuning, Disfluency Detection, and Lattice Evaluation},
  year   = {2024},
  note   = {arXiv preprint in preparation}
}

Author

Joshuva Vinith
B.Tech β€” Artificial Intelligence & Data Science
📧 joshuavinith@gmail.com | 🔗 GitHub
