# whisper-small-hindi

Fine-tuned openai/whisper-small on approximately 10 hours of conversational Hindi speech, reducing WER from 48.26% to 31.51% on the in-domain evaluation set.
## Model Description
This model is a domain-adapted version of Whisper-small, fine-tuned for conversational Hindi ASR. The base model was adapted using a carefully engineered utterance-level dataset derived from 104 long-form Hindi recordings. Key contributions include a custom data pipeline, disfluency-aware dataset construction, and a novel lattice-based multi-system consensus evaluation framework.
- Model type: Whisper (encoder-decoder Transformer)
- Language: Hindi (hi)
- Task: Automatic Speech Recognition (ASR)
- Base model: openai/whisper-small (244M parameters)
- Fine-tuned by: Joshuva Vinith
- License: MIT
## Training Data
The training corpus was constructed from scratch from 104 long-form conversational Hindi recordings (~10 hours total). The pipeline involved:
- Programmatic URL reconstruction and audio retrieval
- JSON-aligned utterance-level segmentation using `pydub`
- Audio standardization to 16 kHz mono using `librosa`
- Removal of 209 redacted segments for privacy
- Final usable set: 5,732 utterance clips
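The segmentation and standardization steps above can be sketched as follows. `slice_utterances` and `to_16k_mono` are illustrative names rather than the pipeline's actual functions, and the linear-interpolation resampler is a toy stand-in for `librosa`'s resampling:

```python
import numpy as np

def slice_utterances(audio, sr, alignments):
    """Cut a long recording into utterance clips using JSON-style alignments.

    Each alignment is a dict with "start"/"end" in seconds and "text";
    entries flagged "redacted" are dropped for privacy.
    """
    clips = []
    for seg in alignments:
        if seg.get("redacted"):
            continue
        lo, hi = int(seg["start"] * sr), int(seg["end"] * sr)
        clips.append((audio[lo:hi], seg["text"]))
    return clips

def to_16k_mono(audio, sr, target_sr=16000):
    """Downmix to mono and resample by linear interpolation
    (a toy stand-in for librosa's resampler)."""
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    n_out = int(round(len(audio) * target_sr / sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

# One 3-second recording at 44.1 kHz with two aligned segments,
# one of which is redacted and therefore dropped.
sr = 44100
audio = np.zeros(3 * sr)
alignments = [
    {"start": 0.5, "end": 1.5, "text": "नमस्ते"},
    {"start": 2.0, "end": 2.5, "text": "", "redacted": True},
]
clips = slice_utterances(audio, sr, alignments)
clip_16k = to_16k_mono(clips[0][0], sr)
print(len(clips), len(clip_16k))  # 1 16000 (one kept clip, 1 s at 16 kHz)
```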
A unique-word spelling analysis across 7,457 transcript tokens found that 65.5% were orthographically noisy, directly informing the text normalization strategy.
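The normalization strategy itself is not detailed here. As one hedged illustration, a minimal cleanup pass for noisy Devanagari transcripts might apply Unicode NFC, strip punctuation (including the danda), and collapse whitespace; `normalize_hi` is a hypothetical helper, not the pipeline's actual code:

```python
import re
import unicodedata

def normalize_hi(text):
    """Hypothetical cleanup for noisy Hindi transcripts:
    Unicode NFC, punctuation/danda removal, whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[।॥,.?!\"';:()\-]", " ", text)  # drop common punctuation
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return text

print(normalize_hi("नमस्ते ,  आप कैसे   हैं ?।"))  # -> नमस्ते आप कैसे हैं
```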
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | whisper-small |
| Epochs | 4 (best at epoch 3) |
| Batch size | 8 |
| Learning rate | 1e-5 |
| Optimizer | AdamW |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Mixed precision | FP16 |
| Gradient checkpointing | Enabled |
| Framework | HuggingFace Seq2SeqTrainer |
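The table above maps directly onto `Seq2SeqTrainingArguments` in the HuggingFace stack. The sketch below shows an assumed configuration: the hyperparameter values come from the table, while `output_dir` and everything not listed in it are illustrative defaults, not settings taken from the actual run:

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters from the table above; output_dir is an illustrative
# assumption. AdamW is the Trainer's default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hindi",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=500,
    fp16=True,
    gradient_checkpointing=True,
    predict_with_generate=True,
)
```

These arguments would then be passed to `Seq2SeqTrainer` along with the model, processor, and datasets.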
## Evaluation Results

### In-domain (Conversational Hindi)
| Epoch | Validation Loss | WER (%) |
|---|---|---|
| 1 | 0.3356 | 39.24 |
| 2 | 0.3030 | 33.01 |
| 3 | 0.3120 | 31.51 |
| 4 | 0.3606 | 32.48 |
- Baseline (pretrained Whisper-small): 48.26% WER
- Best fine-tuned (epoch 3): 31.51% WER
- Improvement: 16.75 percentage points (~35% relative reduction)
### Cross-domain Generalization
The fine-tuned model was also evaluated on an external clean read-speech Hindi benchmark. It consistently outperformed the pretrained Whisper-small baseline on this out-of-domain set, demonstrating that fine-tuning improved general Hindi ASR capability beyond the conversational domain.
### Lattice-Based Consensus Evaluation
A novel lattice-style multi-system consensus framework was designed to evaluate 6 ASR hypotheses jointly via Levenshtein dynamic programming with gap modeling. Most models showed reduced WER under the lattice reference vs a single human transcript, indicating that some apparent errors were reference disagreements rather than true recognition mistakes.
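The full lattice framework is not reproduced here, but its core alignment step is standard word-level Levenshtein dynamic programming. A minimal WER computation built on that alignment looks like this (the multi-system consensus and gap-modeling extensions are omitted):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over word sequences.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution (रहा -> रही) over five reference words:
print(wer("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # -> 0.2
```

Under the lattice reference, a hypothesis word counts as correct if it matches any system's aligned token, which is how reference disagreements stop being scored as errors.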
## How to Use

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

processor = WhisperProcessor.from_pretrained("joshuavinith/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("joshuavinith/whisper-small-hindi")
model.eval()

# Load audio (must be 16 kHz mono)
audio, sr = librosa.load("your_hindi_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="hi",
        task="transcribe",
    )

transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)
```
## Live Demo

Try the model directly in your browser, no code needed:

huggingface.co/spaces/joshuavinith/hindi-asr-demo
## REST API (Docker)

A production-ready FastAPI wrapper is available:

```shell
git clone https://github.com/joshuvavinith/hindi-asr-whisper
cd hindi-asr-whisper
docker build -t hindi-asr-api .
docker run -p 8000:8000 hindi-asr-api
```

Then POST audio to http://localhost:8000/transcribe; interactive docs are served at /docs.
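With the container running, a request can be sent with `curl`. The multipart field name `file` is an assumption about the wrapper's API, so verify the actual request schema in the interactive docs at /docs:

```shell
# Assumes the container from the step above is listening on port 8000;
# the "file" field name is an assumption - check /docs for the real schema.
curl -X POST http://localhost:8000/transcribe \
     -F "file=@your_hindi_audio.wav"
```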
## Limitations

- Trained on conversational Hindi; performance may degrade on highly formal or domain-specific speech (medical, legal, etc.)
- Transcript quality is affected by orthographic noise in the training transcripts (~65.5% of unique word types had spelling issues)
- The model is Whisper-small (244M parameters); larger Whisper variants would likely yield lower WER
- No speaker diarization; single-speaker transcription only
## Citation

If you use this model or the associated pipeline in your work, please cite:

```bibtex
@misc{joshuvavinith2024hindiasr,
  author = {Joshuva Vinith},
  title  = {Hindi ASR: Dataset Construction, Whisper Fine-Tuning, Disfluency Detection, and Lattice Evaluation},
  year   = {2024},
  note   = {ArXiv preprint in preparation}
}
```
## Author

Joshuva Vinith
B.Tech, Artificial Intelligence & Data Science
joshuavinith@gmail.com | GitHub