whisper-small-hindi

Fine-tuned openai/whisper-small on approximately 10 hours of conversational Hindi speech, achieving a WER reduction from 48.26% → 31.51% on the in-domain evaluation set.


Model Description

This model is a domain-adapted version of Whisper-small, fine-tuned for conversational Hindi ASR. The base model was adapted using a carefully engineered utterance-level dataset derived from 104 long-form Hindi recordings. Key contributions include a custom data pipeline, disfluency-aware dataset construction, and a novel lattice-based multi-system consensus evaluation framework.

  • Model type: Whisper (encoder-decoder Transformer)
  • Language: Hindi (hi)
  • Task: Automatic Speech Recognition (ASR)
  • Base model: openai/whisper-small (244M parameters)
  • Fine-tuned by: Joshuva Vinith
  • License: MIT

Training Data

The training corpus was constructed from scratch, starting from 104 long-form conversational Hindi recordings (~10 hours total). The pipeline involved:

  • Programmatic URL reconstruction and audio retrieval
  • JSON-aligned utterance-level segmentation using pydub
  • Audio standardization to 16 kHz mono using librosa
  • Removal of 209 redacted segments for privacy
  • Final usable set: 5,732 utterance clips
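
The segmentation step above can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the JSON field names (`start`, `end`, `text`, `redacted`) are assumptions, and the demo operates on a synthetic NumPy waveform rather than real audio.

```python
import json
import numpy as np

SR = 16_000  # target sample rate (16 kHz mono, as in the pipeline)

def segment_utterances(audio: np.ndarray, alignment_json: str):
    """Slice a mono waveform into utterance clips using JSON alignments.

    Each alignment entry carries start/end times in seconds plus the
    transcript; redacted segments are dropped, mirroring the privacy
    filtering described above.
    """
    clips = []
    for seg in json.loads(alignment_json):
        if seg.get("redacted"):
            continue  # skip redacted segments
        start = int(seg["start"] * SR)
        end = int(seg["end"] * SR)
        clips.append((audio[start:end], seg["text"]))
    return clips

# Tiny demo: 3 seconds of silence with two fake alignment entries
audio = np.zeros(3 * SR, dtype=np.float32)
alignment = json.dumps([
    {"start": 0.0, "end": 1.0, "text": "नमस्ते"},
    {"start": 1.0, "end": 2.5, "text": "[redacted]", "redacted": True},
])
clips = segment_utterances(audio, alignment)
print(len(clips), len(clips[0][0]))  # 1 clip of 16000 samples
```

In the real pipeline, pydub handles the slicing directly on the audio files and librosa performs the 16 kHz mono standardization; the slicing logic is the same idea.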

A spelling analysis of 7,457 unique transcript word types found that 65.5% were orthographically noisy, which directly informed the text normalization strategy.
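
One simple way such an analysis can work is sketched below. The noise heuristic (flagging tokens that mix Devanagari with Latin letters or digits) is an illustrative assumption, not the exact criterion used in this project.

```python
import re

# Hypothetical noise heuristic: a token is "orthographically noisy" if it
# mixes Devanagari characters with Latin letters or ASCII digits.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]")

def is_noisy(token: str) -> bool:
    return bool(DEVANAGARI.search(token) and LATIN_OR_DIGIT.search(token))

tokens = ["नमस्ते", "नमस्ते2", "hello", "क्या123"]
unique = sorted(set(tokens))
noisy = [t for t in unique if is_noisy(t)]
print(noisy)
```

A real normalization pass would then map each noisy type to a canonical spelling before computing WER.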


Training Configuration

Parameter                Value
Base model               whisper-small
Epochs                   4 (best at epoch 3)
Batch size               8
Learning rate            1e-5
Optimizer                AdamW
Weight decay             0.01
Warmup steps             500
Mixed precision          FP16
Gradient checkpointing   Enabled
Framework                HuggingFace Seq2SeqTrainer
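
Under these settings, the trainer configuration might look like the sketch below (not the exact training script; `output_dir` is a placeholder, and the per-epoch evaluation/save strategy is an assumption based on the per-epoch results reported further down):

```python
from transformers import Seq2SeqTrainingArguments

# Mirrors the table above; AdamW is the Trainer's default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hindi",   # placeholder path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.01,
    warmup_steps=500,
    fp16=True,                            # mixed precision
    gradient_checkpointing=True,
    evaluation_strategy="epoch",          # assumption; newer transformers versions name this eval_strategy
    save_strategy="epoch",
    predict_with_generate=True,           # needed for WER computed from generated text
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, processor, and datasets.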

Evaluation Results

In-domain (Conversational Hindi)

Epoch   Validation Loss   WER (%)
1       0.3356            39.24
2       0.3030            33.01
3       0.3120            31.51
4       0.3606            32.48

Baseline (pretrained Whisper-small): 48.26% WER
Best fine-tuned (epoch 3): 31.51% WER
Improvement: 16.75 percentage points (~35% relative reduction)
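
WER here is the standard word-level edit distance normalized by reference length. A minimal pure-Python implementation (an illustrative sketch, not the project's evaluation script, which in practice would also apply text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # 0.2 (1 substitution / 5 words)
```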

Cross-domain Generalization

The fine-tuned model was also evaluated on an external clean read-speech Hindi benchmark. It consistently outperformed the pretrained Whisper-small baseline on this out-of-domain set, indicating that fine-tuning improved general Hindi ASR capability rather than overfitting to the conversational domain.

Lattice-Based Consensus Evaluation

A novel lattice-style multi-system consensus framework was designed to evaluate six ASR hypotheses jointly via Levenshtein dynamic programming with gap modeling. Most models showed a lower WER against the lattice reference than against a single human transcript, indicating that some apparent errors were reference disagreements rather than true recognition mistakes.
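
A pairwise building block of such a framework is Levenshtein alignment with explicit gap symbols. The sketch below is a deliberate simplification (two sequences rather than six, and the `<gap>` token name is an assumption), but it shows the gap modeling idea on which a multi-system lattice can be built:

```python
GAP = "<gap>"

def align(a, b):
    """Levenshtein DP alignment of two token sequences with explicit gaps."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Trace back to recover aligned pairs; gaps mark insertions/deletions.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], GAP)); i -= 1
        else:
            pairs.append((GAP, b[j - 1])); j -= 1
    return pairs[::-1]

print(align(["वह", "घर", "गया"], ["वह", "गया"]))
```

Aligning all six hypotheses column-by-column in this way yields the lattice against which consensus WER can be scored.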


How to Use

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import torch

processor = WhisperProcessor.from_pretrained("joshuavinith/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("joshuavinith/whisper-small-hindi")
model.eval()

# Load audio (must be 16kHz mono)
audio, sr = librosa.load("your_hindi_audio.wav", sr=16000, mono=True)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs["input_features"],
        language="hi",
        task="transcribe"
    )

transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcript)

Live Demo

Try the model directly in your browser, no code needed:

🔗 huggingface.co/spaces/joshuavinith/hindi-asr-demo


REST API (Docker)

A production-ready FastAPI wrapper is available:

git clone https://github.com/joshuvavinith/hindi-asr-whisper
cd hindi-asr-whisper
docker build -t hindi-asr-api .
docker run -p 8000:8000 hindi-asr-api

Then POST audio to http://localhost:8000/transcribe; interactive docs are served at /docs.


Limitations

  • Trained on conversational Hindi; performance may degrade on highly formal or domain-specific speech (medical, legal, etc.)
  • Transcript quality is affected by orthographic noise in the training transcripts (~65.5% of unique word types had spelling issues)
  • Model is Whisper-small (244M params); larger Whisper variants would likely yield lower WER
  • No speaker diarization; single-speaker transcription only

Citation

If you use this model or the associated pipeline in your work, please cite:

@misc{joshuvavinith2024hindiasr,
  author = {Joshuva Vinith},
  title  = {Hindi ASR: Dataset Construction, Whisper Fine-Tuning, Disfluency Detection, and Lattice Evaluation},
  year   = {2024},
  note   = {arXiv preprint in preparation}
}

Author

Joshuva Vinith
B.Tech β€” Artificial Intelligence & Data Science
📧 joshuavinith@gmail.com | 🔗 GitHub
