Whisper Small - Hindi Fine-Tuned

This model is a fine-tuned version of openai/whisper-small for Hindi Speech-to-Text (ASR) applications. It was fine-tuned on the hi_in configuration of the Google FLEURS dataset.

The fine-tuning process significantly improved the transcription accuracy, reducing the Word Error Rate (WER) from 68.61% (baseline) down to 26.87%, and the Character Error Rate (CER) from 34.43% down to 10.41%.

Model Details

  • Base Model: openai/whisper-small (244M parameters)
  • Language: Hindi (hi)
  • Task: Automatic Speech Recognition (ASR)
  • Dataset: Google FLEURS (Hindi - India)
  • License: MIT

Evaluation Results

The model was evaluated on the strictly held-out test split (418 samples) of the FLEURS Hindi dataset.

| Metric | Whisper Small (Base) | Whisper Small (Fine-Tuned) | Relative Improvement |
|---|---|---|---|
| Word Error Rate (WER) | 68.61% | 26.87% | ↓ 60.8% |
| Character Error Rate (CER) | 34.43% | 10.41% | ↓ 69.8% |

These results demonstrate a successful adaptation to Hindi phonetics and the Devanagari script, with a substantial reduction in transcription errors.
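
Both metrics are edit-distance ratios, computed over words (WER) and characters (CER). A minimal pure-Python sketch illustrates the computation; this is for intuition only and is not the exact evaluation script behind the numbers above:

```python
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance between two sequences,
    # using a single rolling row for O(len(hyp)) memory.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference, hypothesis):
    # Word Error Rate: edits between word sequences / reference word count.
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: edits between character sequences / reference length.
    return levenshtein(list(reference), list(hypothesis)) / len(reference)
```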

Usage

You can use this model directly with the Hugging Face transformers library:

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa

# Load the fine-tuned model and processor
model_id = "rishii100/whisper-small-hindi"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load your Hindi audio file (resampled to 16 kHz, Whisper's expected rate)
audio_path = "path/to/your/hindi_audio.wav"
audio, sr = librosa.load(audio_path, sr=16000)

# Process and generate transcription
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    predicted_ids = model.generate(input_features, max_length=225)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
```
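
Note that Whisper's encoder operates on a fixed 30-second window, so longer recordings should be split before calling generate. A minimal sketch (the 30-second chunk length is Whisper's standard window; the helper name is ours):

```python
def chunk_audio(audio, sample_rate=16000, chunk_seconds=30):
    # Split a 1-D waveform into consecutive fixed-length chunks;
    # the final chunk may be shorter and is padded by the processor.
    chunk_len = sample_rate * chunk_seconds
    return [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
```

Each chunk can then be run through the processor/generate steps above and the per-chunk transcripts concatenated.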

Training Details

The model was trained using the following hyperparameters:

  • Learning Rate: 1e-05
  • Train Batch Size: 16
  • Eval Batch Size: 8
  • Training Steps: 2000
  • Warmup Steps: 250
  • Optimizer: AdamW
  • Mixed Precision: FP16
  • Gradient Accumulation: 1
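
With 250 warmup steps out of 2000, the learning rate ramps linearly up to 1e-05 and then, assuming the transformers Trainer's default linear scheduler (the source does not state the scheduler), decays linearly back to zero:

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=250, total_steps=2000):
    # Linear warmup to peak_lr, then linear decay to zero
    # (the "linear" schedule that transformers' Trainer uses by default).
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```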

Intended Use & Limitations

This model is ideal for transcribing general Hindi speech. However, like all speech models, it may experience performance degradation in the following scenarios:

  • High background noise or overlapping speakers.
  • Heavy regional dialects not represented in the standard FLEURS corpus.
  • Extensive code-switching between Hindi and English (Hinglish), especially where English words are pronounced with strong Hindi accents.