Whisper Small Hindi

This model is based on openai/whisper-small and was used as part of the
Josh Talks – AI Researcher Intern (Speech & Audio) assignment.

The project focused on:

  • Hindi Automatic Speech Recognition (ASR)
  • Disfluency detection in spoken Hindi
  • Data preprocessing, evaluation, and deployment workflows

Model description

This repository hosts a Whisper Small ASR model used for Hindi speech recognition experiments. The work demonstrates an end-to-end ASR pipeline including preprocessing, inference, evaluation, and Hugging Face deployment.

In addition to transcription, the project includes disfluency detection, where common Hindi fillers and hesitations were identified and corresponding audio segments were extracted.


Intended uses & limitations

Intended uses

  • Hindi speech-to-text transcription
  • ASR experimentation and learning
  • Studying disfluencies in conversational Hindi
  • Benchmarking Whisper performance on small, task-specific datasets
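
For the transcription use case, a minimal sketch of inference with the Transformers pipeline could look like the following. It assumes the repository id Satyam0077/whisper-small-hindi loads as a standard Whisper checkpoint and that ffmpeg is available for decoding audio files; the helper names generate_kwargs and transcribe are illustrative and are not part of this repository.

```python
MODEL_ID = "Satyam0077/whisper-small-hindi"  # assumed to load as a standard Whisper checkpoint


def generate_kwargs(language: str = "hindi") -> dict:
    """Decoding options that pin Whisper to transcription in one language."""
    return {"language": language, "task": "transcribe"}


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file and return the recognized text."""
    from transformers import pipeline  # lazy import: heavy dependency

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=generate_kwargs())
    return result["text"]
```

Pinning the language avoids Whisper's automatic language detection, which can misfire on short or noisy clips.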

Limitations

  • Not trained on large-scale Hindi corpora
  • GPU-based fine-tuning produced a much higher WER than the CPU run; this is attributed to half-precision (fp16) computation and decoding behavior
  • Not suitable for production or safety-critical ASR systems

Training and evaluation data

The project used custom Hindi speech data provided as part of the Josh Talks assignment, including:

  • Conversational Hindi speech
  • Disfluency-rich recordings
  • Cleaned and normalized transcripts

Transcripts and metadata were managed in CSV and JSON files, and audio was processed with Librosa.


Training procedure

Preprocessing

  • Audio resampled to 16 kHz
  • Transcripts lowercased and normalized
  • Punctuation removed
  • Silence and background noise trimmed
  • Hindi disfluency keywords matched from a curated list
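
The preprocessing steps above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used in the assignment: the function names and the top_db threshold are hypothetical, and the precise normalization rules are not documented in this card.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    # NFC keeps Devanagari letters and their combining signs in a consistent form.
    text = unicodedata.normalize("NFC", text).lower()
    # Remove only characters in Unicode punctuation categories (P*). This covers
    # the danda "।" but leaves Devanagari vowel signs (mark categories) intact.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()


def load_audio_16k(path: str, top_db: float = 30.0):
    """Load a file as 16 kHz mono and trim leading/trailing silence."""
    import librosa  # imported lazily so the text helper works without it

    audio, sr = librosa.load(path, sr=16000, mono=True)
    trimmed, _ = librosa.effects.trim(audio, top_db=top_db)  # top_db is an assumed threshold
    return trimmed, sr
```

Note that librosa.effects.trim only removes quiet audio at the start and end of a clip; any background-noise reduction would need a separate step that is not shown here.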

Training hyperparameters

The following settings were used during experimentation:

  • learning_rate: 1e-05
  • optimizer: Adam
  • precision: fp32 (CPU), fp16 (GPU)
  • framework: Hugging Face Transformers (Trainer-based workflow)
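
As a rough sketch, these settings could be expressed for a Trainer-based workflow as shown below. The output directory and predict_with_generate are assumptions added for illustration. The optimizer is left at the Trainer default, which is AdamW and may differ from the "Adam" listed above.

```python
def training_kwargs(use_gpu: bool) -> dict:
    """Keyword arguments mirroring the settings listed above."""
    return {
        "output_dir": "whisper-small-hindi-run",  # hypothetical path
        "learning_rate": 1e-5,
        "fp16": use_gpu,  # fp16 on GPU, fp32 on CPU
        "predict_with_generate": True,  # assumption: decode during evaluation so WER can be computed
    }


def build_training_args(use_gpu: bool):
    """Build Seq2SeqTrainingArguments from the settings above."""
    from transformers import Seq2SeqTrainingArguments  # lazy import: heavy dependency

    return Seq2SeqTrainingArguments(**training_kwargs(use_gpu))
```

In practice use_gpu would typically be set from torch.cuda.is_available().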

Evaluation results (reference)

Model Variant                        WER
Whisper Small (Pretrained)           0.22
Whisper Small (Fine-tuned – CPU)     0.00
Whisper Small (Fine-tuned – GPU)     69.41

The GPU run showed a much higher WER than the CPU run, which is attributed to half-precision (fp16) computation and decoding differences.
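
For reference, word error rate counts word-level substitutions, deletions, and insertions and divides by the number of reference words. The dependency-free sketch below illustrates the metric; it is not the evaluation script used for the numbers above, which were presumably computed with the Evaluate library listed under Framework versions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            )
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction, so multiply by 100 for a percentage. The table above does not state which scale its values use, so the three rows may not be directly comparable without checking the original evaluation output.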


Disfluency detection (related task)

Disfluencies were detected using:

  • A list of 67 Hindi disfluency keywords (e.g., “उह”, “हम्म”, “अं”, “अच्छ”)
  • Keyword matching on normalized transcripts
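
A minimal sketch of the keyword matching step is shown below. It assumes whole-token matching on whitespace-separated words; the card does not say whether the original code matched whole tokens or substrings. Only the four example keywords quoted above are included, and the function name is illustrative.

```python
# Only the four examples quoted above; the full 67-keyword list is not reproduced here.
DISFLUENCY_KEYWORDS = {"उह", "हम्म", "अं", "अच्छ"}


def find_disfluencies(transcript: str, keywords=DISFLUENCY_KEYWORDS) -> list[tuple[int, str]]:
    """Return (word_index, keyword) for each whitespace-separated token that is a keyword."""
    return [(i, tok) for i, tok in enumerate(transcript.split()) if tok in keywords]
```

For example, find_disfluencies("उह मैं कल हम्म बाजार गया") flags the tokens at word positions 0 and 3.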

For each detected disfluency:

  • A 3-second audio segment was extracted
  • Saved as recording_id_disfluency.wav
  • All audio processed at 16 kHz using Librosa
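
The segment extraction can be sketched as below. The card does not describe how the time of each disfluency was located, so this example assumes a centre time in seconds is already known and places the 3-second window around it, shifting the window where needed so it stays inside the recording. All names here are hypothetical.

```python
SAMPLE_RATE = 16_000
SEGMENT_SECONDS = 3.0


def segment_bounds(center_s: float, total_s: float, length_s: float = SEGMENT_SECONDS) -> tuple[float, float]:
    """Return (start, end) in seconds for a window centred on center_s, kept inside the recording."""
    if total_s <= length_s:
        return 0.0, total_s  # recording shorter than the window: use all of it
    start = min(max(center_s - length_s / 2, 0.0), total_s - length_s)
    return start, start + length_s


def cut_segment(samples, center_s: float, sample_rate: int = SAMPLE_RATE):
    """Slice the window out of a 1-D sample sequence (list or NumPy array)."""
    start_s, end_s = segment_bounds(center_s, len(samples) / sample_rate)
    return samples[int(round(start_s * sample_rate)):int(round(end_s * sample_rate))]
```

At 16 kHz a full 3-second segment contains 48,000 samples.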

The output was stored in CSV format with the following fields:

  • recording_id
  • disfluency_found
  • segment_path
  • timestamp_start
  • timestamp_end
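
Writing this output with the standard library could look like the sketch below. The column names come from the list above; the row values are made up for illustration, and write_rows is a hypothetical helper.

```python
import csv
import io

FIELDS = ["recording_id", "disfluency_found", "segment_path", "timestamp_start", "timestamp_end"]


def write_rows(rows: list[dict], handle) -> None:
    """Write detection rows using the documented column order."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Example with made-up values, written to an in-memory buffer.
buffer = io.StringIO()
write_rows(
    [{
        "recording_id": "rec001",
        "disfluency_found": "उह",
        "segment_path": "rec001_disfluency.wav",
        "timestamp_start": 8.5,
        "timestamp_end": 11.5,
    }],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" and encoding="utf-8" so that the Devanagari text and line endings are preserved.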

Framework versions

  • Transformers
  • PyTorch
  • Librosa
  • Datasets
  • Evaluate

Notes

This model repository is intended for educational and experimental purposes. Fine-tuned checkpoints, additional benchmarks, and demo applications may be added in future versions.

