Whisper Small Hindi

This model is based on openai/whisper-small and was used as part of the
Josh Talks – AI Researcher Intern (Speech & Audio) assignment.

The project focused on:

Hindi Automatic Speech Recognition (ASR)
Disfluency detection in spoken Hindi
Data preprocessing, evaluation, and deployment workflows

Model description

This repository hosts a Whisper Small ASR model used for Hindi speech recognition experiments. The work demonstrates an end-to-end ASR pipeline including preprocessing, inference, evaluation, and Hugging Face deployment.
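The inference step of such a pipeline can be sketched with the Transformers `pipeline` API. This is a minimal illustration, not the project's actual script: the function name `transcribe` is hypothetical, and the default checkpoint is the base model named above (this repository's ID could be passed as `model_id` instead).

```python
def transcribe(audio_path: str, model_id: str = "openai/whisper-small") -> str:
    """Transcribe one audio file to Hindi text with a Whisper checkpoint.

    `model_id` defaults to the base model; swapping in another Whisper
    checkpoint is an assumption about how the reader wants to use this.
    """
    # Imported lazily so this module loads even without Transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # Ask Whisper for Hindi transcription rather than relying on
    # automatic language detection.
    result = asr(
        audio_path,
        generate_kwargs={"language": "hindi", "task": "transcribe"},
    )
    return result["text"]
```

Calling `transcribe("sample.wav")` (a hypothetical file name) returns the recognized text as a string. Decoding a file path this way requires ffmpeg to be available on the system.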

In addition to transcription, the project includes disfluency detection, where common Hindi fillers and hesitations were identified and corresponding audio segments were extracted.

Intended uses & limitations

Intended uses

Hindi speech-to-text transcription
ASR experimentation and learning
Studying disfluencies in conversational Hindi
Benchmarking Whisper performance on small, task-specific datasets

Limitations

Not trained on large-scale Hindi corpora
GPU-based fine-tuning produced a much higher WER, attributed to half-precision (fp16) computation and decoding behavior
Not suitable for production or safety-critical ASR systems

Training and evaluation data

The project used custom Hindi speech data provided as part of the Josh Talks assignment, including:

Conversational Hindi speech
Disfluency-rich recordings
Cleaned and normalized transcripts

Transcripts and metadata were managed in CSV and JSON files, and audio was processed with Librosa.

Training procedure

Preprocessing

Audio resampled to 16 kHz
Transcripts lowercased and normalized
Punctuation removed
Silence and background noise trimmed
Hindi disfluency keywords matched from a curated list
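The preprocessing steps above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the project's code: the function names and the `top_db` threshold are hypothetical, and `librosa.effects.trim` only removes leading and trailing silence, so it does not cover background-noise removal.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Replace every Unicode punctuation character (including the Devanagari
    # danda) with a space. Vowel signs and the virama are combining marks,
    # not punctuation, so Hindi words stay intact.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(cleaned.split())


def load_audio_16k(path: str, top_db: float = 30.0):
    """Load audio as 16 kHz mono and trim leading/trailing silence.

    `top_db` is an assumed threshold; the source does not state one.
    """
    import librosa  # imported lazily; only needed for the audio step

    samples, sample_rate = librosa.load(path, sr=16000, mono=True)
    trimmed, _ = librosa.effects.trim(samples, top_db=top_db)
    return trimmed, sample_rate


print(normalize_transcript("Hello, World!  नमस्ते।"))
```

Lowercasing has no effect on Devanagari text, which has no letter case; it matters only for Latin-script words mixed into the transcripts.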

Training hyperparameters

The following settings were used during experimentation:

learning_rate: 1e-05
optimizer: Adam
precision: fp32 (CPU), fp16 (GPU)
framework: Hugging Face Transformers (Trainer-based workflow)
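As a rough sketch, the settings above could be collected into keyword arguments for the Transformers training-arguments class. The helper name `training_kwargs` and the output directory are hypothetical; only the learning rate and the CPU/GPU precision split come from the list above. Batch size, step counts, and other values are not stated in the source and are left at library defaults here.

```python
def training_kwargs(use_gpu: bool, output_dir: str = "whisper-small-hindi-run") -> dict:
    """Build keyword arguments for Seq2SeqTrainingArguments.

    `output_dir` is a placeholder path, not the one used in the project.
    """
    return {
        "output_dir": output_dir,
        "learning_rate": 1e-5,  # from the hyperparameter list
        "fp16": use_gpu,        # fp16 on GPU, full fp32 precision on CPU
        # The source lists "Adam" as the optimizer. The Trainer's default
        # is an AdamW variant, so no optimizer is set explicitly here.
    }


print(training_kwargs(use_gpu=False))
```

With Transformers installed, these could be passed as `Seq2SeqTrainingArguments(**training_kwargs(use_gpu=torch.cuda.is_available()))`.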

Evaluation results (reference)

Model Variant	WER
Whisper Small (Pretrained)	0.22
Whisper Small (Fine-tuned – CPU)	0.00
Whisper Small (Fine-tuned – GPU)	69.41

The much higher WER of the GPU run is attributed to half-precision (fp16) computation and differences in decoding behavior.
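For reference, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The pure-Python sketch below illustrates the metric itself; it is not the Evaluate library's implementation, and the function name is hypothetical. It returns a fraction (multiply by 100 for a percentage); the table does not state which scale each row uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # one substitution, one deletion
```

Because insertions count as errors, WER can exceed 1.0 (100%) when the hypothesis is much longer than the reference.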

Disfluency detection (related task)

Disfluencies were detected using:

A list of 67 Hindi disfluency keywords (e.g., “उह”, “हम्म”, “अं”, “अच्छ”)
Keyword matching on normalized transcripts
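Keyword matching of this kind can be sketched in a few lines. The example assumes whole-token matching on whitespace-separated words; the source does not say whether tokens or substrings were matched. The sample set contains only three of the keywords quoted above, not the full list of 67.

```python
# A small sample of the keywords quoted above, not the full curated list.
SAMPLE_KEYWORDS = {"उह", "हम्म", "अं"}


def find_disfluencies(transcript: str, keywords: set[str]) -> list[tuple[int, str]]:
    """Return (word_index, keyword) for every token that is a known keyword.

    Assumes the transcript is already normalized (punctuation removed).
    """
    return [
        (index, token)
        for index, token in enumerate(transcript.split())
        if token in keywords
    ]


print(find_disfluencies("उह मैं कल हम्म गया", SAMPLE_KEYWORDS))
```

Whole-token matching avoids flagging ordinary words that merely contain a filler as a substring, at the cost of missing fillers attached to other text.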

For each detected disfluency:

A 3-second audio segment was extracted
Saved as recording_id_disfluency.wav
All audio processed at 16 kHz using Librosa
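The segment extraction can be sketched as a sample-index calculation. The source does not say how each disfluency's start time was obtained, so the sketch below takes it as a given input; the function names are hypothetical, and the window simply starts at that time and is clipped at the end of the recording.

```python
SAMPLE_RATE = 16_000    # Hz, as stated above
SEGMENT_SECONDS = 3.0   # segment length, as stated above


def segment_bounds(
    start_seconds: float,
    total_samples: int,
    sample_rate: int = SAMPLE_RATE,
    duration: float = SEGMENT_SECONDS,
) -> tuple[int, int]:
    """Return (start, end) sample indices, clipped to the recording length."""
    start = int(round(start_seconds * sample_rate))
    start = max(0, min(start, total_samples))
    end = min(start + int(round(duration * sample_rate)), total_samples)
    return start, end


def extract_segment(samples, start_seconds: float):
    """Slice a 3-second window out of a sequence of audio samples."""
    start, end = segment_bounds(start_seconds, len(samples))
    return samples[start:end]


print(segment_bounds(1.0, total_samples=160_000))
```

The same slicing works on the NumPy array returned by `librosa.load`. The resulting segment could then be written to a WAV file with a library such as `soundfile`.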

The output was stored in CSV format with the following fields:

recording_id
disfluency_found
segment_path
timestamp_start
timestamp_end
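A CSV with these fields could be written with Python's standard `csv` module, as in the sketch below. The row values are made up for illustration, and the segment file name assumes the recording ID is substituted into the naming pattern described above.

```python
import csv
import io

FIELDNAMES = [
    "recording_id",
    "disfluency_found",
    "segment_path",
    "timestamp_start",
    "timestamp_end",
]


def write_rows(rows, handle) -> None:
    """Write a header line followed by one CSV line per detected disfluency."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example row; none of these values come from the real data.
example_row = {
    "recording_id": "rec_001",
    "disfluency_found": "उह",
    "segment_path": "rec_001_disfluency.wav",
    "timestamp_start": 1.0,
    "timestamp_end": 4.0,
}

buffer = io.StringIO()
write_rows([example_row], buffer)
print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` keeps line endings and Devanagari text intact.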

Framework versions

Transformers
PyTorch
Librosa
Datasets
Evaluate

Notes

This model repository is intended for educational and experimental purposes. Fine-tuned checkpoints, additional benchmarks, and demo applications may be added in future versions.
