Whisper Small Hindi
This model is based on openai/whisper-small and was used as part of the
Josh Talks – AI Researcher Intern (Speech & Audio) assignment.
The project focused on:
- Hindi Automatic Speech Recognition (ASR)
- Disfluency detection in spoken Hindi
- Data preprocessing, evaluation, and deployment workflows
Model description
This repository hosts a Whisper Small ASR model used for Hindi speech recognition experiments. The work demonstrates an end-to-end ASR pipeline including preprocessing, inference, evaluation, and Hugging Face deployment.
In addition to transcription, the project includes disfluency detection, where common Hindi fillers and hesitations were identified and corresponding audio segments were extracted.
Intended uses & limitations
Intended uses
- Hindi speech-to-text transcription
- ASR experimentation and learning
- Studying disfluencies in conversational Hindi
- Benchmarking Whisper performance on small, task-specific datasets
Limitations
- Not trained on large-scale Hindi corpora
- The GPU fine-tuning run produced a much higher WER than the CPU run, which is attributed to half-precision (fp16) computation and decoding behavior
- Not suitable for production or safety-critical ASR systems
Training and evaluation data
The project used custom Hindi speech data provided as part of the Josh Talks assignment, including:
- Conversational Hindi speech
- Disfluency-rich recordings
- Cleaned and normalized transcripts
Audio and transcripts were managed using CSV and JSON files and processed with Librosa.
Training procedure
Preprocessing
- Audio resampled to 16 kHz
- Transcripts lowercased and normalized
- Punctuation removed
- Silence and background noise trimmed
- Hindi disfluency keywords matched from a curated list
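The transcript steps above (lowercasing, normalization, punctuation removal) can be sketched in plain Python. This is a minimal illustration, not the project's preprocessing code: the function name `normalize_transcript` is hypothetical, and using Unicode categories to identify punctuation is an assumption. Audio resampling is not shown here; with Librosa, passing `sr=16000` to `librosa.load` resamples on load.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (hypothetical helper)."""
    # NFC keeps base characters and combining marks in a consistent composed form.
    # Lowercasing only affects scripts with case, e.g. embedded English words.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace every Unicode punctuation character (general category P*) with a
    # space. This covers ASCII punctuation and the Devanagari danda, while
    # leaving vowel signs and the virama (categories Mc/Mn) untouched.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse the runs of whitespace left behind by the replacements.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("अच्छा, तो  हम्म... ठीक है।"))
```

Replacing punctuation with a space rather than deleting it avoids gluing together two words that were separated only by a punctuation mark.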
Training hyperparameters
The following settings were used during experimentation:
- learning_rate: 1e-05
- optimizer: Adam
- precision: fp32 (CPU), fp16 (GPU)
- framework: Hugging Face Transformers (Trainer-based workflow)
Evaluation results (reference)
| Model Variant | WER |
|---|---|
| Whisper Small (Pretrained) | 0.22 |
| Whisper Small (Fine-tuned – CPU) | 0.00 |
| Whisper Small (Fine-tuned – GPU) | 69.41 |
The GPU run showed a much higher WER than the CPU run, which is attributed to half-precision (fp16) computation and differences in decoding.
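For reference, word error rate is the word-level edit distance between hypothesis and reference (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain-Python stand-in for illustration, not the evaluation code used in this project; the function name `word_error_rate` is hypothetical. It returns a fraction, so multiply by 100 for a percentage. The table above does not state which convention it uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.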
Disfluency detection (related task)
Disfluencies were detected using:
- A list of 67 Hindi disfluency keywords (e.g., “उह”, “हम्म”, “अं”, “अच्छ”)
- Keyword matching on normalized transcripts
For each detected disfluency:
- A 3-second audio segment was extracted
- Saved as recording_id_disfluency.wav
- All audio processed at 16 kHz using Librosa
The output was stored in CSV format with the following fields:
- recording_id
- disfluency_found
- segment_path
- timestamp_start
- timestamp_end
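The detection flow described above can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by this card: keywords are matched as whole whitespace-separated tokens, the 3-second window is centered on a timestamp supplied by the caller (the `center` parameter is hypothetical, since the card does not say how timestamps were obtained), and `recording_id` in the file name is treated as a placeholder for the actual ID. The keyword set is a small subset of the examples listed, not the full list of 67.

```python
import csv
import io

# Illustrative subset only; the project's full list has 67 entries.
DISFLUENCY_KEYWORDS = {"उह", "हम्म", "अं"}
SEGMENT_SECONDS = 3.0
FIELDS = ["recording_id", "disfluency_found", "segment_path",
          "timestamp_start", "timestamp_end"]


def find_disfluencies(transcript: str, keywords=DISFLUENCY_KEYWORDS) -> list[str]:
    """Return keywords that occur as whole tokens, in transcript order."""
    return [token for token in transcript.split() if token in keywords]


def segment_window(center: float, duration: float,
                   length: float = SEGMENT_SECONDS) -> tuple[float, float]:
    """Return a (start, end) window of `length` seconds clipped to the recording."""
    start = max(0.0, center - length / 2)
    end = min(duration, start + length)
    # If the window was clipped at the end, shift the start back to keep the length.
    start = max(0.0, end - length)
    return start, end


def build_rows(recording_id: str, transcript: str,
               center: float, duration: float) -> list[dict]:
    """Build one output row per detected disfluency (assumed row layout)."""
    start, end = segment_window(center, duration)
    return [
        {
            "recording_id": recording_id,
            "disfluency_found": word,
            "segment_path": f"{recording_id}_disfluency.wav",
            "timestamp_start": start,
            "timestamp_end": end,
        }
        for word in find_disfluencies(transcript)
    ]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text using the field order listed above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows("rec001", "तो उह मैं हम्म गया", center=10.0, duration=60.0)
print(rows_to_csv(rows))
```

With audio loaded as a 16 kHz array `y`, the matching samples for a window would be `y[int(start * 16000):int(end * 16000)]`.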
Framework versions
- Transformers
- PyTorch
- Librosa
- Datasets
- Evaluate
Notes
This model repository is intended for educational and experimental purposes. Fine-tuned checkpoints, additional benchmarks, and demo applications may be added in future versions.