---
language:
- hi
license: apache-2.0
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- hindi
- whisper
- fine-tuned
- conversational-speech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
VaaniAI: Whisper-small Fine-tuned on Hindi Conversational Speech
This model is a fine-tuned version of openai/whisper-small, trained on real-world
Hindi conversational audio collected from 102 speakers across India. It was built
as part of an AI Researcher Intern assignment at Josh Talks.
Model Performance
| Metric | Value |
|---|---|
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | 0.4028 |
| WER Improvement | 67.8% relative reduction |
| Post-processing WER gain | additional 27.7% reduction |
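
A baseline WER above 1.0 can look surprising, so here is a minimal sketch of how word error rate and a relative reduction are commonly computed. It is a generic word-level edit-distance implementation, not necessarily the evaluation code used for this model; the function names `word_error_rate` and `relative_reduction` are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of iteration i, prev[j] is the distance between
    # ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from baseline to improved (0.5 means the error halved)."""
    return (baseline - improved) / baseline


# Insertions count as errors, so WER exceeds 1.0 when the hypothesis
# contains many extra words (for example, a repetition loop).
print(word_error_rate("a b", "a x y z b"))            # 3 insertions / 2 words
print(round(relative_reduction(1.2537, 0.4028), 3))   # from the table values
```

With the rounded values in the table, `relative_reduction` gives about 0.679, close to the reported 67.8%.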
Training Data
| Property | Value |
|---|---|
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Segments (after cleaning) | 4,442 |
| Raw segments | 5,941 |
| Train / Val split | 4,093 / 349 |
Cleaning steps applied:
- Removed 209 REDACTED-label segments
- Removed 1,012 sub-1-second clips
- Removed 878 segments with fewer than 5 characters
- Resampled all audio from 44,100 Hz to 16,000 Hz
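
The filtering rules above can be sketched as a small predicate over segment records. This is an assumption-laden illustration: the field names `text` and `duration_s`, and the idea that redacted segments carry the literal string `REDACTED` in their text, are hypothetical and not taken from the actual dataset schema.

```python
from collections import Counter

MIN_DURATION_S = 1.0  # clips shorter than one second are dropped
MIN_CHARS = 5         # transcripts with fewer than 5 characters are dropped


def rejection_reasons(segment: dict) -> list[str]:
    """Return every cleaning rule the segment violates (empty list = keep)."""
    reasons = []
    text = segment["text"].strip()
    if "REDACTED" in text:  # assumed representation of the REDACTED label
        reasons.append("redacted")
    if segment["duration_s"] < MIN_DURATION_S:
        reasons.append("too_short_audio")
    # len() counts Unicode code points, not visual characters.
    if len(text) < MIN_CHARS:
        reasons.append("too_short_text")
    return reasons


def clean(segments: list[dict]) -> tuple[list[dict], Counter]:
    """Keep segments that violate no rule and tally violations per rule."""
    kept, stats = [], Counter()
    for seg in segments:
        reasons = rejection_reasons(seg)
        stats.update(reasons)
        if not reasons:
            kept.append(seg)
    return kept, stats


segments = [
    {"text": "नमस्ते आप कैसे हैं", "duration_s": 2.4},
    {"text": "हाँ", "duration_s": 0.6},       # fails two rules at once
    {"text": "REDACTED", "duration_s": 3.0},
]
kept, stats = clean(segments)
print(len(kept), dict(stats))
```

Because one segment can fail several rules, per-rule counts tallied this way can sum to more than the number of segments actually removed. That may explain why the listed removals total 2,099 while the raw and cleaned counts differ by 1,499, though the card does not confirm this.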
Training Configuration
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (batch size 4 × gradient accumulation 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |
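
To make the relationship between these values concrete, the following sketch captures the table as a plain dataclass and derives the effective batch size and an approximate number of optimizer updates. `TrainConfig` and its field names are hypothetical, a single GPU is assumed, and the exact step count depends on how a given trainer handles the last partial batch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "openai/whisper-small"
    learning_rate: float = 1e-5
    per_device_batch_size: int = 4
    grad_accum_steps: int = 8
    epochs: int = 3
    fp16: bool = True
    num_devices: int = 1  # assumption: a single T4

    @property
    def effective_batch_size(self) -> int:
        """Examples that contribute to each optimizer update."""
        return self.per_device_batch_size * self.grad_accum_steps * self.num_devices

    def optimizer_steps(self, num_train_examples: int) -> int:
        """Approximate number of optimizer updates over the whole run."""
        per_epoch = math.ceil(num_train_examples / self.effective_batch_size)
        return per_epoch * self.epochs


cfg = TrainConfig()
print(cfg.effective_batch_size)   # 32
print(cfg.optimizer_steps(4093))  # 384 for the 4,093 training segments
```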
Training loss progression:
| Epoch | Train Loss | Val Loss | WER |
|---|---|---|---|
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | 0.403 |
Post-processing Pipeline
1. Repetition Loop Detection
Collapses any token repeated 4 or more times in a row; this targets hallucinated repetition on noisy audio.
Example: a single token repeated 100 times collapses to one occurrence.
2. Spelling Normalization Dictionary
Maps common dialectal Hindi variants to standard spellings.
Example: वगैरा → वगैरह
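
A minimal sketch of both post-processing steps, assuming whitespace tokenization and single-token loops only. The names `collapse_repetitions`, `normalize_spelling`, and `SPELLING_MAP` are illustrative, and the map holds a single illustrative pair rather than the real dictionary.

```python
from itertools import groupby

MIN_REPEATS = 4  # runs of this length or longer are collapsed

# Illustrative entry only; the real dictionary is not part of this card.
SPELLING_MAP = {"वगैरा": "वगैरह"}


def collapse_repetitions(text: str, min_repeats: int = MIN_REPEATS) -> str:
    """Replace any run of >= min_repeats identical tokens with one token."""
    out = []
    for token, run in groupby(text.split()):
        count = sum(1 for _ in run)
        out.extend([token] if count >= min_repeats else [token] * count)
    return " ".join(out)


def normalize_spelling(text: str, mapping: dict[str, str] = SPELLING_MAP) -> str:
    """Swap whole tokens found in the mapping for their standard spelling."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


def postprocess(text: str) -> str:
    """Run repetition collapsing first, then spelling normalization."""
    return normalize_spelling(collapse_repetitions(text))


print(postprocess("हाँ " * 100 + "किताब वगैरा"))
```

Matching whole tokens keeps the normalization from rewriting substrings inside longer words. Shorter runs (fewer than 4 repeats) are left untouched, since natural speech does contain brief repetitions.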
Error Analysis (25 sampled validation errors)
| Error Type | Count | % |
|---|---|---|
| Phonetic Confusion | 10 | 40% |
| Spelling Variation | 7 | 28% |
| English Loanword Error | 4 | 16% |
| Filler Word Confusion | 3 | 12% |
| Hallucination / Repetition | 1 | 4% |
Evaluation: Lattice-Based WER
Evaluation uses a bin-based lattice with multiple alternatives per position: each reference position accepts any valid alternative (numeric, synonymous, or dialectal), so equivalent transcriptions are not counted as errors.
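
One way to realize this idea is a standard edit-distance WER in which each reference position is a set of acceptable tokens. The sketch below assumes exactly one bin per reference word (no multi-word alternatives). `lattice_wer` and the example bins are illustrative, not the project's actual implementation.

```python
def lattice_wer(reference_bins: list[set[str]], hypothesis: list[str]) -> float:
    """WER where each reference position is a set of acceptable tokens."""
    if not reference_bins:
        raise ValueError("reference must contain at least one bin")

    prev = list(range(len(hypothesis) + 1))
    for i, alternatives in enumerate(reference_bins, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, token in enumerate(hypothesis, start=1):
            # A token is correct if it matches ANY alternative in the bin.
            match_cost = 0 if token in alternatives else 1
            curr[j] = min(prev[j - 1] + match_cost,  # match or substitution
                          prev[j] + 1,               # deletion
                          curr[j - 1] + 1)           # insertion
        prev = curr
    return prev[-1] / len(reference_bins)


# Hypothetical bins: a number written as a word or as digits, and two synonyms.
reference = [{"मुझे"}, {"दो", "2", "२"}, {"किताबें", "पुस्तकें"}, {"चाहिए"}]
print(lattice_wer(reference, "मुझे 2 किताबें चाहिए".split()))    # all bins matched
print(lattice_wer(reference, "मुझे तीन किताबें चाहिए".split()))   # one substitution
```

With a plain single-reference WER, the first hypothesis would be penalized for writing the number as a digit. With bins it scores 0.0, while a genuinely wrong word still counts as one error in four.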
Limitations
- Optimized for conversational Hindi; may underperform on formal/broadcast speech
- English loanword transcription remains a known weak point
- Model weights not released due to proprietary training data (Josh Talks internal dataset)
Links
- GitHub: Daksh159/VaaniAI
- Kaggle Notebook: josh-talks-q1-preprocessing