---
language:
- hi
license: apache-2.0
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- hindi
- whisper
- fine-tuned
- conversational-speech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# VaaniAI – Whisper-small Fine-tuned on Hindi Conversational Speech

Fine-tuned version of `openai/whisper-small` on real-world Hindi conversational audio collected from **102 speakers** across India, as part of an AI Researcher Intern assignment at **Josh Talks**.

---

## Model Performance

| Metric | Value |
|--------|-------|
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | **0.4028** |
| WER Improvement | ↓ **67.8%** |
| Post-processing WER gain | ↓ additional **27.7%** |
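The headline improvement can be recomputed directly from the two WER rows; a quick sanity check in Python (variable names are ours):

```python
# Relative WER reduction implied by the baseline and fine-tuned scores above.
baseline_wer = 1.2537   # openai/whisper-small, zero-shot
finetuned_wer = 0.4028  # after 3 epochs of fine-tuning

improvement = 1 - finetuned_wer / baseline_wer  # ≈ 0.6787
```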

---

## Training Data

| Property | Value |
|----------|-------|
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Raw segments | 5,941 |
| Segments (after cleaning) | 4,442 |
| Train / Val split | 4,093 / 349 |

**Cleaning steps applied:**
- Removed 209 REDACTED-label segments
- Removed 1,012 sub-1-second clips
- Removed 878 segments with fewer than 5 characters
- Resampled all audio from 44,100 Hz → 16,000 Hz
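A minimal sketch of the three filter rules, assuming each raw segment is a dict with hypothetical `label`, `duration` (seconds), and `text` keys; the actual pipeline and field names may differ, and resampling to 16 kHz is applied to the audio separately:

```python
def clean_segments(segments):
    """Keep only segments that pass all three filter rules above."""
    kept = []
    for seg in segments:
        if seg["label"] == "REDACTED":  # privacy-redacted segments
            continue
        if seg["duration"] < 1.0:       # sub-1-second clips
            continue
        if len(seg["text"]) < 5:        # transcripts under 5 characters
            continue
        kept.append(seg)
    return kept

# Toy illustration (not real dataset entries):
raw = [
    {"label": "ok", "duration": 3.2, "text": "नमस्ते, कैसे हो"},
    {"label": "REDACTED", "duration": 2.0, "text": "..."},
    {"label": "ok", "duration": 0.4, "text": "हाँ"},
    {"label": "ok", "duration": 5.0, "text": "ठीक"},  # only 3 characters
]
print(len(clean_segments(raw)))  # 1
```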

---

## Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (batch 4 × grad accum 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |
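The batch arithmetic implied by the table, assuming the 4,093-segment training split and a rounded-up final step per epoch:

```python
import math

per_device_batch = 4   # micro-batch per step
grad_accum_steps = 8   # gradient accumulation
train_segments = 4093  # training split size

effective_batch = per_device_batch * grad_accum_steps        # 32
steps_per_epoch = math.ceil(train_segments / effective_batch)
print(effective_batch, steps_per_epoch)  # 32 128
```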

**Training loss progression:**

| Epoch | Train Loss | Val Loss | WER |
|-------|------------|----------|-----|
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | **0.403** |

---

## Post-processing Pipeline

**1. Repetition Loop Detection**
Collapses tokens repeated 4+ times into a single occurrence, targeting hallucination loops on noisy audio.
Example: a single token repeated 100 times collapses to one occurrence.
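A minimal regex-based sketch of the repetition collapse, assuming whitespace-separated tokens; this illustrates the rule, not necessarily the original implementation:

```python
import re

def collapse_repetitions(text: str, min_repeats: int = 4) -> str:
    """Reduce any token repeated min_repeats+ times in a row to one copy."""
    # (\S+) captures a token; (?:\s+\1){3,} matches 3+ further copies of it,
    # so the whole match is the token repeated at least min_repeats times.
    pattern = re.compile(r"\b(\S+)(?:\s+\1){%d,}" % (min_repeats - 1))
    return pattern.sub(r"\1", text)
```

Runs below the threshold (three repeats or fewer) are left untouched, so genuine short repetitions in conversational speech survive.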

**2. Spelling Normalization Dictionary**
Maps common dialectal Hindi variants to standard spellings.
Example: `वगैरा` → `वगैरह`, `इदर` → `इधर`
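A minimal sketch of the dictionary lookup; the two entries are illustrative dialectal pairs (वगैरा/वगैरह "etcetera", इदर/इधर "here"), and a real table would hold many more:

```python
# Dialectal spelling -> standard spelling (illustrative entries only).
NORMALIZATION = {
    "वगैरा": "वगैरह",
    "इदर": "इधर",
}

def normalize_spelling(text: str) -> str:
    # Token-level lookup keeps substitutions from firing inside longer words.
    return " ".join(NORMALIZATION.get(tok, tok) for tok in text.split())
```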

---

## Error Analysis (25 sampled validation errors)

| Error Type | Count | % |
|------------|-------|---|
| Phonetic Confusion | 10 | 40% |
| Spelling Variation | 7 | 28% |
| English Loanword Error | 4 | 16% |
| Filler Word Confusion | 3 | 12% |
| Hallucination / Repetition | 1 | 4% |

---

## Evaluation: Lattice-Based WER

Implemented a **multi-alternative bin-based lattice** in which each reference position ("bin") accepts all valid alternatives (numeric forms, synonyms, dialectal spellings), so the model is not penalized for acceptable surface variation.
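The bin-based idea can be sketched as a standard Levenshtein alignment whose reference positions are sets of acceptable tokens; this illustrates the concept, not the project's exact implementation:

```python
def lattice_wer(reference_bins, hypothesis):
    """WER where each reference position is a set of acceptable tokens."""
    n, m = len(reference_bins), len(hypothesis)
    # dp[i][j] = min edits aligning first i bins with first j hypothesis tokens
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A hypothesis token "matches" if it is any member of the bin.
            sub = 0 if hypothesis[j - 1] in reference_bins[i - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[n][m] / n

ref = [{"two", "2"}, {"o'clock"}]
print(lattice_wer(ref, ["2", "o'clock"]))  # 0.0
```

Here `"2"` scores as correct even though the reference surface form was `"two"`, which is exactly the fairness adjustment the lattice provides.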

---

## Limitations

- Optimized for conversational Hindi; may underperform on formal or broadcast speech
- English loanword transcription remains a known weak point
- Model weights are not released because the training data is proprietary (Josh Talks internal dataset)

---

## Links

- GitHub: [Daksh159/VaaniAI](https://github.com/Daksh159/VaaniAI)
- Kaggle Notebook: [josh-talks-q1-preprocessing](https://www.kaggle.com/code/daksh159/josh-talks-q1-preprocessing)