---
language:
- hi
license: apache-2.0
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- hindi
- whisper
- fine-tuned
- conversational-speech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# 🎙️ VaaniAI: Whisper-small Fine-tuned on Hindi Conversational Speech

Fine-tuned version of `openai/whisper-small` on real-world Hindi conversational audio collected across **102 speakers** from India, as part of an AI Researcher Intern assignment at **Josh Talks**.

---

## 📊 Model Performance

| Metric | Value |
|--------|-------|
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | **0.4028** |
| WER Improvement | ↓ **67.8%** |
| Post-processing WER gain | ↓ additional **27.7%** |

---

## 🗂️ Training Data

| Property | Value |
|----------|-------|
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Segments (after cleaning) | 4,442 |
| Raw segments | 5,941 |
| Train / Val split | 4,093 / 349 |

**Cleaning steps applied:**

- Removed 209 REDACTED-label segments
- Removed 1,012 sub-1-second clips
- Removed 878 segments with fewer than 5 characters
- Resampled all audio from 44,100 Hz → 16,000 Hz

---

## ⚙️ Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (batch 4 × grad accum 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |

**Training loss progression:**

| Epoch | Train Loss | Val Loss | WER |
|-------|-----------|----------|-----|
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | **0.403** |

---

## 🧹 Post-processing Pipeline

**1. Repetition Loop Detection**

Collapses tokens repeated 4+ times; targets hallucination on noisy audio.
Example: `आ आ आ... (100x)` → `आ`

**2. Spelling Normalization Dictionary**

Maps common dialectal Hindi variants to standard spellings.
Example: `वगैरा` → `वगैरह`, `इदर` → `इधर`

---

## 🔍 Error Analysis (25 sampled validation errors)

| Error Type | Count | % |
|------------|-------|---|
| Phonetic Confusion | 10 | 40% |
| Spelling Variation | 7 | 28% |
| English Loanword Error | 4 | 16% |
| Filler Word Confusion | 3 | 12% |
| Hallucination / Repetition | 1 | 4% |

---

## ⚖️ Evaluation: Lattice-Based WER

Implemented a **multi-alternative bin-based lattice** where each position accepts all valid alternatives (numeric, synonymous, dialectal) for fairer evaluation.

---

## ⚠️ Limitations

- Optimized for conversational Hindi; may underperform on formal/broadcast speech
- English loanword transcription remains a known weak point
- Model weights not released due to proprietary training data (Josh Talks internal dataset)

---

## 🔗 Links

- 💻 GitHub: [Daksh159/VaaniAI](https://github.com/Daksh159/VaaniAI)
- 📓 Kaggle Notebook: [josh-talks-q1-preprocessing](https://www.kaggle.com/code/daksh159/josh-talks-q1-preprocessing)
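
---

## 🧪 Sketch: Data Cleaning Filters

The cleaning steps listed under Training Data can be approximated as per-segment filters. This is a minimal sketch, not the project's actual preprocessing code: the `Segment` class, its field names, and the assumption that redacted segments carry the literal string `REDACTED` in their transcript are all hypothetical. Resampling from 44,100 Hz to 16,000 Hz is left out because it depends on the audio library in use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical record for one transcribed audio segment."""
    text: str
    duration_s: float


# Each rule returns True when the segment should be dropped.
# Thresholds follow the model card; the REDACTED check is an assumption.
# len() counts Unicode code points, so Devanagari vowel signs count as characters.
RULES = {
    "redacted": lambda s: "REDACTED" in s.text,
    "too_short_audio": lambda s: s.duration_s < 1.0,
    "too_short_text": lambda s: len(s.text.strip()) < 5,
}


def clean(segments):
    """Return (kept segments, per-rule hit counts)."""
    hits = Counter()
    kept = []
    for seg in segments:
        matched = [name for name, rule in RULES.items() if rule(seg)]
        hits.update(matched)  # a segment may match several rules
        if not matched:
            kept.append(seg)
    return kept, hits
```

Each rule is counted independently here, so one segment can appear under several rules. That would be consistent with the reported per-rule counts (209 + 1,012 + 878 = 2,099) exceeding the 1,499 segments removed overall (5,941 to 4,442), though the source does not say how its counts were tallied.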
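
---

## 🧪 Sketch: WER and Relative Improvement

The figures in the Model Performance table can be related with a few lines of arithmetic. The sketch below uses the standard definition of word error rate (word-level edit distance divided by the number of reference words); the source does not say which tool computed its scores, so the function names here are illustrative only. Because insertions count as errors, WER can exceed 1.0, which is how a baseline of 1.2537 is possible.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Fraction of the baseline error removed by fine-tuning."""
    return (baseline - finetuned) / baseline


# Using the rounded WERs from the table: (1.2537 - 0.4028) / 1.2537
print(round(relative_reduction(1.2537, 0.4028), 3))  # 0.679
```

With the rounded table values this gives about 68%, in line with the reported improvement up to rounding.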
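
---

## 🧪 Sketch: Post-processing Steps

The two post-processing steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: tokens are taken to be whitespace-separated words, only consecutive repeats of a single token are collapsed (multi-word loops are not handled), and `SPELLING_MAP` contains just the two example pairs from this card instead of the full dictionary.

```python
from itertools import groupby

MIN_RUN = 4  # "repeated 4+ times", per the model card

# Only the two example mappings given above; the real dictionary is not published here.
SPELLING_MAP = {
    "वगैरा": "वगैरह",
    "इदर": "इधर",
}


def collapse_repetitions(text: str, min_run: int = MIN_RUN) -> str:
    """Collapse any run of min_run or more identical consecutive tokens to one token."""
    out = []
    for token, group in groupby(text.split()):
        run_length = len(list(group))
        if run_length >= min_run:
            out.append(token)  # hallucination-style loop: keep one copy
        else:
            out.extend([token] * run_length)  # short repeats may be genuine speech
    return " ".join(out)


def normalize_spelling(text: str, mapping: dict = SPELLING_MAP) -> str:
    """Replace each token that has a dictionary entry with its standard spelling."""
    return " ".join(mapping.get(token, token) for token in text.split())


def postprocess(text: str) -> str:
    return normalize_spelling(collapse_repetitions(text))
```

Runs shorter than the threshold are left alone because conversational Hindi contains genuine repetitions, for example a speaker saying the same word twice for emphasis.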
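
---

## 🧪 Sketch: Lattice-Based WER

One way to read the bin-based lattice described under Evaluation is as an ordinary WER alignment in which each reference position holds a set of acceptable tokens instead of a single word. The sketch below follows that reading. It is an assumption about the approach, not the project's implementation: it treats every bin as matching exactly one hypothesis token, and the function name and example bins are hypothetical.

```python
def lattice_wer(bins, hypothesis: str) -> float:
    """WER where a hypothesis token is correct if it matches any alternative in a bin.

    bins: list of sets, one per reference position, each holding every
          accepted variant (numeric, synonymous, dialectal) for that position.
    """
    if not bins:
        raise ValueError("reference lattice must contain at least one bin")
    hyp = hypothesis.split()
    # prev[j] = edit distance between the bins seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, alternatives in enumerate(bins, start=1):
        curr = [i]
        for j, token in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if token in alternatives else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(bins)


# The second position accepts either the word or the digit form of "two".
reference_bins = [{"मुझे"}, {"दो", "2"}, {"किताबें"}]
print(lattice_wer(reference_bins, "मुझे 2 किताबें"))  # 0.0
```

Under plain WER the digit `2` would count as a substitution against `दो`; accepting both in one bin removes that penalty while still counting genuinely wrong words.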