+---
+language:
+  - hi
+license: apache-2.0
+base_model: openai/whisper-small
+tags:
+  - automatic-speech-recognition
+  - hindi
+  - whisper
+  - fine-tuned
+  - conversational-speech
+metrics:
+  - wer
+pipeline_tag: automatic-speech-recognition
+---
+# 🎙️ VaaniAI — Whisper-small Fine-tuned on Hindi Conversational Speech
+A fine-tuned version of `openai/whisper-small`, trained on real-world Hindi conversational audio
+collected from **102 speakers** across India, as part of an AI Researcher Intern
+assignment at **Josh Talks**.
+
+---
+## 📊 Model Performance
+| Metric | Value |
+|--------|-------|
+| Baseline WER (Whisper-small) | 1.2537 |
+| Fine-tuned WER | **0.4028** |
+| Relative WER reduction | ↓ **67.9%** |
+| Post-processing WER gain | ↓ additional **27.7%** |
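As a worked check of the figures above, the following minimal sketch computes word error rate as word-level edit distance divided by reference length, then the relative reduction between two WER values. It is an illustrative implementation, not the evaluation script used for this model; the function names `wer` and `relative_reduction` are assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of row i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Values taken from the table above.
reduction = relative_reduction(1.2537, 0.4028)
```

Because insertions count as errors, WER is not capped at 1.0: a hypothesis with many spurious or repeated words can score above it, which is consistent with the 1.2537 baseline. With the two table values, `reduction` comes out at roughly 0.6787.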
+---
+## 🗂️ Training Data
+| Property | Value |
+|----------|-------|
+| Total audio | 11.44 hours |
+| Speakers | 102 unique speakers across India |
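+| Raw segments | 5,941 |
+| Segments (after cleaning) | 4,442 |
+| Train / Val split | 4,093 / 349 |
+
+**Cleaning steps applied:**
+
+- Removed 209 REDACTED-label segments
+- Removed 1,012 sub-1-second clips
+- Removed 878 segments with fewer than 5 characters
+- Resampled all audio from 44,100 Hz → 16,000 Hz
+
+The three removal counts sum to 2,099, while 5,941 − 4,442 = 1,499 segments were dropped overall, so some segments presumably matched more than one filter.
+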
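The cleaning steps above can be sketched as a single filter pass. This is a minimal illustration, not the notebook's actual code: the segment schema (`start`, `end`, `text`), the way a redacted segment is marked, and the helper names are all assumptions.

```python
from math import gcd

MIN_DURATION_S = 1.0  # drop sub-1-second clips
MIN_CHARS = 5         # drop transcripts with fewer than 5 characters


def keep_segment(seg: dict) -> bool:
    """Return True if a segment passes all three filters (hypothetical schema)."""
    text = seg["text"].strip()
    if "REDACTED" in text:  # assumed marker for redacted segments
        return False
    if seg["end"] - seg["start"] < MIN_DURATION_S:
        return False
    return len(text) >= MIN_CHARS


def resample_ratio(src_hz: int, dst_hz: int) -> tuple[int, int]:
    """Smallest integer (up, down) factors for rational resampling."""
    g = gcd(src_hz, dst_hz)
    return dst_hz // g, src_hz // g


segments = [
    {"start": 0.0, "end": 3.2, "text": "नमस्ते आप कैसे हैं"},
    {"start": 3.2, "end": 3.9, "text": "हाँ"},       # fails duration and length
    {"start": 4.0, "end": 6.0, "text": "REDACTED"},  # fails the redaction check
]
cleaned = [s for s in segments if keep_segment(s)]
up, down = resample_ratio(44_100, 16_000)
```

The second sample segment fails two filters at once, which is one way per-filter counts can add up to more than the number of segments removed. The `up` and `down` factors are what a polyphase resampler would take as its rational ratio for the 44,100 Hz → 16,000 Hz conversion.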
+---
+## ⚙️ Training Configuration
+| Hyperparameter | Value |
+|----------------|-------|
+| Base model | openai/whisper-small (241.7M params) |
+| Learning rate | 1e-5 |
+| Effective batch size | 32 (batch 4 × grad accum 8) |
+| Epochs | 3 |
+| Precision | FP16 |
+| Hardware | Kaggle T4 GPU (14.6 GB) |
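To make the batch arithmetic concrete, here is a small sketch that derives the effective batch size and an approximate optimizer step count from the table. The dictionary keys mirror parameter names of Hugging Face `Seq2SeqTrainingArguments`, but whether the original run used that class is an assumption, as is the single-GPU setup.

```python
import math

# Hyperparameters from the table above, keyed by the matching
# Seq2SeqTrainingArguments parameter names (assumed training API).
training_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-5,
    "num_train_epochs": 3,
    "fp16": True,
}
TRAIN_SEGMENTS = 4093  # training side of the train / val split

# One optimizer update sees batch_size * accumulation_steps examples (single GPU).
effective_batch = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)
steps_per_epoch = math.ceil(TRAIN_SEGMENTS / effective_batch)
total_steps = steps_per_epoch * training_config["num_train_epochs"]
```

Under these assumptions the run amounts to roughly 128 optimizer updates per epoch and 384 in total; the exact count depends on how the trainer handles the final partial batch.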
+
+**Training loss progression:**
+
+| Epoch | Train Loss | Val Loss | WER |
+|-------|-----------|----------|-----|
+| 1 | 13.22 | 0.657 | 0.546 |
+| 2 | 6.98 | 0.471 | 0.435 |
+| 3 | 5.07 | 0.414 | **0.403** |
+---
+## 🧹 Post-processing Pipeline
+**1. Repetition Loop Detection**
+
+Collapses any token repeated 4 or more times in a row into a single occurrence, which targets hallucinated loops on noisy audio.
+
+Example: `आ आ आ... (100x)` → `आ`
+
+**2. Spelling Normalization Dictionary**
+
+Maps common dialectal Hindi variants to standard spellings.
+
+Examples: `वगैरा` → `वगैरह`, `इदर` → `इधर`
+
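The two post-processing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it handles only single-token loops (not repeated multi-word phrases), splits on whitespace, and uses a two-entry dictionary built from the examples above. All function and constant names are hypothetical.

```python
from itertools import groupby

MIN_REPEATS = 4  # runs of this length or longer are treated as a loop

# Only the two pairs given as examples; a real dictionary would be larger.
SPELLING_MAP = {"वगैरा": "वगैरह", "इदर": "इधर"}


def collapse_repetitions(text: str, min_repeats: int = MIN_REPEATS) -> str:
    """Replace each run of >= min_repeats identical tokens with one token."""
    out = []
    for token, run in groupby(text.split()):
        count = len(list(run))
        out.extend([token] if count >= min_repeats else [token] * count)
    return " ".join(out)


def normalize_spelling(text: str) -> str:
    """Map known dialectal variants to their standard spelling, token by token."""
    return " ".join(SPELLING_MAP.get(tok, tok) for tok in text.split())


def postprocess(text: str) -> str:
    return normalize_spelling(collapse_repetitions(text))
```

Shorter runs are left untouched on purpose, since natural speech often repeats a word two or three times for emphasis.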
+---
+## 🔍 Error Analysis (25 sampled validation errors)
+| Error Type | Count | % |
+|------------|-------|---|
+| Phonetic Confusion | 10 | 40% |
+| Spelling Variation | 7 | 28% |
+| English Loanword Error | 4 | 16% |
+| Filler Word Confusion | 3 | 12% |
+| Hallucination / Repetition | 1 | 4% |
+---
+## ⚖️ Evaluation: Lattice-Based WER
+Implemented a **multi-alternative, bin-based lattice** for WER scoring: each reference position (bin) accepts any of its valid alternatives (numeric, synonymous, or dialectal), so a hypothesis is not penalized for choosing an equally correct form.
+
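One way to read the lattice idea is as an ordinary edit-distance alignment in which each reference position is a set of acceptable tokens rather than a single word. The sketch below follows that reading; it is an assumption about the approach, not the author's code, and it is simplified to single-token alternatives. The name `lattice_wer` and the example bins are hypothetical.

```python
def lattice_wer(bins: list[set[str]], hypothesis: list[str]) -> float:
    """WER where a hypothesis token matches a bin if it is any listed alternative."""
    if not bins:
        raise ValueError("reference lattice must contain at least one bin")
    prev = list(range(len(hypothesis) + 1))
    for i, alternatives in enumerate(bins, start=1):
        curr = [i]
        for j, tok in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                             # deletion of a bin
                curr[j - 1] + 1,                         # insertion of a token
                prev[j - 1] + (tok not in alternatives),  # match costs 0
            ))
        prev = curr
    return prev[-1] / len(bins)


# Illustrative bins: the third accepts word and digit forms of "two",
# the fourth accepts two synonyms for "books".
bins = [{"मेरे"}, {"पास"}, {"दो", "2", "२"}, {"किताबें", "पुस्तकें"}, {"हैं"}]
```

With these bins, a hypothesis that writes the number as a digit or uses either synonym scores the same as one that matches the first-listed form, while a genuinely wrong word at one of the five positions yields a WER of 0.2.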
+---
+## ⚠️ Limitations
+- Optimized for conversational Hindi; may underperform on formal/broadcast speech
+- English loanword transcription remains a known weak point
+- Model weights not released due to proprietary training data (Josh Talks internal dataset)
+---
+## 🔗 Links
+- 💻 GitHub: [Daksh159/VaaniAI](https://github.com/Daksh159/VaaniAI)
+- 📓 Kaggle Notebook: [josh-talks-q1-preprocessing](https://www.kaggle.com/code/daksh159/josh-talks-q1-preprocessing)