Daksh159 committed · Commit 3a897f3 · verified · 1 Parent(s): d933bc2

Update README.md
Files changed (1): README.md (+116 −3)
---
language:
- hi
license: apache-2.0
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- hindi
- whisper
- fine-tuned
- conversational-speech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# 🎙️ VaaniAI — Whisper-small Fine-tuned on Hindi Conversational Speech

A fine-tuned version of `openai/whisper-small`, trained on real-world Hindi conversational audio collected from **102 speakers** across India as part of an AI Researcher Intern assignment at **Josh Talks**.

---

## 📊 Model Performance

| Metric | Value |
|--------|-------|
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | **0.4028** |
| Relative WER improvement | ↓ **67.8%** |
| Additional WER reduction from post-processing | ↓ **27.7%** |

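For reference, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, so it can exceed 1.0 when the hypothesis contains many insertions (as with the baseline above). A minimal pure-Python sketch of the metric (the project may well compute it with a library such as `jiwer` instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of five reference words → WER 0.2.
print(wer("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # 0.2
```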
---

## 🗂️ Training Data

| Property | Value |
|----------|-------|
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Segments (after cleaning) | 4,442 |
| Raw segments | 5,941 |
| Train / validation split | 4,093 / 349 |

**Cleaning steps applied:**
- Removed 209 segments labeled REDACTED
- Removed 1,012 clips shorter than 1 second
- Removed 878 segments with transcripts of fewer than 5 characters
- Resampled all audio from 44,100 Hz to 16,000 Hz

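The filtering rules above can be sketched as a single predicate over segment records. The field names (`text`, `duration`, `label`) are assumptions for illustration and may not match the actual dataset schema; resampling would be a separate pass with e.g. `librosa` or `torchaudio`:

```python
def keep_segment(seg: dict) -> bool:
    """Apply the three cleaning rules described above to one segment."""
    if seg.get("label") == "REDACTED":     # drop redacted segments
        return False
    if seg.get("duration", 0.0) < 1.0:     # drop sub-1-second clips
        return False
    if len(seg.get("text", "")) < 5:       # drop very short transcripts
        return False
    return True

raw = [
    {"text": "नमस्ते, आप कैसे हैं", "duration": 3.2, "label": "ok"},
    {"text": "आ", "duration": 2.0, "label": "ok"},                   # < 5 chars
    {"text": "यह एक लंबा वाक्य है", "duration": 0.4, "label": "ok"},  # < 1 s
    {"text": "निजी जानकारी यहाँ थी", "duration": 2.5, "label": "REDACTED"},
]
clean = [s for s in raw if keep_segment(s)]  # keeps only the first segment
print(len(clean))  # 1
```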
---

## ⚙️ Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (per-device batch 4 × gradient accumulation 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |

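These values map directly onto the usual `transformers` `Seq2SeqTrainingArguments` fields. A hypothetical reconstruction of the configuration (the actual training script is not released, so treat the key names as the standard API names rather than the author's exact code):

```python
# Hypothetical fine-tuning configuration; keys mirror the parameter names
# of transformers.Seq2SeqTrainingArguments.
training_config = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,  # 4 × 8 = effective batch of 32
    "num_train_epochs": 3,
    "fp16": True,                      # mixed precision on the T4 GPU
}

effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```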
**Training loss progression:**

| Epoch | Train Loss | Val Loss | WER |
|-------|-----------|----------|-----|
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | **0.403** |

---

## 🧹 Post-processing Pipeline

**1. Repetition loop detection**
Collapses any token repeated 4 or more times in a row to a single occurrence, targeting Whisper's hallucination loops on noisy audio.
Example: `आ आ आ ... (100x)` → `आ`

**2. Spelling normalization dictionary**
Maps common dialectal Hindi spellings to their standard forms.
Example: `वगैरा` → `वगैरह`, `इदर` → `इधर`

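A minimal sketch of both steps, assuming whitespace-tokenized text; the two dictionary entries are only the illustrative examples from above, not the project's full normalization table:

```python
import re

# Illustrative subset of the dialectal-spelling dictionary.
SPELLING_MAP = {"वगैरा": "वगैरह", "इदर": "इधर"}

def collapse_repetitions(text: str, threshold: int = 4) -> str:
    """Collapse any run of >= threshold identical tokens to one occurrence."""
    # \1 back-references the captured token, so the pattern matches the
    # token followed by (threshold - 1) or more whitespace-separated repeats.
    pattern = re.compile(r"(\S+)(?:\s+\1){%d,}" % (threshold - 1))
    return pattern.sub(r"\1", text)

def normalize_spelling(text: str) -> str:
    """Map dialectal variants to standard spellings, token by token."""
    return " ".join(SPELLING_MAP.get(tok, tok) for tok in text.split())

def postprocess(text: str) -> str:
    return normalize_spelling(collapse_repetitions(text))

print(postprocess(" ".join(["आ"] * 100)))  # आ
print(postprocess("इदर आओ वगैरा"))  # इधर आओ वगैरह
```

Runs shorter than the threshold (e.g. a natural `ठीक ठीक`) are deliberately left untouched, since short repetitions are common in genuine conversational Hindi.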
---

## 🔍 Error Analysis (25 sampled validation errors)

| Error Type | Count | Share |
|------------|-------|-------|
| Phonetic confusion | 10 | 40% |
| Spelling variation | 7 | 28% |
| English loanword error | 4 | 16% |
| Filler-word confusion | 3 | 12% |
| Hallucination / repetition | 1 | 4% |

---

## ⚖️ Evaluation: Lattice-Based WER

Standard WER penalizes any surface-form mismatch. For fairer scoring, evaluation also uses a **multi-alternative, bin-based lattice**: each reference position holds a set of acceptable alternatives (numeric forms, synonyms, dialectal spellings), and a hypothesis token matching any member of the bin counts as correct.

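A minimal sketch of the idea: each reference position is a bin (set) of acceptable tokens, and the edit-distance substitution cost is zero whenever the hypothesis token falls inside the bin. The bin contents below are made-up examples, not the project's actual alternative lists:

```python
def lattice_wer(ref_bins: list, hyp: list) -> float:
    """WER where each reference position accepts any token in its bin (a set)."""
    R, H = len(ref_bins), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            # Zero substitution cost if the hypothesis token is any
            # acceptable alternative at this reference position.
            sub = 0 if hyp[j - 1] in ref_bins[i - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[R][H] / max(R, 1)

# "2" and "दो" are both accepted at the first position, so this scores 0.0
# where plain WER would count a substitution.
bins = [{"दो", "2"}, {"लोग"}, {"आये", "आए"}]
print(lattice_wer(bins, ["2", "लोग", "आए"]))  # 0.0
```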
---

## ⚠️ Limitations

- Optimized for conversational Hindi; may underperform on formal or broadcast speech
- English loanword transcription remains a known weak point
- Model weights are not released because the training data (Josh Talks internal dataset) is proprietary

---

## 🔗 Links

- 💻 GitHub: [Daksh159/VaaniAI](https://github.com/Daksh159/VaaniAI)
- 📓 Kaggle Notebook: [josh-talks-q1-preprocessing](https://www.kaggle.com/code/daksh159/josh-talks-q1-preprocessing)