---
language:
- hi
license: apache-2.0
base_model: openai/whisper-small
tags:
- automatic-speech-recognition
- hindi
- whisper
- fine-tuned
- conversational-speech
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# VaaniAI – Whisper-small Fine-tuned on Hindi Conversational Speech

Fine-tuned version of `openai/whisper-small` on real-world Hindi conversational audio collected from **102 speakers** across India, as part of an AI Researcher Intern assignment at **Josh Talks**.

---

## Model Performance

| Metric | Value |
|--------|-------|
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | **0.4028** |
| WER Improvement | ↓ **67.8%** |
| Post-processing WER gain | ↓ additional **27.7%** |
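The headline improvement can be recomputed directly from the two WER rows; a quick sanity check in Python (variable names are ours):

```python
# Relative WER reduction implied by the baseline and fine-tuned scores above.
baseline_wer = 1.2537   # openai/whisper-small, zero-shot
finetuned_wer = 0.4028  # after 3 epochs of fine-tuning

improvement = 1 - finetuned_wer / baseline_wer  # ≈ 0.6787
```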

---

## Training Data

| Property | Value |
|----------|-------|
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Raw segments | 5,941 |
| Segments (after cleaning) | 4,442 |
| Train / Val split | 4,093 / 349 |

**Cleaning steps applied:**
- Removed 209 REDACTED-label segments
- Removed 1,012 sub-1-second clips
- Removed 878 segments with fewer than 5 characters
- Resampled all audio from 44,100 Hz → 16,000 Hz
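A minimal sketch of the three filter rules, assuming each raw segment is a dict with hypothetical `label`, `duration` (seconds), and `text` keys; the actual pipeline and field names may differ, and resampling to 16 kHz is applied to the audio separately:

```python
def clean_segments(segments):
    """Keep only segments that pass all three filter rules above."""
    kept = []
    for seg in segments:
        if seg["label"] == "REDACTED":  # privacy-redacted segments
            continue
        if seg["duration"] < 1.0:       # sub-1-second clips
            continue
        if len(seg["text"]) < 5:        # transcripts under 5 characters
            continue
        kept.append(seg)
    return kept

# Toy illustration (not real dataset entries):
raw = [
    {"label": "ok", "duration": 3.2, "text": "नमस्ते, कैसे हो"},
    {"label": "REDACTED", "duration": 2.0, "text": "..."},
    {"label": "ok", "duration": 0.4, "text": "हाँ"},
    {"label": "ok", "duration": 5.0, "text": "ठीक"},  # only 3 characters
]
print(len(clean_segments(raw)))  # 1
```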

---

## Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (batch 4 × grad accum 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |
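The batch arithmetic implied by the table, assuming the 4,093-segment training split and a rounded-up final step per epoch:

```python
import math

per_device_batch = 4   # micro-batch per step
grad_accum_steps = 8   # gradient accumulation
train_segments = 4093  # training split size

effective_batch = per_device_batch * grad_accum_steps        # 32
steps_per_epoch = math.ceil(train_segments / effective_batch)
print(effective_batch, steps_per_epoch)  # 32 128
```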

**Training loss progression:**

| Epoch | Train Loss | Val Loss | WER |
|-------|------------|----------|-----|
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | **0.403** |

---

## Post-processing Pipeline

**1. Repetition Loop Detection**
Collapses tokens repeated 4+ times into a single occurrence, targeting hallucination loops on noisy audio.
Example: a single token repeated 100 times collapses to one occurrence.
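A minimal regex-based sketch of the repetition collapse, assuming whitespace-separated tokens; this illustrates the rule, not necessarily the original implementation:

```python
import re

def collapse_repetitions(text: str, min_repeats: int = 4) -> str:
    """Reduce any token repeated min_repeats+ times in a row to one copy."""
    # (\S+) captures a token; (?:\s+\1){3,} matches 3+ further copies of it,
    # so the whole match is the token repeated at least min_repeats times.
    pattern = re.compile(r"\b(\S+)(?:\s+\1){%d,}" % (min_repeats - 1))
    return pattern.sub(r"\1", text)
```

Runs below the threshold (three repeats or fewer) are left untouched, so genuine short repetitions in conversational speech survive.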

**2. Spelling Normalization Dictionary**
Maps common dialectal Hindi variants to standard spellings.
Example: `वगैरा` → `वगैरह`, `इदर` → `इधर`
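A minimal sketch of the dictionary lookup; the two entries are illustrative dialectal pairs (वगैरा/वगैरह "etcetera", इदर/इधर "here"), and a real table would hold many more:

```python
# Dialectal spelling -> standard spelling (illustrative entries only).
NORMALIZATION = {
    "वगैरा": "वगैरह",
    "इदर": "इधर",
}

def normalize_spelling(text: str) -> str:
    # Token-level lookup keeps substitutions from firing inside longer words.
    return " ".join(NORMALIZATION.get(tok, tok) for tok in text.split())
```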

---

## Error Analysis (25 sampled validation errors)

| Error Type | Count | % |
|------------|-------|---|
| Phonetic Confusion | 10 | 40% |
| Spelling Variation | 7 | 28% |
| English Loanword Error | 4 | 16% |
| Filler Word Confusion | 3 | 12% |
| Hallucination / Repetition | 1 | 4% |

---

## Evaluation: Lattice-Based WER

Implemented a **multi-alternative bin-based lattice** in which each reference position ("bin") accepts all valid alternatives (numeric forms, synonyms, dialectal spellings), so the model is not penalized for acceptable surface variation.
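The bin-based idea can be sketched as a standard Levenshtein alignment whose reference positions are sets of acceptable tokens; this illustrates the concept, not the project's exact implementation:

```python
def lattice_wer(reference_bins, hypothesis):
    """WER where each reference position is a set of acceptable tokens."""
    n, m = len(reference_bins), len(hypothesis)
    # dp[i][j] = min edits aligning first i bins with first j hypothesis tokens
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A hypothesis token "matches" if it is any member of the bin.
            sub = 0 if hypothesis[j - 1] in reference_bins[i - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[n][m] / n

ref = [{"two", "2"}, {"o'clock"}]
print(lattice_wer(ref, ["2", "o'clock"]))  # 0.0
```

Here `"2"` scores as correct even though the reference surface form was `"two"`, which is exactly the fairness adjustment the lattice provides.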

---

## Limitations

- Optimized for conversational Hindi; may underperform on formal or broadcast speech
- English loanword transcription remains a known weak point
- Model weights are not released because the training data is proprietary (Josh Talks internal dataset)

---

## Links

- GitHub: [Daksh159/VaaniAI](https://github.com/Daksh159/VaaniAI)
- Kaggle Notebook: [josh-talks-q1-preprocessing](https://www.kaggle.com/code/daksh159/josh-talks-q1-preprocessing)