
language:
  - hi
license: apache-2.0
base_model: openai/whisper-small
tags:
  - automatic-speech-recognition
  - hindi
  - whisper
  - fine-tuned
  - conversational-speech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition

🎙️ VaaniAI — Whisper-small Fine-tuned on Hindi Conversational Speech

A fine-tuned version of openai/whisper-small, trained on real-world Hindi conversational audio from 102 speakers across India as part of an AI Researcher Intern assignment at Josh Talks.

📊 Model Performance

Metric	Value
Baseline WER (Whisper-small)	1.2537
Fine-tuned WER	0.4028
WER Improvement (relative)	↓ 67.9%
Post-processing WER gain	↓ additional 27.7%
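
For readers checking the numbers, the sketch below shows the standard word-level WER definition, (substitutions + deletions + insertions) / reference length, and how a relative improvement follows from two WER values. The function names `wer` and `relative_improvement` are illustrative and not taken from the project's code; the evaluation script behind the reported figures may normalize text differently. Because insertions count as errors, WER can exceed 1.0, which is how a baseline of 1.2537 is possible.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, finetuned: float) -> float:
    """Fraction of the baseline error removed by fine-tuning."""
    return (baseline - finetuned) / baseline


# Three inserted words against a two-word reference give WER > 1.
print(wer("हाँ जी", "हाँ जी हाँ जी हाँ"))  # 1.5

# Relative improvement from the rounded table values.
print(round(relative_improvement(1.2537, 0.4028), 2))  # 0.68
```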

🗂️ Training Data

Property	Value
Total audio	11.44 hours
Speakers	102 unique speakers across India
Segments (after cleaning)	4,442
Raw segments	5,941
Train / Val split	4,093 / 349

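Cleaning steps applied (1,499 segments removed in total; the per-step counts below sum to 2,099, so the filters appear to overlap):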

Removed 209 REDACTED-label segments
Removed 1,012 sub-1-second clips
Removed 878 segments with fewer than 5 characters
Resampled all audio from 44,100 Hz to 16,000 Hz
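
The filters above could be expressed roughly as follows. This is a minimal sketch, not the project's preprocessing code: the `Segment` fields and the `keep_segment` helper are hypothetical, and it assumes the REDACTED label appears in the transcript text. Resampling itself is not shown; the last lines only derive the 160/441 ratio that a polyphase resampler would need for 44,100 Hz to 16,000 Hz.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Segment:
    text: str     # transcript for the segment (hypothetical field name)
    start: float  # start time in seconds
    end: float    # end time in seconds


MIN_DURATION_S = 1.0  # drop sub-1-second clips
MIN_CHARS = 5         # drop transcripts with fewer than 5 characters


def keep_segment(seg: Segment) -> bool:
    """Return True if the segment passes all three cleaning filters."""
    text = seg.text.strip()
    if "REDACTED" in text:
        return False
    if seg.end - seg.start < MIN_DURATION_S:
        return False
    # len() counts Unicode code points, so Devanagari vowel signs count
    # separately; the original threshold may have been measured differently.
    if len(text) < MIN_CHARS:
        return False
    return True


segments = [
    Segment("नमस्ते आप कैसे हैं", 0.0, 2.5),  # kept
    Segment("REDACTED", 2.5, 4.0),          # removed: redacted label
    Segment("हाँ", 4.0, 4.4),                # removed: shorter than 1 s
    Segment("जी", 5.0, 7.0),                # removed: fewer than 5 characters
]
cleaned = [s for s in segments if keep_segment(s)]
print(len(cleaned))  # 1

# Up/down factors for resampling 44,100 Hz audio to 16,000 Hz.
ratio = Fraction(16_000, 44_100)
print(ratio.numerator, ratio.denominator)  # 160 441
```

A single segment can fail more than one filter, so per-filter counts need not add up to the total number of removed segments.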

⚙️ Training Configuration

Hyperparameter	Value
Base model	openai/whisper-small (241.7M params)
Learning rate	1e-5
Effective batch size	32 (batch 4 × grad accum 8)
Epochs	3
Precision	FP16
Hardware	Kaggle T4 GPU (14.6 GB)
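
As a rough illustration, the table maps onto Hugging Face `Seq2SeqTrainingArguments` keyword names as shown below. The dictionary is a sketch built only from the values in the table; settings this README does not state (warmup, evaluation schedule, output paths) are left out. The step counts assume a single GPU and the 4,093 training segments listed under Training Data.

```python
import math

# Keys follow Seq2SeqTrainingArguments keyword names; values come from the table.
train_args = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "fp16": True,
}

NUM_TRAIN_SEGMENTS = 4093  # train split size from the Training Data section

effective_batch = (
    train_args["per_device_train_batch_size"]
    * train_args["gradient_accumulation_steps"]
)
steps_per_epoch = math.ceil(NUM_TRAIN_SEGMENTS / effective_batch)
total_steps = steps_per_epoch * train_args["num_train_epochs"]

print(effective_batch, steps_per_epoch, total_steps)  # 32 128 384
```

Gradient accumulation keeps per-step memory at the batch-of-4 level, which suits a T4, while the optimizer still sees gradients averaged over 32 examples.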

Training progression by epoch:

Epoch	Train Loss	Val Loss	WER
1	13.22	0.657	0.546
2	6.98	0.471	0.435
3	5.07	0.414	0.403

🧹 Post-processing Pipeline

1. Repetition Loop Detection: collapses tokens repeated 4 or more times into a single occurrence, targeting hallucinated loops on noisy audio. Example: आ आ आ... (100x) → आ

2. Spelling Normalization Dictionary: maps common dialectal Hindi variants to their standard spellings. Examples: वगैरा → वगैरह, इदर → इधर
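
The two steps can be sketched as below. This is an assumption-laden illustration rather than the project's implementation: it treats "repeated" as consecutive whitespace-separated tokens, the function names are hypothetical, and `SPELLING_MAP` contains only the two pairs given as examples above.

```python
from itertools import groupby

# Only the two example pairs from this README; the real dictionary is larger.
SPELLING_MAP = {
    "वगैरा": "वगैरह",
    "इदर": "इधर",
}


def collapse_repetitions(text: str, min_run: int = 4) -> str:
    """Collapse runs of the same token of length >= min_run to one token."""
    tokens = []
    for token, run in groupby(text.split()):
        count = sum(1 for _ in run)
        # Short runs are kept as-is: natural speech does repeat words.
        tokens.extend([token] if count >= min_run else [token] * count)
    return " ".join(tokens)


def normalize_spelling(text: str) -> str:
    """Replace known dialectal variants with their standard spelling."""
    return " ".join(SPELLING_MAP.get(tok, tok) for tok in text.split())


def postprocess(text: str) -> str:
    return normalize_spelling(collapse_repetitions(text))


print(postprocess(" ".join(["आ"] * 100)))  # आ
print(postprocess("इदर आओ वगैरा"))          # इधर आओ वगैरह
```

This sketch only catches single-token loops; a repeated multi-word phrase would need n-gram run detection.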

🔍 Error Analysis (25 sampled validation errors)

Error Type	Count	%
Phonetic Confusion	10	40%
Spelling Variation	7	28%
English Loanword Error	4	16%
Filler Word Confusion	3	12%
Hallucination / Repetition	1	4%

⚖️ Evaluation: Lattice-Based WER

A multi-alternative, bin-based lattice was implemented for evaluation: each position accepts all valid alternatives (numeric, synonymous, dialectal), so a hypothesis is not penalized for choosing a different but valid form. This makes the WER fairer.
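
One way to picture the idea is the sketch below, which is not the project's code. It assumes each bin is a set of single-token alternatives and reuses the usual edit-distance recurrence, with a match whenever the hypothesis word is in the bin. The name `lattice_wer` and the example bins are hypothetical, and multi-word alternatives (a numeral written as two words, for instance) are not handled.

```python
def lattice_wer(lattice: list[set[str]], hypothesis: str) -> float:
    """WER where each reference position accepts any word in its bin."""
    hyp = hypothesis.split()
    if not lattice:
        raise ValueError("lattice must contain at least one bin")
    # Same dynamic program as plain WER; only the match test changes.
    prev = list(range(len(hyp) + 1))
    for i, alternatives in enumerate(lattice, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if hyp_word in alternatives else 1
            curr.append(min(
                prev[j] + 1,         # deletion (bin left unmatched)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(lattice)


# Illustrative bins: a numeric alternative and a spelling variant.
lattice = [
    {"मेरे"},
    {"पास"},
    {"5", "पाँच", "पांच"},
    {"किताबें", "किताबे"},
    {"हैं"},
]

print(lattice_wer(lattice, "मेरे पास पांच किताबें हैं"))  # 0.0
print(lattice_wer(lattice, "मेरे पास छह किताबें हैं"))    # 0.2
```

Against a single reference string with the digit form, the first hypothesis would be charged one substitution even though it is a valid transcription.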

⚠️ Limitations

Optimized for conversational Hindi; may underperform on formal/broadcast speech
English loanword transcription remains a known weak point
Model weights not released due to proprietary training data (Josh Talks internal dataset)

🔗 Links

💻 GitHub: Daksh159/VaaniAI
📓 Kaggle Notebook: josh-talks-q1-preprocessing