---
language:
  - hi
license: apache-2.0
base_model: openai/whisper-small
tags:
  - automatic-speech-recognition
  - hindi
  - whisper
  - fine-tuned
  - conversational-speech
metrics:
  - wer
pipeline_tag: automatic-speech-recognition
---

🎙️ VaaniAI: Whisper-small Fine-tuned on Hindi Conversational Speech

VaaniAI is a fine-tuned version of openai/whisper-small, trained on real-world Hindi conversational audio from 102 speakers across India. It was built as part of an AI Researcher Intern assignment at Josh Talks.


📊 Model Performance

| Metric | Value |
| --- | --- |
| Baseline WER (Whisper-small) | 1.2537 |
| Fine-tuned WER | 0.4028 |
| WER improvement | ↓ 67.9% (relative) |
| Post-processing WER gain | ↓ additional 27.7% |
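
The figures above can be sanity-checked with a few lines of Python. The sketch below is a generic word-level WER (edit distance over whitespace tokens divided by reference length), not the evaluation script used for this model; the function names `wer` and `relative_reduction` are illustrative. It also shows why a baseline WER above 1.0 is possible: insertions count as errors, so a hypothesis much longer than its reference can accumulate more errors than there are reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction, as a percentage of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Insertions can push WER above 1.0: 2 reference words, 3 extra hypothesis words.
print(wer("a b", "a b c c c"))                       # 1.5
print(round(relative_reduction(1.2537, 0.4028), 2))  # 67.87
```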

🗂️ Training Data

| Property | Value |
| --- | --- |
| Total audio | 11.44 hours |
| Speakers | 102 unique speakers across India |
| Segments (after cleaning) | 4,442 |
| Raw segments | 5,941 |
| Train / Val split | 4,093 / 349 |

Cleaning steps applied:

  • Removed 209 REDACTED-label segments
  • Removed 1,012 sub-1-second clips
  • Removed 878 segments with fewer than 5 characters
  • Resampled all audio from 44,100 Hz → 16,000 Hz
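
The filtering rules above could be expressed roughly as follows. This is a sketch, not the actual preprocessing script: the `Segment` fields, the `REDACTED` check on the transcript text, and the function names are assumptions made for illustration, and resampling to 16 kHz is left out because it needs an audio library. Counting hits per rule also shows that one clip can trip several rules at once. Overlap of that kind would explain why the per-rule counts above (2,099 in total) exceed the net drop of 1,499 segments (5,941 - 4,442).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical record layout; the real dataset schema is not documented here.
    text: str
    duration_s: float


MIN_DURATION_S = 1.0  # drop sub-1-second clips
MIN_CHARS = 5         # drop transcripts with fewer than 5 characters


def failed_rules(seg: Segment) -> list[str]:
    """Return the name of every cleaning rule this segment violates."""
    failed = []
    if "REDACTED" in seg.text:  # assumed way of spotting REDACTED labels
        failed.append("redacted")
    if seg.duration_s < MIN_DURATION_S:
        failed.append("too_short")
    if len(seg.text.strip()) < MIN_CHARS:
        failed.append("too_few_chars")
    return failed


def clean(segments: list[Segment]) -> tuple[list[Segment], Counter]:
    """Keep segments that pass every rule; count rule hits (hits may overlap)."""
    kept, hits = [], Counter()
    for seg in segments:
        failed = failed_rules(seg)
        hits.update(failed)
        if not failed:
            kept.append(seg)
    return kept, hits


segments = [
    Segment("हाँ", 0.4),                   # too short AND too few characters
    Segment("REDACTED", 2.0),              # redacted label
    Segment("आज मौसम बहुत अच्छा है", 3.1),  # passes every rule
]
kept, hits = clean(segments)
print(len(kept), dict(hits))
```

Here two segments are dropped but three rule hits are recorded, because the first clip fails two rules.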

⚙️ Training Configuration

| Hyperparameter | Value |
| --- | --- |
| Base model | openai/whisper-small (241.7M params) |
| Learning rate | 1e-5 |
| Effective batch size | 32 (batch 4 × grad accum 8) |
| Epochs | 3 |
| Precision | FP16 |
| Hardware | Kaggle T4 GPU (14.6 GB) |
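
As a quick check on these numbers, the effective batch size and the approximate number of optimizer updates follow from the values in the table and the 4,093-segment training split. The helper below is plain arithmetic, assuming a single GPU and a final partial batch that still counts as one update; the trainer actually used may round differently.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
EPOCHS = 3
TRAIN_SEGMENTS = 4_093  # training split size reported under Training Data


def effective_batch_size(per_device: int, accum: int, n_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accum * n_devices


def updates_per_epoch(n_examples: int, batch: int) -> int:
    """Optimizer updates per epoch, counting a final partial batch as one update."""
    return math.ceil(n_examples / batch)


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS)
per_epoch = updates_per_epoch(TRAIN_SEGMENTS, batch)
print(batch, per_epoch, per_epoch * EPOCHS)  # 32 128 384
```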

Training loss progression:

| Epoch | Train Loss | Val Loss | WER |
| --- | --- | --- | --- |
| 1 | 13.22 | 0.657 | 0.546 |
| 2 | 6.98 | 0.471 | 0.435 |
| 3 | 5.07 | 0.414 | 0.403 |

🧹 Post-processing Pipeline

1. Repetition Loop Detection: collapses tokens repeated 4+ times, targeting hallucination loops on noisy audio. Example: आ आ आ... (100x) → आ

2. Spelling Normalization Dictionary: maps common dialectal Hindi variants to standard spellings. Example: वगैरा → वगैरह, इदर → इधर
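
A minimal sketch of the two steps, assuming whitespace-tokenized transcripts. The threshold of 4 and the two dictionary entries come from the descriptions above; the function names and the rest of the logic are illustrative and may differ from the actual pipeline, which presumably uses a larger dictionary.

```python
from itertools import groupby

MIN_REPEATS = 4  # runs of this length or longer are treated as a hallucination loop

# Only the two examples given above; a real dictionary would hold many more entries.
SPELLING_MAP = {
    "वगैरा": "वगैरह",
    "इदर": "इधर",
}


def collapse_repetitions(text: str, min_repeats: int = MIN_REPEATS) -> str:
    """Collapse any run of >= min_repeats identical consecutive tokens to one token."""
    out = []
    for token, run in groupby(text.split()):
        count = sum(1 for _ in run)
        out.extend([token] if count >= min_repeats else [token] * count)
    return " ".join(out)


def normalize_spelling(text: str, mapping: dict[str, str] = SPELLING_MAP) -> str:
    """Replace whole tokens found in the mapping; leave everything else unchanged."""
    return " ".join(mapping.get(token, token) for token in text.split())


def postprocess(text: str) -> str:
    """Apply repetition collapsing first, then spelling normalization."""
    return normalize_spelling(collapse_repetitions(text))


print(postprocess(" ".join(["आ"] * 100)))  # आ
print(postprocess("इदर आओ"))               # इधर आओ
print(postprocess("हाँ हाँ ठीक है"))         # unchanged: only 2 repeats
```

Collapsing only long runs keeps natural repetitions such as a doubled "हाँ हाँ" intact, since short repeats are common in conversational speech.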


🔍 Error Analysis (25 sampled validation errors)

| Error Type | Count | % |
| --- | --- | --- |
| Phonetic Confusion | 10 | 40% |
| Spelling Variation | 7 | 28% |
| English Loanword Error | 4 | 16% |
| Filler Word Confusion | 3 | 12% |
| Hallucination / Repetition | 1 | 4% |

⚖️ Evaluation: Lattice-Based WER

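For fairer evaluation, WER is also computed against a multi-alternative, bin-based lattice. Each reference position is a bin that accepts every valid alternative (numeric forms, synonyms, dialectal variants), so a transcription is not penalized for choosing an acceptable variant that differs from the single reference spelling.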
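
One way to realize such a lattice is to represent the reference as a list of bins, each a set of acceptable words, and run the usual edit-distance recurrence with set membership in place of equality. The sketch below follows that idea. The bin contents are made-up examples, and the actual implementation (for instance, how multi-word alternatives or optional bins are handled) is not described here and may differ.

```python
def lattice_wer(reference_bins: list[set[str]], hypothesis: str) -> float:
    """WER where each reference position accepts any word in its bin."""
    hyp = hypothesis.split()
    if not reference_bins:
        raise ValueError("reference must contain at least one bin")
    # prev[j] = edit distance between the bins seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, alternatives in enumerate(reference_bins, start=1):
        curr = [i] + [0] * len(hyp)
        for j, word in enumerate(hyp, start=1):
            mismatch = word not in alternatives  # any alternative counts as a match
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + mismatch,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(reference_bins)


# Hypothetical bins: numeric vs spelled-out form, and a dialectal spelling variant.
reference = [{"मुझे"}, {"दो", "2"}, {"किताबें"}, {"वगैरह", "वगैरा"}, {"चाहिए"}]

print(lattice_wer(reference, "मुझे 2 किताबें वगैरा चाहिए"))  # 0.0
print(lattice_wer(reference, "मुझे तीन किताबें चाहिए"))      # 0.4
```

With single-word bins this reduces to ordinary WER, so the lattice score can never be worse than the plain score against any one of the accepted references.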


⚠️ Limitations

  • Optimized for conversational Hindi; may underperform on formal/broadcast speech
  • English loanword transcription remains a known weak point
  • Model weights not released due to proprietary training data (Josh Talks internal dataset)

🔗 Links