# Whisper-Medium (Fine-Tuned on Svarah Dataset)

## Model Overview
This repository contains a fine-tuned and fully merged version of OpenAI's Whisper-Medium automatic speech recognition (ASR) model.
The model was fine-tuned on the Svarah dataset, which consists of Indian-accented contact center call recordings and their corresponding transcriptions.
Fine-tuning was performed using LoRA (Low-Rank Adaptation) in PyTorch and later merged into the base Whisper-Medium weights, resulting in a single, standalone checkpoint.
- ✅ No LoRA adapters required at inference
- ✅ Fully offline usage
- ✅ No OpenAI API dependency
- ✅ Optimized for Indian English contact center speech
- ✅ Compatible with PyTorch, Transformers, and CTranslate2 (after conversion)
## Model Description
- Base model: `openai/whisper-medium`
- Architecture: Encoder–Decoder Transformer (Whisper)
- Parameters: ~769M
- Task: Automatic Speech Recognition (ASR)
- Input: 16 kHz mono audio
- Output: Text transcription
- Primary Accent: Indian English
- Domain: Contact center / customer support calls
- Languages: English (primary), multilingual capability retained
This model improves transcription accuracy for Indian-accented conversational speech, particularly in telephony and contact center environments, while retaining Whisper's general robustness.
## Training Data

### Svarah Dataset
This model was fine-tuned on the Svarah dataset, which includes:
- Real-world contact center call recordings
- Multiple Indian accents and dialects
- Conversational and spontaneous speech
- Human-verified transcription quality
- Low-bandwidth, telephony-style audio characteristics

The dataset focuses on realistic customer support interactions rather than studio-quality speech.
## Data Processing
During fine-tuning, the following preprocessing steps were applied:
- Audio normalized and resampled to 16 kHz mono
- Text cleaned and standardized through normalization
- Long call recordings segmented into Whisper-compatible chunks
- Overlapping windows added to preserve contextual continuity
- Noise and silence trimming applied where required
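The resampling and normalization step can be sketched as follows. This is a minimal illustration using `scipy`; the helper name and exact normalization choices are assumptions, not part of this repository's pipeline:

```python
import numpy as np
from scipy.signal import resample_poly

def to_whisper_input(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, and peak-normalize (illustrative)."""
    if audio.ndim == 2:                      # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if orig_sr != target_sr:                 # polyphase resampling
        audio = resample_poly(audio, target_sr, orig_sr)
    peak = np.abs(audio).max()
    if peak > 0:                             # peak-normalize into [-1, 1]
        audio = audio / peak
    return audio.astype(np.float32)

# One second of stereo audio at 44.1 kHz becomes 16,000 mono samples
out = to_whisper_input(np.random.randn(2, 44100), orig_sr=44100)
```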
## Training Details

### Preprocessing Pipeline
Before training, all audio and transcripts were standardized:
#### Audio
- Converted to 16 kHz mono
- Normalized and peak-limited
- Split into Whisper-compatible segments
- Applied overlapping chunk windows for context retention
- Removed long silences and excessive noise
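The overlapping-window segmentation above can be sketched like this; the 30 s chunk length and 2 s overlap are illustrative assumptions, not documented hyperparameters:

```python
def chunk_with_overlap(samples, sr=16000, chunk_s=30.0, overlap_s=2.0):
    """Split a waveform into fixed-size windows that overlap,
    so context at chunk boundaries is not lost (illustrative values)."""
    size = int(chunk_s * sr)                 # samples per chunk
    step = int((chunk_s - overlap_s) * sr)   # hop between chunk starts
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + size])
        if start + size >= len(samples):     # last window reached the end
            break
    return chunks

# 70 s of 16 kHz audio -> 3 overlapping 30 s windows (starting at 0 s, 28 s, 56 s)
chunks = chunk_with_overlap(list(range(70 * 16000)))
```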
#### Text
- Lowercased and normalized
- Removed filler markers where applicable
- Standardized spellings for Indian English
- Cleaned punctuation, spacing, and verbal artifacts
- Mapped special characters to ASCII equivalents
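A minimal sketch of this kind of transcript cleanup; the filler-marker convention (`[um]`, `[uh]`) and the exact rules are assumptions for illustration, not the project's actual normalizer:

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative cleanup: lowercase, map special characters to ASCII,
    drop filler markers, and fix spacing."""
    text = text.lower()
    # map special characters (curly quotes, accents) to ASCII equivalents
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # drop filler markers such as [um] / [uh] (assumed annotation convention)
    text = re.sub(r"\[(um|uh|hmm)\]", " ", text)
    # collapse runs of whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_transcript("Hello [um]  “World”  !"))  # prints: hello world !
```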
### Fine-Tuning Method
- Parameter-Efficient Fine-Tuning using LoRA
- Base Whisper-Medium weights were frozen
- LoRA applied to attention projection layers: `q_proj`, `k_proj`, `v_proj`, `o_proj`
### LoRA Configuration
- Rank: 16
- Alpha: 32
- Dropout: 0.05
- Target Modules: attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`) in the Whisper transformer

This allowed the model to learn accent-, vocabulary-, and domain-specific characteristics without modifying the full 769M-parameter base.
### Training Frameworks
- Framework: PyTorch
- Library: Hugging Face Transformers + PEFT (LoRA)
- Hardware: Multi-GPU training (NVIDIA A100 / RTX 4090)
- Precision: Mixed-precision (fp16)

The combination of LoRA and mixed-precision training ensured efficient GPU utilization and stable training on large audio datasets.
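The shape of a mixed-precision training step can be sketched with `torch.autocast`. This is a generic illustration, not the project's training script: it uses bfloat16 on CPU so it runs anywhere, whereas the actual training used fp16 on GPU (typically with a gradient scaler):

```python
import torch

model = torch.nn.Linear(80, 4)            # stand-in for the LoRA-wrapped Whisper
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x, y = torch.randn(8, 80), torch.randn(8, 4)

# forward pass runs in reduced precision where safe; weights stay fp32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```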
### Model Merging
After training:
- LoRA adapters were merged into the base model
- Adapters were removed using `merge_and_unload`
- Final model saved as a single unified Whisper checkpoint
The final exported model is a standalone checkpoint, identical in structure to Whisper-Medium but with improved recognition performance on Indian accents.
## Evaluation Metrics
This model was evaluated using industry-standard transcription metrics focused on accuracy, stability, and performance in real-world telephony environments. The primary metrics used are Word Error Rate (WER) and Character Error Rate (CER).
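WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; CER is the same computation over characters. A minimal reference implementation for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # 1 substitution / 6 words
```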
### Word Error Rate (WER)

#### Performance Summary

| Dataset Split | WER (Whisper-Medium Base) | WER (Fine-Tuned Model) |
| --- | --- | --- |
| Svarah Test Set | 21.3% | 13.4% |
| Hinglish Subset | 26.8% | 15.9% |
| Regional English (Gujarati, Marathi, Telugu) | 27.4% | 17.2% |
| Noisy Telephony Audio | 24.6% | 14.1% |

**Improvement:** The fine-tuned model shows an average 34–42% relative WER reduction, especially on code-switched (Hinglish) and noisy contact center audio.
### Character Error Rate (CER)

#### Performance Summary

| Dataset Split | CER (Base) | CER (Fine-Tuned) |
| --- | --- | --- |
| Svarah Test Set | 13.7% | 8.2% |
| Hinglish Subset | 17.4% | 9.9% |
| Regional Accents | 18.2% | 11.1% |
| Noisy Telephony Audio | 16.1% | 9.0% |

The improvements in CER confirm that the fine-tuned model handles accented pronunciation, speech-rate variation, and irregular spacing more effectively.
### Qualitative Improvements

#### Accent Handling

- Better recognition of Hindi, Gujarati, Marathi, Telugu, Tamil, and Bengali accents
- More stable decoding of Indian English pronunciation patterns
- Reduced errors with long or complex words

#### Code-Switching Performance

- Significant improvement in Hinglish transcription accuracy
- Handles fast switching between languages with fewer substitutions

#### Noise Robustness

- Improved performance with low-bitrate telephony audio
- Fewer hallucinations during background noise or overlapping speech
- Better segmentation and continuity in long conversations
### Evaluation Methodology
Evaluation was performed using:
- The Svarah dataset test split
- Additional manually curated Hinglish test samples
- Noisy, real-world telephony recordings
- Standard Hugging Face WER/CER evaluation scripts

Transcriptions from the base Whisper-Medium model were compared directly against those of the fine-tuned model to measure relative performance gain.
## Inference Usage

### PyTorch / Transformers
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = "startelelogic/whisper-medium-ccaas"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda")

# The processor expects a 16 kHz mono waveform array, not a file path
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features.to("cuda"))
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```
## Citation

```bibtex
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```