Whisper-Medium (Fine-Tuned on Svarah Dataset)

πŸ“˜ Model Overview

This repository contains a fine-tuned and fully merged version of OpenAI’s Whisper-Medium automatic speech recognition (ASR) model.

The model was fine-tuned on the Svarah dataset, which consists of Indian-accented contact center call recordings and their corresponding transcriptions.
Fine-tuning was performed using LoRA (Low-Rank Adaptation) in PyTorch and later merged into the base Whisper-Medium weights, resulting in a single, standalone checkpoint.

βœ… No LoRA adapters required at inference
βœ… Fully offline usage
βœ… No OpenAI API dependency
βœ… Optimized for Indian English contact center speech
βœ… Compatible with PyTorch, Transformers, and CTranslate2 (after conversion)


🧠 Model Description

  • Base model: openai/whisper-medium
  • Architecture: Encoder–Decoder Transformer (Whisper)
  • Parameters: ~769M
  • Task: Automatic Speech Recognition (ASR)
  • Input: 16 kHz mono audio
  • Output: Text transcription
  • Primary Accent: Indian English
  • Domain: Contact center / customer support calls
  • Languages: English (primary), multilingual capability retained

This model improves transcription accuracy for Indian-accented conversational speech, particularly in telephony and contact center environments, while retaining Whisper’s general robustness.


🎧 Training Data

Svarah Dataset

This model was fine-tuned on the Svarah dataset, which includes:

  • Real-world contact center call recordings
  • Multiple Indian accents and dialects
  • Conversational and spontaneous speech
  • Human-verified transcription quality
  • Low-bandwidth, telephony-style audio characteristics

The dataset focuses on realistic customer support interactions rather than studio-quality speech.

βš™οΈ Training Details

Preprocessing Pipeline

Before training, all audio and transcripts were standardized:

Audio

  • Converted to 16 kHz mono
  • Normalized and peak-limited
  • Split into Whisper-compatible segments
  • Chunked with overlapping windows for context retention
  • Trimmed of long silences and excessive noise

Text

  • Lowercased and normalized
  • Removed filler markers where applicable
  • Standardized spellings for Indian English
  • Cleaned punctuation, spacing, and verbal artifacts
  • Mapped special characters to ASCII equivalents
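
A minimal sketch of this preprocessing, assuming librosa for audio handling and simple regex-based text cleanup (the exact pipeline scripts are not part of this repo, and the filler-tag markup is hypothetical):

import re
import librosa

def preprocess_audio(path):
    # Load as 16 kHz mono and trim leading/trailing silence.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # Peak-normalize to keep amplitudes in [-1, 1].
    peak = abs(audio).max()
    return (audio / peak if peak > 0 else audio), sr

def preprocess_text(text):
    # Lowercase, drop filler tags (hypothetical markup), collapse whitespace.
    text = text.lower()
    text = re.sub(r"\[(?:uh|um|noise)\]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text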

Fine-Tuning Method

  • Parameter-Efficient Fine-Tuning using LoRA
  • Base Whisper-Medium weights were frozen
  • LoRA applied to attention projection layers:
    • q_proj, k_proj, v_proj, o_proj

LoRA Configuration

  • Rank: 16
  • Alpha: 32
  • Dropout: 0.05
  • Target modules: attention projection layers (q_proj, k_proj, v_proj, o_proj)

This configuration allowed the model to learn accent-, vocabulary-, and domain-specific characteristics without modifying the full 769M-parameter base.
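
A minimal sketch of this setup with Hugging Face PEFT (the values mirror the configuration above; the full training script is not included here):

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the frozen Whisper-Medium base model.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

lora_config = LoraConfig(
    r=16,                  # rank
    lora_alpha=32,         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable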

Training Frameworks

  • Framework: PyTorch
  • Library: Hugging Face Transformers + PEFT (LoRA)
  • Hardware: Multi-GPU training (NVIDIA A100 / RTX 4090)
  • Precision: Mixed-precision (fp16)

The combination of LoRA and mixed-precision training ensured efficient GPU utilization and stable training on large audio datasets.
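
As an illustration, a Seq2SeqTrainingArguments setup along these lines enables fp16 mixed precision (the hyperparameters shown are placeholders, not the values used for this checkpoint):

from transformers import Seq2SeqTrainingArguments

# Hypothetical arguments; batch size, learning rate, and epoch count
# are illustrative placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-svarah-lora",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,                  # mixed-precision training
    predict_with_generate=True,
)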

Model Merging

After training:

  • LoRA adapters were merged into the base weights and unloaded via merge_and_unload
  • Final model saved as a single unified Whisper checkpoint

The final exported model is a standalone checkpoint, identical in structure to Whisper-Medium but with improved recognition performance on Indian accents.
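
A minimal sketch of this merge step with Hugging Face PEFT (the adapter directory name is a placeholder):

from transformers import WhisperForConditionalGeneration
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# "lora-adapters/" is a placeholder for the trained adapter directory.
model = PeftModel.from_pretrained(base, "lora-adapters/")

# Fold the LoRA weights into the base weights and drop the adapter layers.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-medium-ccaas")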

πŸ“Š Evaluation Metrics

This model was evaluated using industry-standard transcription metrics focused on accuracy, stability, and performance in real-world telephony environments. The primary metrics used are Word Error Rate (WER) and Character Error Rate (CER).
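
For reference, WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to align a hypothesis with a reference transcript of N words; CER is the same ratio computed over characters:

WER = (S + D + I) / N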

  1. Word Error Rate (WER)

    Performance Summary

    | Dataset Split | WER (Whisper-Medium Base) | WER (Fine-Tuned Model) |
    |---|---|---|
    | Svarah Test Set | 21.3% | 13.4% |
    | Hinglish Subset | 26.8% | 15.9% |
    | Regional English (Gujarati, Marathi, Telugu) | 27.4% | 17.2% |
    | Noisy Telephony Audio | 24.6% | 14.1% |

    Improvement: The fine-tuned model shows a 37–43% relative WER reduction across splits, with the largest gains on code-switched (Hinglish) and noisy contact center audio.

  2. Character Error Rate (CER)

    Performance Summary

    | Dataset Split | CER (Base) | CER (Fine-Tuned) |
    |---|---|---|
    | Svarah Test Set | 13.7% | 8.2% |
    | Hinglish Subset | 17.4% | 9.9% |
    | Regional Accents | 18.2% | 11.1% |
    | Noisy Telephony Audio | 16.1% | 9.0% |

    The improvements in CER confirm that the fine-tuned model handles accented pronunciation, speech rate variation, and irregular spacing more effectively.

  3. Qualitative Improvements

    Accent Handling

    • Better recognition of Hindi, Gujarati, Marathi, Telugu, Tamil, and Bengali accents
    • More stable decoding of Indian English pronunciation patterns
    • Reduced errors with long or complex words

    Code-Switching Performance

    • Significant improvement in Hinglish transcription accuracy
    • Handles fast switching between languages with fewer substitutions

    Noise Robustness

    • Improved performance with low-bitrate telephony audio
    • Fewer hallucinations during background noise or overlaps
    • Better segmentation and continuity in long conversations
  4. Evaluation Methodology

    Evaluation was performed using:

    • The Svarah dataset test split
    • Additional manually curated Hinglish test samples
    • Noisy, real-world telephony recordings
    • Standard Hugging Face WER/CER evaluation scripts

    Transcriptions from the base Whisper-Medium model were compared directly against the fine-tuned model to measure relative performance gain.
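
    A minimal sketch of this comparison using the Hugging Face evaluate library (the transcript lists are placeholders; in practice they come from decoding the test split with both models):

    import evaluate

    references = ["please confirm your order number"]    # placeholder data
    base_preds = ["please confirm your order numbers"]
    tuned_preds = ["please confirm your order number"]

    wer = evaluate.load("wer")
    cer = evaluate.load("cer")

    print("base WER:", wer.compute(predictions=base_preds, references=references))
    print("tuned WER:", wer.compute(predictions=tuned_preds, references=references))
    print("tuned CER:", cer.compute(predictions=tuned_preds, references=references))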

πŸš€ Inference Usage

PyTorch / Transformers

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = "startelelogic/whisper-medium-ccaas"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda")

# The processor expects a raw waveform, so load and resample to 16 kHz mono first.
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features.to("cuda"))

text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
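
For the CTranslate2 compatibility mentioned above, here is a sketch of the conversion step, assuming the ctranslate2 Python converter; the output directory and int8 quantization are illustrative choices:

from ctranslate2.converters import TransformersConverter

# Convert the merged checkpoint to CTranslate2 format.
# Output directory and quantization level are illustrative, not prescribed.
converter = TransformersConverter("startelelogic/whisper-medium-ccaas")
converter.convert("whisper-medium-ccaas-ct2", quantization="int8")

The converted directory can then be loaded by CTranslate2-based runtimes such as faster-whisper.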

πŸ“„ Citation

@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
