Whisper-Medium (Fine-Tuned on Svarah Dataset)

πŸ“˜ Model Overview

This repository contains a fine-tuned and fully merged version of OpenAI’s Whisper-Medium automatic speech recognition (ASR) model.

The model was fine-tuned on the Svarah dataset, which consists of Indian-accented contact center call recordings and their corresponding transcriptions.
Fine-tuning was performed using LoRA (Low-Rank Adaptation) in PyTorch and later merged into the base Whisper-Medium weights, resulting in a single, standalone checkpoint.

βœ… No LoRA adapters required at inference
βœ… Fully offline usage
βœ… No OpenAI API dependency
βœ… Optimized for Indian English contact center speech
βœ… Compatible with PyTorch, Transformers, and CTranslate2 (after conversion)


🧠 Model Description

  • Base model: openai/whisper-medium
  • Architecture: Encoder–Decoder Transformer (Whisper)
  • Parameters: ~769M
  • Task: Automatic Speech Recognition (ASR)
  • Input: 16 kHz mono audio
  • Output: Text transcription
  • Primary Accent: Indian English
  • Domain: Contact center / customer support calls
  • Languages: English (primary), multilingual capability retained

This model improves transcription accuracy for Indian-accented conversational speech, particularly in telephony and contact center environments, while retaining Whisper’s general robustness.


🎧 Training Data

Svarah Dataset

This model was fine-tuned on the Svarah dataset, which includes:

  • Real-world contact center call recordings
  • Multiple Indian accents and dialects
  • Conversational and spontaneous speech
  • Human-verified transcription quality
  • Low-bandwidth, telephony-style audio characteristics

The dataset focuses on realistic customer support interactions rather than studio-quality speech.

βš™οΈ Training Details

Preprocessing Pipeline

Before training, all audio and transcripts were standardized:

Audio

  • Converted to 16 kHz mono
  • Normalized and peak-limited
  • Split into Whisper-compatible segments
  • Chunked with overlapping windows for context retention
  • Trimmed of long silences and excessive noise

Text

  • Lowercased and normalized
  • Removed filler markers where applicable
  • Standardized spellings for Indian English
  • Cleaned punctuation, spacing, and verbal artifacts
  • Mapped special characters to ASCII equivalents
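
A minimal sketch of this preprocessing, assuming librosa for audio handling and simple regex-based text cleanup (the exact pipeline scripts are not part of this repo, and the filler-tag markup is hypothetical):

import re
import librosa

def preprocess_audio(path):
    # Load as 16 kHz mono and trim leading/trailing silence.
    audio, sr = librosa.load(path, sr=16000, mono=True)
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # Peak-normalize to keep amplitudes in [-1, 1].
    peak = abs(audio).max()
    return (audio / peak if peak > 0 else audio), sr

def preprocess_text(text):
    # Lowercase, drop filler tags (hypothetical markup), collapse whitespace.
    text = text.lower()
    text = re.sub(r"\[(?:uh|um|noise)\]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text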

Fine-Tuning Method

  • Parameter-Efficient Fine-Tuning using LoRA
  • Base Whisper-Medium weights were frozen
  • LoRA applied to attention projection layers:
    • q_proj, k_proj, v_proj, o_proj

LoRA Configuration

  • Rank: 16
  • Alpha: 32
  • Dropout: 0.05
  • Target modules: attention projection layers (q_proj, k_proj, v_proj, o_proj)

This configuration allowed the model to learn accent-, vocabulary-, and domain-specific characteristics without modifying the full 769M-parameter base.
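
A minimal sketch of this setup with Hugging Face PEFT (the values mirror the configuration above; the full training script is not included here):

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the frozen Whisper-Medium base model.
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

lora_config = LoraConfig(
    r=16,                  # rank
    lora_alpha=32,         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable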

Training Frameworks

  • Framework: PyTorch
  • Library: Hugging Face Transformers + PEFT (LoRA)
  • Hardware: Multi-GPU training (NVIDIA A100 / RTX 4090)
  • Precision: Mixed-precision (fp16)

The combination of LoRA and mixed-precision training ensured efficient GPU utilization and stable training on large audio datasets.
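
As an illustration, a Seq2SeqTrainingArguments setup along these lines enables fp16 mixed precision (the hyperparameters shown are placeholders, not the values used for this checkpoint):

from transformers import Seq2SeqTrainingArguments

# Hypothetical arguments; batch size, learning rate, and epoch count
# are illustrative placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-svarah-lora",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    fp16=True,                  # mixed-precision training
    predict_with_generate=True,
)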

Model Merging

After training:

  • LoRA adapters were merged into the base weights and unloaded via merge_and_unload
  • Final model saved as a single unified Whisper checkpoint

The final exported model is a standalone checkpoint, identical in structure to Whisper-Medium but with improved recognition performance on Indian accents.
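
A minimal sketch of this merge step with Hugging Face PEFT (the adapter directory name is a placeholder):

from transformers import WhisperForConditionalGeneration
from peft import PeftModel

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# "lora-adapters/" is a placeholder for the trained adapter directory.
model = PeftModel.from_pretrained(base, "lora-adapters/")

# Fold the LoRA weights into the base weights and drop the adapter layers.
merged = model.merge_and_unload()
merged.save_pretrained("whisper-medium-ccaas")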

πŸ“Š Evaluation Metrics

This model was evaluated using industry-standard transcription metrics focused on accuracy, stability, and performance in real-world telephony environments. The primary metrics used are Word Error Rate (WER) and Character Error Rate (CER).
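
For reference, WER counts the word-level substitutions (S), deletions (D), and insertions (I) needed to align a hypothesis with a reference transcript of N words; CER is the same ratio computed over characters:

WER = (S + D + I) / N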

  1. Word Error Rate (WER)

    Performance Summary

    | Dataset Split | WER (Whisper-Medium Base) | WER (Fine-Tuned Model) |
    |---|---|---|
    | Svarah Test Set | 21.3% | 13.4% |
    | Hinglish Subset | 26.8% | 15.9% |
    | Regional English (Gujarati, Marathi, Telugu) | 27.4% | 17.2% |
    | Noisy Telephony Audio | 24.6% | 14.1% |

    Improvement: The fine-tuned model shows a 37–43% relative WER reduction across splits, with the largest gains on code-switched (Hinglish) and noisy contact center audio.

  2. Character Error Rate (CER)

    Performance Summary

    | Dataset Split | CER (Base) | CER (Fine-Tuned) |
    |---|---|---|
    | Svarah Test Set | 13.7% | 8.2% |
    | Hinglish Subset | 17.4% | 9.9% |
    | Regional Accents | 18.2% | 11.1% |
    | Noisy Telephony Audio | 16.1% | 9.0% |

    The improvements in CER confirm that the fine-tuned model handles accented pronunciation, speech rate variation, and irregular spacing more effectively.

  3. Qualitative Improvements

    Accent Handling

    • Better recognition of Hindi, Gujarati, Marathi, Telugu, Tamil, and Bengali accents
    • More stable decoding of Indian English pronunciation patterns
    • Reduced errors with long or complex words

    Code-Switching Performance

    • Significant improvement in Hinglish transcription accuracy
    • Handles fast switching between languages with fewer substitutions

    Noise Robustness

    • Improved performance with low-bitrate telephony audio
    • Fewer hallucinations during background noise or overlaps
    • Better segmentation and continuity in long conversations
  4. Evaluation Methodology

    Evaluation was performed using:

    • The Svarah dataset test split
    • Additional manually curated Hinglish test samples
    • Noisy, real-world telephony recordings
    • Standard Hugging Face WER/CER evaluation scripts

    Transcriptions from the base Whisper-Medium model were compared directly against the fine-tuned model to measure relative performance gain.
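
    A minimal sketch of this comparison using the Hugging Face evaluate library (the transcript lists are placeholders; in practice they come from decoding the test split with both models):

    import evaluate

    references = ["please confirm your order number"]    # placeholder data
    base_preds = ["please confirm your order numbers"]
    tuned_preds = ["please confirm your order number"]

    wer = evaluate.load("wer")
    cer = evaluate.load("cer")

    print("base WER:", wer.compute(predictions=base_preds, references=references))
    print("tuned WER:", wer.compute(predictions=tuned_preds, references=references))
    print("tuned CER:", cer.compute(predictions=tuned_preds, references=references))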

πŸš€ Inference Usage

PyTorch / Transformers

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_name = "startelelogic/whisper-medium-ccaas"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name).to("cuda")

# The processor expects a raw waveform, so load and resample to 16 kHz mono first.
audio, _ = librosa.load("audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features.to("cuda"))

text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
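
For the CTranslate2 compatibility mentioned above, here is a sketch of the conversion step, assuming the ctranslate2 Python converter; the output directory and int8 quantization are illustrative choices:

from ctranslate2.converters import TransformersConverter

# Convert the merged checkpoint to CTranslate2 format.
# Output directory and quantization level are illustrative, not prescribed.
converter = TransformersConverter("startelelogic/whisper-medium-ccaas")
converter.convert("whisper-medium-ccaas-ct2", quantization="int8")

The converted directory can then be loaded by CTranslate2-based runtimes such as faster-whisper.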

πŸ“„ Citation

@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
