Model Card for whisper-large-v3-NSC

Model Details

Model Description

whisper-large-v3-NSC is a domain-adapted automatic speech recognition (ASR) model for conversational Singapore English (SgE).
It is a fine-tuned version of OpenAI’s Whisper Large-V3 model trained on aligned conversational speech from Part 3 of the National Speech Corpus (NSC). The model is optimized to improve the recoverability of interactional, morphosyntactic, and discourse-pragmatic features characteristic of contemporary spoken Singapore English.

The model was developed as part of a workflow for constructing YCSEP_v2 (YouTube Corpus of Singapore English Podcasts, version 2), demonstrating how targeted fine-tuning can improve linguistic fidelity in large-scale corpus creation.

  • Developed by: Steven Coats, Carmelo Alessandro Basile, Cameron Morin, Robert Fuchs
  • Funded by: EU NextGenerationEU / Research Council of Finland (grant 358720)
  • Model type: Automatic Speech Recognition (seq2seq Transformer)
  • Language(s): English (Singapore English)
  • License: MIT
  • Finetuned from model: openai/whisper-large-v3

Model Sources

  • Related publication: Coats et al. (2025), The YouTube Corpus of Singapore English Podcasts, English World-Wide
  • Corpus produced with the model: YCSEP_v2

Uses

Direct Use

The model is intended for:

  • Transcription of conversational Singapore English speech
  • Linguistic corpus construction and annotation workflows
  • Research in World Englishes, sociolinguistics, and interactional linguistics
  • Recovering discourse particles and local morphosyntax often missed by general ASR systems

Downstream Use

The model can be integrated into pipelines involving:

  • Speaker diarization (e.g., WhisperX + pyannote)
  • POS tagging and syntactic annotation (e.g., spaCy pipelines)
  • Construction-grammar-based corpus analysis
  • Speech-based sociolinguistic research

Out-of-Scope Use

This model is not optimized for:

  • Standard American/British broadcast speech
  • Multilingual ASR outside the Singapore English domain
  • Real-time or low-latency applications
  • Legal, medical, or safety-critical transcription

Bias, Risks, and Limitations

  • The model is trained narrowly on conversational Singapore English and may underperform on other English varieties.
  • Training data reflects the demographic distribution of NSC Part 3 and is not fully balanced sociolinguistically.
  • Domain specialization may bias outputs toward informal conversational registers.
  • The model is designed for research and corpus-building, not general-purpose ASR deployment.

Recommendations

Users should evaluate performance carefully before applying the model outside conversational Singapore English contexts.


How to Get Started with the Model

```python
from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

# Download the fine-tuned checkpoint and load it with faster-whisper
model_dir = snapshot_download("stcoats/whisper-large-v3-NSC")
model = WhisperModel(model_dir, device="cuda", compute_type="float16")

# Transcribe an audio file (replace "audio.wav" with your own recording);
# segments is a generator of timestamped segments
segments, info = model.transcribe("audio.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```

Training Details

Training Data

Training data was derived from National Speech Corpus (NSC) Part 3, consisting of same-room conversational recordings between two speakers (friends, partners, or family members).

Processing steps included:

  • Alignment of WAV recordings with Praat TextGrid transcripts
  • Removal of markup and non-speech symbols
  • Merging adjacent utterances separated by ≤ 0.25 s pauses
  • Segmentation into 10–30 second conversational chunks
  • Export of aligned 16-bit PCM WAV segments with metadata (JSONL format)
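The merge-and-chunk steps above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the 0.25 s pause threshold and the 10–30 s chunk window come from the list above, while the field names (`start`, `end`, `text`) and the greedy packing strategy are assumptions.

```python
MAX_PAUSE = 0.25   # seconds: merge utterances separated by <= this pause
MIN_CHUNK = 10.0   # seconds: lower bound on chunk duration
MAX_CHUNK = 30.0   # seconds: upper bound on chunk duration

def merge_utterances(utts):
    """utts: time-ordered list of dicts with 'start', 'end', 'text' (seconds).
    Adjacent utterances separated by a short pause are fused into one."""
    merged = []
    for u in utts:
        if merged and u["start"] - merged[-1]["end"] <= MAX_PAUSE:
            merged[-1]["end"] = u["end"]
            merged[-1]["text"] += " " + u["text"]
        else:
            merged.append(dict(u))
    return merged

def chunk_utterances(utts):
    """Greedily pack merged utterances into conversational chunks,
    closing a chunk once it reaches MIN_CHUNK seconds and dropping
    any chunk that overshoots MAX_CHUNK."""
    chunks, current = [], []
    for u in utts:
        current.append(u)
        if current[-1]["end"] - current[0]["start"] >= MIN_CHUNK:
            chunks.append(current)
            current = []
    if current:  # trailing chunk may be shorter than MIN_CHUNK
        chunks.append(current)
    return [c for c in chunks
            if c[-1]["end"] - c[0]["start"] <= MAX_CHUNK]
```

Each resulting chunk would then be cut from the source WAV and written out with its metadata record.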

Dataset statistics:

Measure        Value
Speakers       428
Segments       69,603
Words          4.57 million
Audio length   458 hours

Data split: 80% training / 20% evaluation
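One way to produce such a split is a deterministic shuffle over segments; the seed, the segment-level granularity, and the function name here are all assumptions (the split could equally be done at the speaker level to avoid speaker overlap between partitions):

```python
import random

def train_eval_split(segments, eval_frac=0.2, seed=42):
    """Shuffle segments deterministically and hold out eval_frac for evaluation."""
    segs = list(segments)
    rng = random.Random(seed)
    rng.shuffle(segs)
    n_eval = int(len(segs) * eval_frac)
    return segs[n_eval:], segs[:n_eval]  # (train, eval)
```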


Training Procedure

Six Whisper model sizes were fine-tuned; the released model corresponds to the best-performing configuration.

Training Hyperparameters

  • Learning rate: 2.5e-6
  • Weight decay: 0.02
  • Warm-up steps: 300
  • Effective batch size: 16
  • Epochs: 8
  • Logging interval: 200 steps
  • Checkpoint interval: 1,000 steps
  • Training regime: bf16 mixed precision
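The hyperparameters above can be collected into a configuration mapping, here written in the style of keyword arguments for `transformers.Seq2SeqTrainingArguments`. The mapping itself is illustrative, and the per-device batch size is an assumption (1 per device × 16 GPUs = effective batch size 16):

```python
# Fine-tuning hyperparameters from the list above (illustrative sketch)
training_config = {
    "learning_rate": 2.5e-6,
    "weight_decay": 0.02,
    "warmup_steps": 300,
    "per_device_train_batch_size": 1,  # assumption: 16 GPUs x 1 = effective 16
    "num_train_epochs": 8,
    "logging_steps": 200,
    "save_steps": 1_000,
    "bf16": True,
}
```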

Compute Infrastructure

Training was conducted on the LUMI supercomputer (CSC Finland) using:

  • 16 × AMD Instinct MI250X GPUs (128 GB memory each)

Evaluation

Testing Data

Two evaluation datasets were used:

  • MERaLiON NSC-derived test set
  • Random 1,000-segment sample from held-out NSC data

Metrics

  • WER (Word Error Rate)
  • CER (Character Error Rate)

These metrics assess transcription fidelity for conversational speech.
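Both metrics are Levenshtein edit distances normalized by reference length: WER operates over word tokens, CER over characters. A minimal reference implementation (libraries such as jiwer provide production versions):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / reference word count."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```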

Results

Fine-tuning produced substantial gains across all Whisper variants.
The best fine-tuned model slightly outperformed MERaLiON-2-10B-ASR despite being substantially smaller.

Example comparison (NSC sample):

Model                         WER      CER
whisper-large-v3 (baseline)   0.3302   0.2452
whisper-large-v2-ft           0.2245   0.1599
MERaLiON-2-10B-ASR            0.2553   0.1844

Model Examination

The model improves recoverability of interactional and morphosyntactic structure, enabling more reliable extraction of constructions and discourse particles in YCSEP_v2 and related linguistic analyses.


Environmental Impact

Training used shared HPC infrastructure (LUMI), which operates with a high proportion of renewable energy.

  • Hardware: AMD MI250X GPU cluster
  • Training duration: 8 epochs across 16 GPUs
  • Compute region: Finland (CSC)

Technical Specifications

Architecture and Objective

Sequence-to-sequence Transformer ASR model based on the Whisper architecture, optimized through supervised fine-tuning on aligned conversational speech.

Software

  • PyTorch
  • Hugging Face Transformers
  • WhisperX + pyannote (for downstream corpus creation)
  • spaCy 3.8 (linguistic annotation)

Citation

BibTeX

@article{coats2025ycsep,
  author = {Coats, Steven and Basile, Carmelo Alessandro and Morin, Cameron and Fuchs, Robert},
  title = {The YouTube Corpus of Singapore English Podcasts},
  journal = {English World-Wide},
  year = {2025},
  volume = {46},
  number = {3},
  pages = {274--298},
  doi = {10.1075/eww.25018.coa}
}