# Model Card for whisper-large-v3-NSC

## Model Details

### Model Description
whisper-large-v3-NSC is a domain-adapted automatic speech recognition (ASR) model for conversational Singapore English (SgE).
It is a fine-tuned version of OpenAI’s Whisper Large-V3 model trained on aligned conversational speech from Part 3 of the National Speech Corpus (NSC). The model is optimized to improve the recoverability of interactional, morphosyntactic, and discourse-pragmatic features characteristic of contemporary spoken Singapore English.
The model was developed as part of a workflow for constructing YCSEP_v2 (YouTube Corpus of Singapore English Podcasts, version 2), demonstrating how targeted fine-tuning can improve linguistic fidelity in large-scale corpus creation.
- Developed by: Steven Coats, Carmelo Alessandro Basile, Cameron Morin, Robert Fuchs
- Funded by: EU NextGenerationEU / Research Council of Finland (grant 358720)
- Model type: Automatic Speech Recognition (seq2seq Transformer)
- Language(s): English (Singapore English)
- License: MIT
- Finetuned from model: openai/whisper-large-v3
### Model Sources
- Related publication: Coats et al. (2025), The YouTube Corpus of Singapore English Podcasts, English World-Wide
- Corpus produced with the model: YCSEP_v2
## Uses

### Direct Use
The model is intended for:
- Transcription of conversational Singapore English speech
- Linguistic corpus construction and annotation workflows
- Research in World Englishes, sociolinguistics, and interactional linguistics
- Recovering discourse particles and local morphosyntax often missed by general ASR systems
### Downstream Use
The model can be integrated into pipelines involving:
- Speaker diarization (e.g., WhisperX + pyannote)
- POS tagging and syntactic annotation (e.g., spaCy pipelines)
- Construction-grammar-based corpus analysis
- Speech-based sociolinguistic research
### Out-of-Scope Use
This model is not optimized for:
- Standard American/British broadcast speech
- Multilingual ASR outside the Singapore English domain
- Real-time or low-latency applications
- Legal, medical, or safety-critical transcription
## Bias, Risks, and Limitations
- The model is trained narrowly on conversational Singapore English and may underperform on other English varieties.
- Training data reflects the demographic distribution of NSC Part 3 and is not fully balanced sociolinguistically.
- Domain specialization may bias outputs toward informal conversational registers.
- The model is designed for research and corpus-building, not general-purpose ASR deployment.
### Recommendations
Users should evaluate performance carefully before applying the model outside conversational Singapore English contexts.
## How to Get Started with the Model

```python
from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

# Download the fine-tuned weights from the Hugging Face Hub
model_dir = snapshot_download("stcoats/whisper-large-v3-NSC")

# Load with faster-whisper (on CPU, use device="cpu", compute_type="int8")
model = WhisperModel(model_dir, device="cuda", compute_type="float16")
```
## Training Details

### Training Data
Training data was derived from Part 3 of the National Speech Corpus (NSC), consisting of same-room conversational recordings between two speakers (friends, partners, or family members).
Processing steps included:
- Alignment of WAV recordings with Praat TextGrid transcripts
- Removal of markup and non-speech symbols
- Merging adjacent utterances separated by ≤ 0.25 s pauses
- Segmentation into 10–30 second conversational chunks
- Export of aligned 16-bit PCM WAV segments with metadata (JSONL format)
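The utterance-merging step above can be sketched as follows. `merge_utterances` and the `(start, end, text)` tuple representation are illustrative assumptions, not the actual preprocessing code:

```python
def merge_utterances(utterances, max_gap=0.25):
    """Merge adjacent (start, end, text) utterances separated by short pauses.

    Utterances whose inter-utterance gap is <= max_gap seconds are joined
    into a single segment, mirroring the 0.25 s threshold described above.
    """
    merged = [list(utterances[0])]
    for start, end, text in utterances[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end             # extend the segment boundary
            merged[-1][2] += " " + text     # concatenate the transcript
        else:
            merged.append([start, end, text])
    return [tuple(u) for u in merged]
```

For example, `merge_utterances([(0.0, 1.0, "okay"), (1.1, 2.0, "lah"), (3.0, 4.0, "then")])` joins the first two utterances (0.1 s gap) but keeps the third separate.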
Dataset statistics:
| Measure | Value |
|---|---|
| Speakers | 428 |
| Segments | 69,603 |
| Words | 4.57 million |
| Audio length | 458 hours |
Data split: 80% training / 20% evaluation
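A minimal sketch of such a split, assuming a random per-segment partition (the card does not state whether the split was stratified by speaker):

```python
import random

def split_segments(segments, train_frac=0.8, seed=42):
    """Randomly partition segments into training and evaluation subsets."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```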
### Training Procedure

Six Whisper model sizes were fine-tuned; this released model corresponds to the best-performing configuration.

#### Training Hyperparameters
- Learning rate: 2.5e-6
- Weight decay: 0.02
- Warm-up steps: 300
- Effective batch size: 16
- Epochs: 8
- Logging interval: 200 steps
- Checkpoint interval: 1,000 steps
- Training regime: bf16 mixed precision
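In Hugging Face Transformers terms, these hyperparameters roughly correspond to the following `Seq2SeqTrainingArguments` fields. This is a sketch: the per-device batch size / gradient-accumulation split behind the effective batch size of 16 is not stated in the card.

```python
# Keys follow transformers.Seq2SeqTrainingArguments naming; values are the
# hyperparameters reported above.
training_config = {
    "learning_rate": 2.5e-6,
    "weight_decay": 0.02,
    "warmup_steps": 300,
    "num_train_epochs": 8,
    "logging_steps": 200,
    "save_steps": 1000,
    "bf16": True,
    # Effective batch size of 16 = per_device_train_batch_size
    # * gradient_accumulation_steps * number of GPUs (split not reported).
}
```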
### Compute Infrastructure
Training was conducted on the LUMI supercomputer (CSC Finland) using:
- 16 × AMD Instinct MI250X GPUs (128 GB memory each)
## Evaluation

### Testing Data
Two evaluation datasets were used:
- MERaLiON NSC-derived test set
- Random 1,000-segment sample from held-out NSC data
### Metrics
- WER (Word Error Rate)
- CER (Character Error Rate)
These metrics assess transcription fidelity for conversational speech.
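Both metrics are normalized edit distances: WER divides the word-level Levenshtein distance by the reference word count, and CER divides the character-level distance by the reference character count. A minimal reference implementation (not the evaluation code used for the paper) is:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```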
### Results
Fine-tuning produced substantial gains across all Whisper variants.
The best fine-tuned model slightly outperformed MERaLiON-2-10B-ASR despite being substantially smaller.
Example comparison (NSC sample):
| Model | WER | CER |
|---|---|---|
| whisper-large-v3 (baseline) | 0.3302 | 0.2452 |
| whisper-large-v2-ft | 0.2245 | 0.1599 |
| MERaLiON-2-10B-ASR | 0.2553 | 0.1844 |
## Model Examination
The model improves recoverability of interactional and morphosyntactic structure, enabling more reliable extraction of constructions and discourse particles in YCSEP_v2 and related linguistic analyses.
## Environmental Impact
Training used shared HPC infrastructure (LUMI), which operates with a high proportion of renewable energy.
- Hardware: AMD MI250X GPU cluster
- Training duration: 8 epochs across 16 GPUs
- Compute region: Finland (CSC)
## Technical Specifications

### Architecture and Objective
Sequence-to-sequence Transformer ASR model based on the Whisper architecture, optimized through supervised fine-tuning on aligned conversational speech.
### Software
- PyTorch
- Hugging Face Transformers
- WhisperX + pyannote (for downstream corpus creation)
- spaCy 3.8 (linguistic annotation)
## Citation

### BibTeX

```bibtex
@article{coats2025ycsep,
  author  = {Coats, Steven and Basile, Carmelo Alessandro and Morin, Cameron and Fuchs, Robert},
  title   = {The YouTube Corpus of Singapore English Podcasts},
  journal = {English World-Wide},
  year    = {2025},
  volume  = {46},
  number  = {3},
  pages   = {274--298},
  doi     = {10.1075/eww.25018.coa}
}
```