# Model Card for whisper-large-v3-NSC

## Model Details

### Model Description
whisper-large-v3-NSC is a domain-adapted automatic speech recognition (ASR) model for conversational Singapore English (SgE).
It is a fine-tuned version of OpenAI’s Whisper Large-V3 model trained on aligned conversational speech from Part 3 of the National Speech Corpus (NSC). The model is optimized to improve the recoverability of interactional, morphosyntactic, and discourse-pragmatic features characteristic of contemporary spoken Singapore English.
The model was developed as part of a workflow for constructing YCSEP_v2 (YouTube Corpus of Singapore English Podcasts, version 2), demonstrating how targeted fine-tuning can improve linguistic fidelity in large-scale corpus creation.
- Developed by: Steven Coats, Carmelo Alessandro Basile, Cameron Morin, Robert Fuchs
- Funded by: EU NextGenerationEU / Research Council of Finland (grant 358720)
- Model type: Automatic Speech Recognition (seq2seq Transformer)
- Language(s): English (Singapore English)
- License: MIT
- Finetuned from model: openai/whisper-large-v3
### Model Sources
- Related publication: Coats et al. (2025), The YouTube Corpus of Singapore English Podcasts, English World-Wide
- Corpus produced with the model: YCSEP_v2
## Uses

### Direct Use
The model is intended for:
- Transcription of conversational Singapore English speech
- Linguistic corpus construction and annotation workflows
- Research in World Englishes, sociolinguistics, and interactional linguistics
- Recovering discourse particles and local morphosyntax often missed by general ASR systems
### Downstream Use
The model can be integrated into pipelines involving:
- Speaker diarization (e.g., WhisperX + pyannote)
- POS tagging and syntactic annotation (e.g., spaCy pipelines)
- Construction-grammar-based corpus analysis
- Speech-based sociolinguistic research
### Out-of-Scope Use
This model is not optimized for:
- Standard American/British broadcast speech
- Multilingual ASR outside the Singapore English domain
- Real-time or low-latency applications
- Legal, medical, or safety-critical transcription
## Bias, Risks, and Limitations
- The model is trained narrowly on conversational Singapore English and may underperform on other English varieties.
- Training data reflects the demographic distribution of NSC Part 3 and is not fully balanced sociolinguistically.
- Domain specialization may bias outputs toward informal conversational registers.
- The model is designed for research and corpus-building, not general-purpose ASR deployment.
### Recommendations
Users should evaluate performance carefully before applying the model outside conversational Singapore English contexts.
## How to Get Started with the Model

```python
from faster_whisper import WhisperModel
from huggingface_hub import snapshot_download

# Download the fine-tuned weights from the Hugging Face Hub
model_dir = snapshot_download("stcoats/whisper-large-v3-NSC")

# Load with faster-whisper (on CPU, use device="cpu", compute_type="int8")
model = WhisperModel(model_dir, device="cuda", compute_type="float16")
```
## Training Details

### Training Data
Training data was derived from Part 3 of the National Speech Corpus (NSC), consisting of same-room conversational recordings between two speakers (friends, partners, or family members).
Processing steps included:
- Alignment of WAV recordings with Praat TextGrid transcripts
- Removal of markup and non-speech symbols
- Merging adjacent utterances separated by ≤ 0.25 s pauses
- Segmentation into 10–30 second conversational chunks
- Export of aligned 16-bit PCM WAV segments with metadata (JSONL format)
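The utterance-merging step above can be sketched as follows. `merge_utterances` and the `(start, end, text)` tuple representation are illustrative assumptions, not the actual preprocessing code:

```python
def merge_utterances(utterances, max_gap=0.25):
    """Merge adjacent (start, end, text) utterances separated by short pauses.

    Utterances whose inter-utterance gap is <= max_gap seconds are joined
    into a single segment, mirroring the 0.25 s threshold described above.
    """
    merged = [list(utterances[0])]
    for start, end, text in utterances[1:]:
        if start - merged[-1][1] <= max_gap:
            merged[-1][1] = end             # extend the segment boundary
            merged[-1][2] += " " + text     # concatenate the transcript
        else:
            merged.append([start, end, text])
    return [tuple(u) for u in merged]
```

For example, `merge_utterances([(0.0, 1.0, "okay"), (1.1, 2.0, "lah"), (3.0, 4.0, "then")])` joins the first two utterances (0.1 s gap) but keeps the third separate.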
Dataset statistics:
| Measure | Value |
|---|---|
| Speakers | 428 |
| Segments | 69,603 |
| Words | 4.57 million |
| Audio length | 458 hours |
Data split: 80% training / 20% evaluation
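A minimal sketch of such a split, assuming a random per-segment partition (the card does not state whether the split was stratified by speaker):

```python
import random

def split_segments(segments, train_frac=0.8, seed=42):
    """Randomly partition segments into training and evaluation subsets."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```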
### Training Procedure

Six Whisper model sizes were fine-tuned; this released model corresponds to the best-performing configuration.

#### Training Hyperparameters
- Learning rate: 2.5e-6
- Weight decay: 0.02
- Warm-up steps: 300
- Effective batch size: 16
- Epochs: 8
- Logging interval: 200 steps
- Checkpoint interval: 1,000 steps
- Training regime: bf16 mixed precision
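In Hugging Face Transformers terms, these hyperparameters roughly correspond to the following `Seq2SeqTrainingArguments` fields. This is a sketch: the per-device batch size / gradient-accumulation split behind the effective batch size of 16 is not stated in the card.

```python
# Keys follow transformers.Seq2SeqTrainingArguments naming; values are the
# hyperparameters reported above.
training_config = {
    "learning_rate": 2.5e-6,
    "weight_decay": 0.02,
    "warmup_steps": 300,
    "num_train_epochs": 8,
    "logging_steps": 200,
    "save_steps": 1000,
    "bf16": True,
    # Effective batch size of 16 = per_device_train_batch_size
    # * gradient_accumulation_steps * number of GPUs (split not reported).
}
```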
### Compute Infrastructure
Training was conducted on the LUMI supercomputer (CSC Finland) using:
- 16 × AMD Instinct MI250X GPUs (128 GB memory each)
## Evaluation

### Testing Data
Two evaluation datasets were used:
- MERaLiON NSC-derived test set
- Random 1,000-segment sample from held-out NSC data
### Metrics
- WER (Word Error Rate)
- CER (Character Error Rate)
These metrics assess transcription fidelity for conversational speech.
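Both metrics are normalized edit distances: WER divides the word-level Levenshtein distance by the reference word count, and CER divides the character-level distance by the reference character count. A minimal reference implementation (not the evaluation code used for the paper) is:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```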
### Results
Fine-tuning produced substantial gains across all Whisper variants.
The best fine-tuned model slightly outperformed MERaLiON-2-10B-ASR despite being substantially smaller.
Example comparison (NSC sample):
| Model | WER | CER |
|---|---|---|
| whisper-large-v3 (baseline) | 0.3302 | 0.2452 |
| whisper-large-v2-ft | 0.2245 | 0.1599 |
| MERaLiON-2-10B-ASR | 0.2553 | 0.1844 |
## Model Examination
The model improves recoverability of interactional and morphosyntactic structure, enabling more reliable extraction of constructions and discourse particles in YCSEP_v2 and related linguistic analyses.
## Environmental Impact
Training used shared HPC infrastructure (LUMI), which operates with a high proportion of renewable energy.
- Hardware: AMD MI250X GPU cluster
- Training duration: 8 epochs across 16 GPUs
- Compute region: Finland (CSC)
## Technical Specifications

### Architecture and Objective
Sequence-to-sequence Transformer ASR model based on the Whisper architecture, optimized through supervised fine-tuning on aligned conversational speech.
### Software
- PyTorch
- Hugging Face Transformers
- WhisperX + pyannote (for downstream corpus creation)
- spaCy 3.8 (linguistic annotation)
## Citation

### BibTeX

```bibtex
@article{coats2025ycsep,
  author  = {Coats, Steven and Basile, Carmelo Alessandro and Morin, Cameron and Fuchs, Robert},
  title   = {The YouTube Corpus of Singapore English Podcasts},
  journal = {English World-Wide},
  year    = {2025},
  volume  = {46},
  number  = {3},
  pages   = {274--298},
  doi     = {10.1075/eww.25018.coa}
}
```