---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
  results:
  - task:
      type: automatic-speech-recognition
    metrics:
    - name: WER
      type: wer
      value: 0.66
---
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.
## Performance
| Run | WER | Base | Data | Key Change |
|---|---|---|---|---|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| ct2_run8 | 0.66% | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |
Note: ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While its WER on the eval set is numerically higher than run7's, run8 generalises better to real-world ATC audio because it trains from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
## Model Details
| Key | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
## Training

- Full fine-tune from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device x 16 gradient accumulation)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)
See `hyperparameters.md` for the full training configuration.
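Assuming the run used the Hugging Face `transformers` `Seq2SeqTrainer` (the card does not name the training framework), the hyperparameters above map onto a configuration sketch like the following; the output directory and epoch cap are placeholders, not values from this card:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the listed hyperparameters; paths and the
# epoch cap are placeholders.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-atc",  # placeholder
    per_device_train_batch_size=1,      # effective batch 16 via accumulation
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                  # 5% warmup
    optim="adamw_bnb_8bit",             # AdamW 8-bit via bitsandbytes
    fp16=True,
    gradient_checkpointing=True,
    num_train_epochs=20,                # early stopping halted the run at epoch 11
)
```

Early stopping with patience 5 would be wired in separately (e.g. via `EarlyStoppingCallback`), which is why the epoch cap above is only an upper bound.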
## Augmentation
- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7s each end)
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, +/-0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
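The band-pass, noise, and clipping steps above are straightforward to reproduce. The sketch below approximates the 300-3400 Hz VHF channel simulation with an FFT brick-wall filter plus Gaussian noise and hard clipping; the card does not name the augmentation library used in training, so the implementation details here are illustrative:

```python
import numpy as np

def simulate_vhf(audio: np.ndarray, sr: int = 16000,
                 low_hz: float = 300.0, high_hz: float = 3400.0,
                 noise_amp: float = 0.01, clip_level: float = 0.8) -> np.ndarray:
    """Illustrative VHF radio simulation: band-pass, noise, hard clip."""
    # Brick-wall band-pass via the real FFT (an approximation of the
    # BandPassFilter augmentation listed above).
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    filtered = np.fft.irfft(spectrum, n=len(audio))
    # Additive Gaussian noise (amplitude 0.001-0.015 during training).
    noisy = filtered + np.random.normal(0.0, noise_amp, size=len(audio))
    # Hard clipping at +/-0.8, matching the Clipping augmentation.
    return np.clip(noisy, -clip_level, clip_level)

# Example: a 1 kHz tone sits inside the passband and survives;
# a 100 Hz tone would be removed by the filter.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
out = simulate_vhf(0.5 * np.sin(2 * np.pi * 1000 * t))
```

In training these transforms are applied stochastically (the `p=` values above); this sketch applies them unconditionally for clarity.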
## Results
| Epoch | Eval loss | WER |
|---|---|---|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| 6.0 | 0.0231 | 0.66% |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |
## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
        "glidepath centreline talkdown sigmet cavok colour "
        "downwind crosswind upwind abeam initials pitchout "
        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
        "jaguar lancer niner decimal flight level runway"
    ),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```
## Output Format

The model outputs normalized spoken text (lowercase, fully expanded):
| Input audio says | Model outputs |
|---|---|
| "CAMEL climb flight level zero nine zero" | camel climb flight level zero nine zero |
| "Contact Tengah Approach one three zero decimal zero" | contact tengah approach one three zero decimal zero |
| "Squawk seven seven zero zero" | squawk seven seven zero zero |
A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., CAMEL climb FL090). See the ASTRA simpilot pipeline for the full integration.
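The 23 rules themselves are not published in this card, but two representative ones (contracting spoken flight levels and uppercasing known callsigns) can be sketched as follows; the function name, callsign list, and rule details here are illustrative, not the actual ASTRA implementation:

```python
import re

# Hypothetical subset of the rule set described above; the real formatter's
# 23 rules are not published in this card.
CALLSIGNS = {"camel", "taipan", "scorpion", "maverick"}  # illustrative list
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8",
          "nine": "9", "niner": "9"}

def format_display(text: str) -> str:
    """Normalized spoken text -> display text (illustrative rules only)."""
    # Rule 1: contract "flight level <d> <d> <d>" to "FL<ddd>".
    # Longer digit words first so "niner" is not matched as "nine" + "r".
    digit_word = "|".join(sorted(DIGITS, key=len, reverse=True))
    pattern = re.compile(
        rf"flight level ({digit_word}) ({digit_word}) ({digit_word})")
    text = pattern.sub(
        lambda m: "FL" + "".join(DIGITS[g] for g in m.groups()), text)
    # Rule 2: uppercase known callsigns.
    return " ".join(w.upper() if w in CALLSIGNS else w for w in text.split())
```

Because every rule is deterministic string manipulation, the whole pass stays well under a millisecond and needs no GPU, which matches the formatter's stated footprint.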