---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
  results:
  - task:
      type: automatic-speech-recognition
    metrics:
    - name: WER
      type: wer
      value: 0.66
---

# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.

## Performance

| Run | WER | Base | Data | Key Change |
|-----|-----|------|------|------------|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |

> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While its WER on the eval set is numerically higher than run7's, run8 generalises better to real-world ATC audio because it trains from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
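The WER figures above are standard word error rate: word-level edit distance divided by the number of reference words. For readers who want to reproduce the metric, here is a minimal self-contained reference implementation (a sketch, not this project's evaluation code — the actual runs likely used a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("camel climb flight level zero nine zero",
          "camel climb flight level zero niner zero"))
# 1 substitution over 7 reference words ≈ 0.1429
```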
## Model Details

| Key | Value |
|-----|-------|
| Base model | `openai/whisper-large-v3` |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |

## Training

- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient accumulation steps)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)

See [hyperparameters.md](./hyperparameters.md) for the full training configuration.

### Augmentation

- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7 s each end)
- Band-pass filter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, ±0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)

### Results

| Epoch | Eval loss | WER |
|-------|-----------|-----|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| **6.0** | 0.0231 | **0.66%** |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |

## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
"glidepath centreline talkdown sigmet cavok colour " "downwind crosswind upwind abeam initials pitchout " "mekong taipan kingcup scorpion scallop termite carlton snakefly " "basking pelican cobra earlgrey bluebell maverick wolfman stinger " "jaguar lancer niner decimal flight level runway" ), ) text = " ".join(seg.text.strip() for seg in segments) # "camel cleared i l s approach runway three six" ``` ## Output Format The model outputs **normalized spoken text** (lowercase, fully expanded): | Input audio says | Model outputs | |-----------------|---------------| | "CAMEL climb flight level zero nine zero" | `camel climb flight level zero nine zero` | | "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` | | "Squawk seven seven zero zero" | `squawk seven seven zero zero` | A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.