---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
  results:
  - task:
      type: automatic-speech-recognition
    metrics:
    - name: WER
      type: wer
      value: 0.66
---
| |
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.

## Performance

| | Run | WER | Base | Data | Key Change | |
| |-----|-----|------|------|------------| |
| | ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune | |
| | ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay | |
| | ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder | |
| | **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation | |

> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). Although its eval-set WER is numerically higher than run7's, run8 generalises better to real-world ATC audio because it trains from a more general acoustic foundation with aggressive VHF radio-simulation augmentation.

## Model Details

| | Key | Value | |
| |-----|-------| |
| | Base model | `openai/whisper-large-v3` | |
| | Format | CTranslate2 float16 | |
| | Size | 2.9 GB | |
| | Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) | |
| | Best WER | 0.66% (epoch 6) | |
| | Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) | |
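
The CTranslate2 float16 export noted above can be produced with the converter CLI that ships with CTranslate2. A sketch only; both directory paths are illustrative placeholders, not the actual checkpoint names:

```shell
# Convert a fine-tuned Hugging Face Whisper checkpoint to CTranslate2 float16.
# Replace both paths with your local checkpoint and output directories.
ct2-transformers-converter \
  --model ./whisper-large-v3-atc-singapore \
  --output_dir ./whisper-large-v3-atc-ct2 \
  --quantization float16
```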

## Training

- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient-accumulation steps)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11; best at epoch 6)

See [hyperparameters.md](./hyperparameters.md) for the full training configuration.
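
The effective batch size of 16 comes from gradient accumulation rather than a large per-device batch. A minimal sketch of the arithmetic (the helper name is hypothetical, not part of the training code):

```python
ACCUM_STEPS = 16  # gradient-accumulation steps, per the config above

def count_updates(num_examples: int, accum_steps: int = ACCUM_STEPS) -> int:
    """Count optimizer updates for micro-batches of size 1.

    loss.backward() would run every step; optimizer.step() and
    zero_grad() run only once per `accum_steps` micro-batches, so each
    update averages gradients over an effective batch of 16 examples.
    """
    updates = 0
    for step in range(1, num_examples + 1):
        if step % accum_steps == 0:
            updates += 1
    return updates
```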

### Augmentation

- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7 s at each end)
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, +/-0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
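
A plain-Python sketch of three of these waveform-level steps, using the probabilities and ranges listed above (the real pipeline presumably uses an augmentation library; the function name and `seed` parameter here are illustrative):

```python
import random

def augment(samples, sr=16000, seed=None):
    """Apply Gaussian noise, random silence padding, and clipping.

    `samples` is a list of floats in [-1, 1] at sample rate `sr`.
    """
    rng = random.Random(seed)
    out = list(samples)
    # Gaussian noise (p=0.4, amplitude 0.001-0.015)
    if rng.random() < 0.4:
        amp = rng.uniform(0.001, 0.015)
        out = [x + rng.gauss(0.0, amp) for x in out]
    # Random silence padding (p=0.5, 0-0.7 s at each end)
    if rng.random() < 0.5:
        left = [0.0] * int(rng.uniform(0.0, 0.7) * sr)
        right = [0.0] * int(rng.uniform(0.0, 0.7) * sr)
        out = left + out + right
    # Clipping (p=0.2, clamp to +/-0.8)
    if rng.random() < 0.2:
        out = [max(-0.8, min(0.8, x)) for x in out]
    return out
```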

### Results

| | Epoch | Eval loss | WER | |
| |-------|-----------|-----| |
| | 1.0 | 0.0496 | 3.46% | |
| | 2.0 | 0.0288 | 1.84% | |
| | 3.0 | 0.0239 | 0.82% | |
| | 4.0 | 0.0245 | 1.55% | |
| | 5.0 | 0.0195 | 0.92% | |
| | **6.0** | 0.0231 | **0.66%** | |
| | 7.0 | 0.0199 | 0.70% | |
| | 8.0 | 0.0211 | 2.62% | |
| | 9.0 | 0.0191 | 0.72% | |
| | 10.0 | 0.0186 | 4.43% | |
| | 11.0 | 0.0172 | 0.69% | |

## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
        "glidepath centreline talkdown sigmet cavok colour "
        "downwind crosswind upwind abeam initials pitchout "
        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
        "jaguar lancer niner decimal flight level runway"
    ),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```

## Output Format

The model outputs **normalized spoken text** (lowercase, fully expanded):

| | Input audio says | Model outputs | |
| |-----------------|---------------| |
| | "CAMEL climb flight level zero nine zero" | `camel climb flight level zero nine zero` | |
| | "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` | |
| | "Squawk seven seven zero zero" | `squawk seven seven zero zero` | |

A companion rule-based formatter (23 deterministic rules, <1 ms, zero VRAM) converts this to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.
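
The formatter itself is not part of this repository. To illustrate the idea, here is a sketch of two such deterministic rules; the function names and the specific rules are illustrative, not the actual 23-rule set:

```python
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9", "niner": "9",
}

def collapse_digits(words):
    """Join each run of spoken digits into one numeral group."""
    out, run = [], []
    for w in words:
        if w in DIGIT_WORDS:
            run.append(DIGIT_WORDS[w])
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(w)
    if run:
        out.append("".join(run))
    return out

def format_atc(text: str) -> str:
    words = text.split()
    if words:
        words[0] = words[0].upper()  # illustrative rule: uppercase the leading callsign
    joined = " ".join(collapse_digits(words))
    # illustrative rule: "flight level 090" -> "FL090"
    return re.sub(r"\bflight level (\d+)\b", r"FL\1", joined)
```

For example, `format_atc("camel climb flight level zero nine zero")` yields `CAMEL climb FL090`.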