---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
results:
- task:
type: automatic-speech-recognition
metrics:
- name: WER
type: wer
value: 0.66
---
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
Fine-tuned Whisper Large v3 for Republic of Singapore Air Force air traffic control speech recognition.
## Performance
| Run | WER | Base | Data | Key Change |
|-----|-----|------|------|------------|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |
> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
## Model Details
| Key | Value |
|-----|-------|
| Base model | `openai/whisper-large-v3` |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
## Training
- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient accumulation steps)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)
See [hyperparameters.md](./hyperparameters.md) for full training configuration.
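Assuming the Hugging Face `Trainer` stack was used (not confirmed by this card), the bullet points above map onto a `Seq2SeqTrainingArguments` configuration roughly like the sketch below; `output_dir`, `num_train_epochs`, and `predict_with_generate` are placeholders or assumptions, the rest mirrors the listed values.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Sketch of the training setup described above; values not listed in the
# card (output_dir, num_train_epochs) are placeholders.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-atc-singapore",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                # 5% warmup
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",           # AdamW 8-bit via bitsandbytes
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,
)

# Early stopping with patience 5 epochs, as in the run above
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```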
### Augmentation
- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7s each end)
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, +/-0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
### Results
| Epoch | Eval loss | WER |
|-------|-----------|-----|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| **6.0** | 0.0231 | **0.66%** |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |
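WER in the tables above is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained sketch of the metric (not the evaluation code used for these runs):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```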
## Usage
```python
from faster_whisper import WhisperModel
model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
"audio.wav",
language="en",
beam_size=5,
    # Domain lexicon (waypoints, airbases, callsigns, ATC phraseology)
    # used to bias decoding toward in-domain vocabulary
    hotwords=(
"tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
"sembawang macritchie johor tekong batam hosba sijan changi nylon "
"arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
"qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
"glidepath centreline talkdown sigmet cavok colour "
"downwind crosswind upwind abeam initials pitchout "
"mekong taipan kingcup scorpion scallop termite carlton snakefly "
"basking pelican cobra earlgrey bluebell maverick wolfman stinger "
"jaguar lancer niner decimal flight level runway"
),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```
## Output Format
The model outputs **normalized spoken text** (lowercase, fully expanded):
| Input audio says | Model outputs |
|-----------------|---------------|
| "CAMEL climb flight level zero nine zero" | `camel climb flight level zero nine zero` |
| "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
| "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
A companion rule-based formatter (23 deterministic rules, <1 ms per transcript, no GPU memory) converts the spoken-form output to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.
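To illustrate the style of such rules (the callsign list and rules below are hypothetical examples, not the actual 23-rule formatter):

```python
# Spoken-digit vocabulary, including ICAO "niner"
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
          "niner": "9"}
CALLSIGNS = {"camel", "taipan", "mekong"}  # hypothetical subset

def to_display(text: str) -> str:
    """Apply two example rules: 'flight level <digits>' -> FL###, callsigns uppercased."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        # Rule: "flight level" followed by spoken digits collapses to FL###
        if words[i] == "flight" and i + 1 < len(words) and words[i + 1] == "level":
            j = i + 2
            digits = []
            while j < len(words) and words[j] in DIGITS:
                digits.append(DIGITS[words[j]])
                j += 1
            if digits:
                out.append("FL" + "".join(digits))
                i = j
                continue
        # Rule: known callsigns are rendered uppercase
        out.append(words[i].upper() if words[i] in CALLSIGNS else words[i])
        i += 1
    return " ".join(out)
```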