---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
results:
- task:
type: automatic-speech-recognition
metrics:
- name: WER
type: wer
value: 0.66
---
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
Fine-tuned Whisper Large v3 for Republic of Singapore Air Force air traffic control speech recognition.
## Performance
| Run | WER | Base | Data | Key Change |
|-----|-----|------|------|------------|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |
> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
## Model Details
| Key | Value |
|-----|-------|
| Base model | `openai/whisper-large-v3` |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
## Training
- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient accumulation steps)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)
See [hyperparameters.md](./hyperparameters.md) for full training configuration.
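Assuming the Hugging Face `Trainer` stack was used (not confirmed by this card), the bullet points above map onto a `Seq2SeqTrainingArguments` configuration roughly like the sketch below; `output_dir`, `num_train_epochs`, and `predict_with_generate` are placeholders or assumptions, the rest mirrors the listed values.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Sketch of the training setup described above; values not listed in the
# card (output_dir, num_train_epochs) are placeholders.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-atc-singapore",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                # 5% warmup
    fp16=True,
    gradient_checkpointing=True,
    optim="adamw_bnb_8bit",           # AdamW 8-bit via bitsandbytes
    num_train_epochs=20,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    predict_with_generate=True,
)

# Early stopping with patience 5 epochs, as in the run above
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```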
### Augmentation
- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7s each end)
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, +/-0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
### Results
| Epoch | Eval loss | WER |
|-------|-----------|-----|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| **6.0** | 0.0231 | **0.66%** |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |
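WER in the tables above is word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal self-contained sketch of the metric (not the evaluation code used for these runs):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```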
## Usage
```python
from faster_whisper import WhisperModel
model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
"audio.wav",
language="en",
beam_size=5,
    # Domain lexicon (waypoints, airbases, callsigns, ATC phraseology)
    # used to bias decoding toward in-domain vocabulary
    hotwords=(
"tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
"sembawang macritchie johor tekong batam hosba sijan changi nylon "
"arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
"qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
"glidepath centreline talkdown sigmet cavok colour "
"downwind crosswind upwind abeam initials pitchout "
"mekong taipan kingcup scorpion scallop termite carlton snakefly "
"basking pelican cobra earlgrey bluebell maverick wolfman stinger "
"jaguar lancer niner decimal flight level runway"
),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```
## Output Format
The model outputs **normalized spoken text** (lowercase, fully expanded):
| Input audio says | Model outputs |
|-----------------|---------------|
| "CAMEL climb flight level zero nine zero" | `camel climb flight level zero nine zero` |
| "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
| "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
A companion rule-based formatter (23 deterministic rules, <1 ms per transcript, no GPU memory) converts the spoken-form output to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.
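To illustrate the style of such rules (the callsign list and rules below are hypothetical examples, not the actual 23-rule formatter):

```python
# Spoken-digit vocabulary, including ICAO "niner"
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
          "niner": "9"}
CALLSIGNS = {"camel", "taipan", "mekong"}  # hypothetical subset

def to_display(text: str) -> str:
    """Apply two example rules: 'flight level <digits>' -> FL###, callsigns uppercased."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        # Rule: "flight level" followed by spoken digits collapses to FL###
        if words[i] == "flight" and i + 1 < len(words) and words[i + 1] == "level":
            j = i + 2
            digits = []
            while j < len(words) and words[j] in DIGITS:
                digits.append(DIGITS[words[j]])
                j += 1
            if digits:
                out.append("FL" + "".join(digits))
                i = j
                continue
        # Rule: known callsigns are rendered uppercase
        out.append(words[i].upper() if words[i] in CALLSIGNS else words[i])
        i += 1
    return " ".join(out)
```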