---
language:
  - en
license: other
tags:
  - whisper
  - ctranslate2
  - automatic-speech-recognition
  - air-traffic-control
  - atc
  - singapore
  - military
  - faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
  - wer
model-index:
  - name: whisper-large-v3-atc-singapore
    results:
      - task:
          type: automatic-speech-recognition
        metrics:
          - name: WER
            type: wer
            value: 0.66
---

# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.

## Performance

| Run | WER | Base | Data | Key change |
|---|---|---|---|---|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| ct2_run8 | 0.66% | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |

**Note:** ct2_run8 starts from the original openai/whisper-large-v3 base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
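For reference, the WER figures above are standard word error rate: word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. Evaluation pipelines typically use a library such as `jiwer`; the sketch below is a minimal self-contained reimplementation for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over seven reference words -> 1/7
print(wer("camel cleared ils approach runway three six",
          "camel cleared ils approach runway tree six"))
```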

## Model Details

| Key | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |

## Training

- Full fine-tune from openai/whisper-large-v3 (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient accumulation)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)

See hyperparameters.md for full training configuration.
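The settings listed above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the actual training script: the argument names come from `transformers`, but `output_dir` and any value not stated in the list above are assumptions.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-atc",    # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,       # effective batch size 1 x 16 = 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                    # 5% warmup
    optim="adamw_bnb_8bit",               # AdamW 8-bit via bitsandbytes
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,              # lower WER is better
)
# Early stopping with patience 5, matching the run that stopped at epoch 11
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```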

## Augmentation

- Gaussian noise (p=0.4, amplitude 0.001–0.015)
- Time stretch (p=0.3, rate 0.9–1.1)
- Random silence padding (p=0.5, 0–0.7 s each end)
- BandPassFilter (p=0.75, 300–3400 Hz VHF radio simulation)
- Clipping (p=0.2, ±0.8)
- MP3 compression (p=0.3, 32–64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
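The actual pipeline most likely uses an augmentation library; purely to illustrate what three of the waveform-level transforms above do, here is a NumPy-only sketch (function name and structure are my own, not the training code):

```python
import numpy as np

def augment(wave: np.ndarray, sr: int = 16000, rng=None) -> np.ndarray:
    """Illustrative subset of the waveform augmentations listed above."""
    rng = rng or np.random.default_rng()
    # Gaussian noise (p=0.4, amplitude 0.001-0.015)
    if rng.random() < 0.4:
        amp = rng.uniform(0.001, 0.015)
        wave = wave + rng.normal(0.0, amp, size=wave.shape)
    # Random silence padding (p=0.5, 0-0.7 s at each end)
    if rng.random() < 0.5:
        pad_left = int(rng.uniform(0.0, 0.7) * sr)
        pad_right = int(rng.uniform(0.0, 0.7) * sr)
        wave = np.pad(wave, (pad_left, pad_right))
    # Clipping (p=0.2, +/-0.8)
    if rng.random() < 0.2:
        wave = np.clip(wave, -0.8, 0.8)
    return wave.astype(np.float32)
```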

## Results

| Epoch | Eval loss | WER |
|---|---|---|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| 6.0 | 0.0231 | 0.66% |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |

## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
        "glidepath centreline talkdown sigmet cavok colour "
        "downwind crosswind upwind abeam initials pitchout "
        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
        "jaguar lancer niner decimal flight level runway"
    ),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```

## Output Format

The model outputs normalized spoken text (lowercase, fully expanded):

| Input audio says | Model outputs |
|---|---|
| "CAMEL climb flight level zero nine zero" | camel climb flight level zero nine zero |
| "Contact Tengah Approach one three zero decimal zero" | contact tengah approach one three zero decimal zero |
| "Squawk seven seven zero zero" | squawk seven seven zero zero |

A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., CAMEL climb FL090). See the ASTRA simpilot pipeline for the full integration.
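The formatter's 23 rules are not listed here, but two of them can be illustrated in the spirit described (this is a hypothetical reimplementation, not the ASTRA formatter; the callsign set is a made-up subset):

```python
import re

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9", "niner": "9"}
# Longest alternatives first so "niner" is tried before "nine"
ALT = "|".join(sorted(DIGITS, key=len, reverse=True))
FL_RE = re.compile(r"flight level ((?:%s)(?: (?:%s)){1,2})" % (ALT, ALT))

CALLSIGNS = {"camel", "mekong", "taipan", "scorpion"}  # illustrative subset

def display_format(text: str) -> str:
    """Apply two example rules: flight-level collapsing, callsign uppercasing."""
    # Rule 1: "flight level zero nine zero" -> "FL090"
    text = FL_RE.sub(
        lambda m: "FL" + "".join(DIGITS[w] for w in m.group(1).split()),
        text,
    )
    # Rule 2: uppercase a leading callsign
    words = text.split()
    if words and words[0] in CALLSIGNS:
        words[0] = words[0].upper()
    return " ".join(words)
```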