---
language:
  - en
license: other
tags:
  - whisper
  - ctranslate2
  - automatic-speech-recognition
  - air-traffic-control
  - atc
  - singapore
  - military
  - faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
  - wer
model-index:
  - name: whisper-large-v3-atc-singapore
    results:
      - task:
          type: automatic-speech-recognition
        metrics:
          - name: WER
            type: wer
            value: 0.66
---

# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.

## Performance

| Run | WER | Base | Data | Key change |
|---|---|---|---|---|
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
| ct2_run8 | 0.66% | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |

**Note:** ct2_run8 starts from the original openai/whisper-large-v3 base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
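For reference, the WER figures above are standard word error rate: word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. Evaluation pipelines typically use a library such as `jiwer`; the sketch below is a minimal self-contained reimplementation for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over seven reference words -> 1/7
print(wer("camel cleared ils approach runway three six",
          "camel cleared ils approach runway tree six"))
```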

## Model Details

| Key | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Format | CTranslate2 float16 |
| Size | 2.9 GB |
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
| Best WER | 0.66% (epoch 6) |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |

## Training

- Full fine-tune from openai/whisper-large-v3 (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient accumulation)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)

See hyperparameters.md for full training configuration.
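The settings listed above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the actual training script: the argument names come from `transformers`, but `output_dir` and any value not stated in the list above are assumptions.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-atc",    # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,       # effective batch size 1 x 16 = 16
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,                    # 5% warmup
    optim="adamw_bnb_8bit",               # AdamW 8-bit via bitsandbytes
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,              # lower WER is better
)
# Early stopping with patience 5, matching the run that stopped at epoch 11
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```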

## Augmentation

- Gaussian noise (p=0.4, amplitude 0.001–0.015)
- Time stretch (p=0.3, rate 0.9–1.1)
- Random silence padding (p=0.5, 0–0.7 s each end)
- BandPassFilter (p=0.75, 300–3400 Hz VHF radio simulation)
- Clipping (p=0.2, ±0.8)
- MP3 compression (p=0.3, 32–64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
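The actual pipeline most likely uses an augmentation library; purely to illustrate what three of the waveform-level transforms above do, here is a NumPy-only sketch (function name and structure are my own, not the training code):

```python
import numpy as np

def augment(wave: np.ndarray, sr: int = 16000, rng=None) -> np.ndarray:
    """Illustrative subset of the waveform augmentations listed above."""
    rng = rng or np.random.default_rng()
    # Gaussian noise (p=0.4, amplitude 0.001-0.015)
    if rng.random() < 0.4:
        amp = rng.uniform(0.001, 0.015)
        wave = wave + rng.normal(0.0, amp, size=wave.shape)
    # Random silence padding (p=0.5, 0-0.7 s at each end)
    if rng.random() < 0.5:
        pad_left = int(rng.uniform(0.0, 0.7) * sr)
        pad_right = int(rng.uniform(0.0, 0.7) * sr)
        wave = np.pad(wave, (pad_left, pad_right))
    # Clipping (p=0.2, +/-0.8)
    if rng.random() < 0.2:
        wave = np.clip(wave, -0.8, 0.8)
    return wave.astype(np.float32)
```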

## Results

| Epoch | Eval loss | WER |
|---|---|---|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| 6.0 | 0.0231 | 0.66% |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |

## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
        "glidepath centreline talkdown sigmet cavok colour "
        "downwind crosswind upwind abeam initials pitchout "
        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
        "jaguar lancer niner decimal flight level runway"
    ),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```

## Output Format

The model outputs normalized spoken text (lowercase, fully expanded):

| Input audio says | Model outputs |
|---|---|
| "CAMEL climb flight level zero nine zero" | camel climb flight level zero nine zero |
| "Contact Tengah Approach one three zero decimal zero" | contact tengah approach one three zero decimal zero |
| "Squawk seven seven zero zero" | squawk seven seven zero zero |

A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., CAMEL climb FL090). See the ASTRA simpilot pipeline for the full integration.
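The formatter's 23 rules are not listed here, but two of them can be illustrated in the spirit described (this is a hypothetical reimplementation, not the ASTRA formatter; the callsign set is a made-up subset):

```python
import re

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9", "niner": "9"}
# Longest alternatives first so "niner" is tried before "nine"
ALT = "|".join(sorted(DIGITS, key=len, reverse=True))
FL_RE = re.compile(r"flight level ((?:%s)(?: (?:%s)){1,2})" % (ALT, ALT))

CALLSIGNS = {"camel", "mekong", "taipan", "scorpion"}  # illustrative subset

def display_format(text: str) -> str:
    """Apply two example rules: flight-level collapsing, callsign uppercasing."""
    # Rule 1: "flight level zero nine zero" -> "FL090"
    text = FL_RE.sub(
        lambda m: "FL" + "".join(DIGITS[w] for w in m.group(1).split()),
        text,
    )
    # Rule 2: uppercase a leading callsign
    words = text.split()
    if words and words[0] in CALLSIGNS:
        words[0] = words[0].upper()
    return " ".join(words)
```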