---
language:
- en
license: other
tags:
- whisper
- ctranslate2
- automatic-speech-recognition
- air-traffic-control
- atc
- singapore
- military
- faster-whisper
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
metrics:
- wer
model-index:
- name: whisper-large-v3-atc-singapore
  results:
  - task:
      type: automatic-speech-recognition
    metrics:
    - name: WER
      type: wer
      value: 0.66
---
| |
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)

Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech recognition.

## Performance

| | Run | WER | Base | Data | Key Change | |
| |-----|-----|------|------|------------| |
| | ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune | |
| | ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay | |
| | ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder | |
| | **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation | |

> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). Although its eval-set WER is numerically higher than run7's, run8 generalises better to real-world ATC audio because it trains from a more general acoustic foundation with aggressive VHF radio-simulation augmentation.

## Model Details

| | Key | Value | |
| |-----|-------| |
| | Base model | `openai/whisper-large-v3` | |
| | Format | CTranslate2 float16 | |
| | Size | 2.9 GB | |
| | Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) | |
| | Best WER | 0.66% (epoch 6) | |
| | Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) | |
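
The CTranslate2 float16 export noted above can be produced with the converter CLI that ships with CTranslate2. A sketch only; both directory paths are illustrative placeholders, not the actual checkpoint names:

```shell
# Convert a fine-tuned Hugging Face Whisper checkpoint to CTranslate2 float16.
# Replace both paths with your local checkpoint and output directories.
ct2-transformers-converter \
  --model ./whisper-large-v3-atc-singapore \
  --output_dir ./whisper-large-v3-atc-ct2 \
  --quantization float16
```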

## Training

- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
- Optimizer: AdamW 8-bit (bitsandbytes)
- Learning rate: 1e-5 with linear schedule, 5% warmup
- Effective batch size: 16 (1 per device × 16 gradient-accumulation steps)
- Mixed precision: fp16
- Gradient checkpointing: enabled
- Early stopping: patience 5 epochs (stopped at epoch 11; best at epoch 6)

See [hyperparameters.md](./hyperparameters.md) for the full training configuration.
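
The effective batch size of 16 comes from gradient accumulation rather than a large per-device batch. A minimal sketch of the arithmetic (the helper name is hypothetical, not part of the training code):

```python
ACCUM_STEPS = 16  # gradient-accumulation steps, per the config above

def count_updates(num_examples: int, accum_steps: int = ACCUM_STEPS) -> int:
    """Count optimizer updates for micro-batches of size 1.

    loss.backward() would run every step; optimizer.step() and
    zero_grad() run only once per `accum_steps` micro-batches, so each
    update averages gradients over an effective batch of 16 examples.
    """
    updates = 0
    for step in range(1, num_examples + 1):
        if step % accum_steps == 0:
            updates += 1
    return updates
```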

### Augmentation

- Gaussian noise (p=0.4, amplitude 0.001-0.015)
- Time stretch (p=0.3, rate 0.9-1.1)
- Random silence padding (p=0.5, 0-0.7 s at each end)
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
- Clipping (p=0.2, +/-0.8)
- MP3 compression (p=0.3, 32-64 kbps)
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
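
A plain-Python sketch of three of these waveform-level steps, using the probabilities and ranges listed above (the real pipeline presumably uses an augmentation library; the function name and `seed` parameter here are illustrative):

```python
import random

def augment(samples, sr=16000, seed=None):
    """Apply Gaussian noise, random silence padding, and clipping.

    `samples` is a list of floats in [-1, 1] at sample rate `sr`.
    """
    rng = random.Random(seed)
    out = list(samples)
    # Gaussian noise (p=0.4, amplitude 0.001-0.015)
    if rng.random() < 0.4:
        amp = rng.uniform(0.001, 0.015)
        out = [x + rng.gauss(0.0, amp) for x in out]
    # Random silence padding (p=0.5, 0-0.7 s at each end)
    if rng.random() < 0.5:
        left = [0.0] * int(rng.uniform(0.0, 0.7) * sr)
        right = [0.0] * int(rng.uniform(0.0, 0.7) * sr)
        out = left + out + right
    # Clipping (p=0.2, clamp to +/-0.8)
    if rng.random() < 0.2:
        out = [max(-0.8, min(0.8, x)) for x in out]
    return out
```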

### Results

| | Epoch | Eval loss | WER | |
| |-------|-----------|-----| |
| | 1.0 | 0.0496 | 3.46% | |
| | 2.0 | 0.0288 | 1.84% | |
| | 3.0 | 0.0239 | 0.82% | |
| | 4.0 | 0.0245 | 1.55% | |
| | 5.0 | 0.0195 | 0.92% | |
| | **6.0** | 0.0231 | **0.66%** | |
| | 7.0 | 0.0199 | 0.70% | |
| | 8.0 | 0.0211 | 2.62% | |
| | 9.0 | 0.0191 | 0.72% | |
| | 10.0 | 0.0186 | 4.43% | |
| | 11.0 | 0.0172 | 0.69% | |

## Usage

```python
from faster_whisper import WhisperModel

model = WhisperModel("path/to/ASR", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",
    language="en",
    beam_size=5,
    hotwords=(
        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
        "glidepath centreline talkdown sigmet cavok colour "
        "downwind crosswind upwind abeam initials pitchout "
        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
        "jaguar lancer niner decimal flight level runway"
    ),
)
text = " ".join(seg.text.strip() for seg in segments)
# "camel cleared i l s approach runway three six"
```

## Output Format

The model outputs **normalized spoken text** (lowercase, fully expanded):

| | Input audio says | Model outputs | |
| |-----------------|---------------| |
| | "CAMEL climb flight level zero nine zero" | `camel climb flight level zero nine zero` | |
| | "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` | |
| | "Squawk seven seven zero zero" | `squawk seven seven zero zero` | |

A companion rule-based formatter (23 deterministic rules, <1 ms, zero VRAM) converts this to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.
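
The formatter itself is not part of this repository. To illustrate the idea, here is a sketch of two such deterministic rules; the function names and the specific rules are illustrative, not the actual 23-rule set:

```python
import re

DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9", "niner": "9",
}

def collapse_digits(words):
    """Join each run of spoken digits into one numeral group."""
    out, run = [], []
    for w in words:
        if w in DIGIT_WORDS:
            run.append(DIGIT_WORDS[w])
        else:
            if run:
                out.append("".join(run))
                run = []
            out.append(w)
    if run:
        out.append("".join(run))
    return out

def format_atc(text: str) -> str:
    words = text.split()
    if words:
        words[0] = words[0].upper()  # illustrative rule: uppercase the leading callsign
    joined = " ".join(collapse_digits(words))
    # illustrative rule: "flight level 090" -> "FL090"
    return re.sub(r"\bflight level (\d+)\b", r"FL\1", joined)
```

For example, `format_atc("camel climb flight level zero nine zero")` yields `CAMEL climb FL090`.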