feat: update ASR model, mark LLM as legacy
Browse files- ASR/README.md +55 -23
- ASR/config.json +2 -1
- ASR/hyperparameters.md +82 -0
- ASR/model.bin +1 -1
- LLM/README.md +15 -1
- README.md +41 -61
ASR/README.md
CHANGED
|
@@ -11,7 +11,7 @@ tags:
|
|
| 11 |
- singapore
|
| 12 |
- military
|
| 13 |
- faster-whisper
|
| 14 |
-
base_model:
|
| 15 |
pipeline_tag: automatic-speech-recognition
|
| 16 |
metrics:
|
| 17 |
- wer
|
|
@@ -23,7 +23,7 @@ model-index:
|
|
| 23 |
metrics:
|
| 24 |
- name: WER
|
| 25 |
type: wer
|
| 26 |
-
value: 0.
|
| 27 |
---
|
| 28 |
|
| 29 |
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
|
|
@@ -32,41 +32,63 @@ Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech r
|
|
| 32 |
|
| 33 |
## Performance
|
| 34 |
|
| 35 |
-
| Run | WER | Data | Key Change |
|
| 36 |
-
|-----|-----|------|------------|
|
| 37 |
-
| ct2_run5 | 0.48% | 6,680 synthetic | Baseline fine-tune |
|
| 38 |
-
| ct2_run6 | 0.40% | 6,680 synthetic | +augmentation, weight decay |
|
| 39 |
-
|
|
|
|
|
|
|
|
|
|
|
| 40 |
|
| 41 |
## Model Details
|
| 42 |
|
| 43 |
| Key | Value |
|
| 44 |
|-----|-------|
|
| 45 |
-
| Base model | `
|
| 46 |
| Format | CTranslate2 float16 |
|
| 47 |
| Size | 2.9 GB |
|
| 48 |
-
|
|
|
|
|
| 49 |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
|
| 50 |
|
| 51 |
## Training
|
| 52 |
|
| 53 |
-
- **
|
| 54 |
-
-
|
| 55 |
-
- Learning rate:
|
| 56 |
-
-
|
| 57 |
-
- Effective batch size: 16
|
| 58 |
- Mixed precision: fp16
|
| 59 |
-
-
|
| 60 |
-
|
| 61 |
-
### Dataset
|
| 62 |
|
| 63 |
-
|
| 64 |
-
- 50 real human recordings (20x oversampled = 1,000 effective entries)
|
| 65 |
-
- Total: 6,730 entries
|
| 66 |
|
| 67 |
### Augmentation
|
| 68 |
|
| 69 |
-
Gaussian noise
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
## Usage
|
| 72 |
|
|
@@ -78,7 +100,17 @@ segments, info = model.transcribe(
|
|
| 78 |
"audio.wav",
|
| 79 |
language="en",
|
| 80 |
beam_size=5,
|
| 81 |
-
hotwords=
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
)
|
| 83 |
text = " ".join(seg.text.strip() for seg in segments)
|
| 84 |
# "camel cleared i l s approach runway three six"
|
|
@@ -94,4 +126,4 @@ The model outputs **normalized spoken text** (lowercase, fully expanded):
|
|
| 94 |
| "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
|
| 95 |
| "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
|
| 96 |
|
| 97 |
-
|
|
|
|
| 11 |
- singapore
|
| 12 |
- military
|
| 13 |
- faster-whisper
|
| 14 |
+
base_model: openai/whisper-large-v3
|
| 15 |
pipeline_tag: automatic-speech-recognition
|
| 16 |
metrics:
|
| 17 |
- wer
|
|
|
|
| 23 |
metrics:
|
| 24 |
- name: WER
|
| 25 |
type: wer
|
| 26 |
+
value: 0.66
|
| 27 |
---
|
| 28 |
|
| 29 |
# Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
|
|
|
|
| 32 |
|
| 33 |
## Performance
|
| 34 |
|
| 35 |
+
| Run | WER | Base | Data | Key Change |
|
| 36 |
+
|-----|-----|------|------|------------|
|
| 37 |
+
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
|
| 38 |
+
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
|
| 39 |
+
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
|
| 40 |
+
| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |
|
| 41 |
+
|
| 42 |
+
> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
|
| 43 |
|
| 44 |
## Model Details
|
| 45 |
|
| 46 |
| Key | Value |
|
| 47 |
|-----|-------|
|
| 48 |
+
| Base model | `openai/whisper-large-v3` |
|
| 49 |
| Format | CTranslate2 float16 |
|
| 50 |
| Size | 2.9 GB |
|
| 51 |
+
| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
|
| 52 |
+
| Best WER | 0.66% (epoch 6) |
|
| 53 |
| Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
|
| 54 |
|
| 55 |
## Training
|
| 56 |
|
| 57 |
+
- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
|
| 58 |
+
- Optimizer: AdamW 8-bit (bitsandbytes)
|
| 59 |
+
- Learning rate: 1e-5 with linear schedule, 5% warmup
|
| 60 |
+
- Effective batch size: 16 (1 per device x 16 gradient accumulation)
|
|
|
|
| 61 |
- Mixed precision: fp16
|
| 62 |
+
- Gradient checkpointing: enabled
|
| 63 |
+
- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)
|
|
|
|
| 64 |
|
| 65 |
+
See [hyperparameters.md](./hyperparameters.md) for full training configuration.
|
|
|
|
|
|
|
| 66 |
|
| 67 |
### Augmentation
|
| 68 |
|
| 69 |
+
- Gaussian noise (p=0.4, amplitude 0.001-0.015)
|
| 70 |
+
- Time stretch (p=0.3, rate 0.9-1.1)
|
| 71 |
+
- Random silence padding (p=0.5, 0-0.7s each end)
|
| 72 |
+
- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
|
| 73 |
+
- Clipping (p=0.2, +/-0.8)
|
| 74 |
+
- MP3 compression (p=0.3, 32-64 kbps)
|
| 75 |
+
- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
|
| 76 |
+
|
| 77 |
+
### Results
|
| 78 |
+
|
| 79 |
+
| Epoch | Eval loss | WER |
|
| 80 |
+
|-------|-----------|-----|
|
| 81 |
+
| 1.0 | 0.0496 | 3.46% |
|
| 82 |
+
| 2.0 | 0.0288 | 1.84% |
|
| 83 |
+
| 3.0 | 0.0239 | 0.82% |
|
| 84 |
+
| 4.0 | 0.0245 | 1.55% |
|
| 85 |
+
| 5.0 | 0.0195 | 0.92% |
|
| 86 |
+
| **6.0** | 0.0231 | **0.66%** |
|
| 87 |
+
| 7.0 | 0.0199 | 0.70% |
|
| 88 |
+
| 8.0 | 0.0211 | 2.62% |
|
| 89 |
+
| 9.0 | 0.0191 | 0.72% |
|
| 90 |
+
| 10.0 | 0.0186 | 4.43% |
|
| 91 |
+
| 11.0 | 0.0172 | 0.69% |
|
| 92 |
|
| 93 |
## Usage
|
| 94 |
|
|
|
|
| 100 |
"audio.wav",
|
| 101 |
language="en",
|
| 102 |
beam_size=5,
|
| 103 |
+
hotwords=(
|
| 104 |
+
"tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
|
| 105 |
+
"sembawang macritchie johor tekong batam hosba sijan changi nylon "
|
| 106 |
+
"arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
|
| 107 |
+
"qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
|
| 108 |
+
"glidepath centreline talkdown sigmet cavok colour "
|
| 109 |
+
"downwind crosswind upwind abeam initials pitchout "
|
| 110 |
+
"mekong taipan kingcup scorpion scallop termite carlton snakefly "
|
| 111 |
+
"basking pelican cobra earlgrey bluebell maverick wolfman stinger "
|
| 112 |
+
"jaguar lancer niner decimal flight level runway"
|
| 113 |
+
),
|
| 114 |
)
|
| 115 |
text = " ".join(seg.text.strip() for seg in segments)
|
| 116 |
# "camel cleared i l s approach runway three six"
|
|
|
|
| 126 |
| "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
|
| 127 |
| "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
|
| 128 |
|
| 129 |
+
A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.
|
ASR/config.json
CHANGED
|
@@ -145,6 +145,7 @@
|
|
| 145 |
],
|
| 146 |
"suppress_ids": [],
|
| 147 |
"suppress_ids_begin": [
|
| 148 |
-
220
|
|
|
|
| 149 |
]
|
| 150 |
}
|
|
|
|
| 145 |
],
|
| 146 |
"suppress_ids": [],
|
| 147 |
"suppress_ids_begin": [
|
| 148 |
+
220,
|
| 149 |
+
50257
|
| 150 |
]
|
| 151 |
}
|
ASR/hyperparameters.md
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Hyperparameters — Whisper ATC Fine-tune
|
| 2 |
+
|
| 3 |
+
## Model
|
| 4 |
+
|
| 5 |
+
| Key | Value |
|
| 6 |
+
|-----|-------|
|
| 7 |
+
| Base model | `openai/whisper-large-v3` |
|
| 8 |
+
| Architecture | Whisper Large v3 |
|
| 9 |
+
| d_model | 1280 |
|
| 10 |
+
| Encoder layers | 32 |
|
| 11 |
+
| Decoder layers | 32 |
|
| 12 |
+
| Encoder attention heads | 20 |
|
| 13 |
+
| Decoder attention heads | 20 |
|
| 14 |
+
| Mel bins | 128 |
|
| 15 |
+
|
| 16 |
+
## Training
|
| 17 |
+
|
| 18 |
+
| Key | Value |
|
| 19 |
+
|-----|-------|
|
| 20 |
+
| Optimizer | AdamW (bitsandbytes 8-bit) |
|
| 21 |
+
| Learning rate | 1e-05 |
|
| 22 |
+
| LR scheduler | Linear |
|
| 23 |
+
| Warmup ratio | 0.05 |
|
| 24 |
+
| Adam β₁ / β₂ / ε | 0.9 / 0.999 / 1e-8 |
|
| 25 |
+
| Weight decay | 0.01 |
|
| 26 |
+
| Per-device train batch size | 1 |
|
| 27 |
+
| Per-device eval batch size | 8 |
|
| 28 |
+
| Gradient accumulation steps | 16 |
|
| 29 |
+
| Effective batch size | 16 |
|
| 30 |
+
| Gradient checkpointing | Yes (use_reentrant=False) |
|
| 31 |
+
| Mixed precision | fp16 |
|
| 32 |
+
| Max grad norm | 1.0 |
|
| 33 |
+
| Max epochs (configured) | 25 |
|
| 34 |
+
| Early stop patience | 5 epochs |
|
| 35 |
+
| Label smoothing | 0.0 |
|
| 36 |
+
| Freeze encoder | No |
|
| 37 |
+
| Seed | 42 |
|
| 38 |
+
|
| 39 |
+
## Augmentation
|
| 40 |
+
|
| 41 |
+
- Gaussian noise (p=0.4, amplitude 0.001–0.015)
|
| 42 |
+
- Time stretch (p=0.3, rate 0.9–1.1)
|
| 43 |
+
- Random silence padding (p=0.5, 0–0.7s each end)
|
| 44 |
+
- BandPassFilter (p=0.75, 300–3400 Hz, VHF radio simulation)
|
| 45 |
+
- Clip (p=0.2, ±0.8)
|
| 46 |
+
- Mp3Compression (p=0.3, 32–64 kbps)
|
| 47 |
+
- SpecAugment: FrequencyMasking(freq\_mask\_param=27) + TimeMasking(time\_mask\_param=100, p=0.05)
|
| 48 |
+
|
| 49 |
+
## Early stopping
|
| 50 |
+
|
| 51 |
+
| Key | Value |
|
| 52 |
+
|-----|-------|
|
| 53 |
+
| Metric | WER (lower is better) |
|
| 54 |
+
| Stopped at | Step 6919 / Epoch 11 |
|
| 55 |
+
| Patience | 5 epochs |
|
| 56 |
+
|
| 57 |
+
## Results
|
| 58 |
+
|
| 59 |
+
| Epoch | Eval loss | WER |
|
| 60 |
+
|-------|-----------|-----|
|
| 61 |
+
| 1.0 | 0.0496 | 3.46% |
|
| 62 |
+
| 2.0 | 0.0288 | 1.84% |
|
| 63 |
+
| 3.0 | 0.0239 | 0.82% |
|
| 64 |
+
| 4.0 | 0.0245 | 1.55% |
|
| 65 |
+
| 5.0 | 0.0195 | 0.92% |
|
| 66 |
+
| 6.0 | 0.0231 | **0.66%** ← best |
|
| 67 |
+
| 7.0 | 0.0199 | 0.70% |
|
| 68 |
+
| 8.0 | 0.0211 | 2.62% |
|
| 69 |
+
| 9.0 | 0.0191 | 0.72% |
|
| 70 |
+
| 10.0 | 0.0186 | 4.43% |
|
| 71 |
+
| 11.0 | 0.0172 | 0.69% |
|
| 72 |
+
|
| 73 |
+
Best checkpoint: `training/output_run8/checkpoint-3774` (epoch 6, WER 0.66%)
|
| 74 |
+
|
| 75 |
+
## Output
|
| 76 |
+
|
| 77 |
+
| Key | Value |
|
| 78 |
+
|-----|-------|
|
| 79 |
+
| Best HF checkpoint | `training/output_run8/best/` |
|
| 80 |
+
| CTranslate2 model | `training/saved_models/ct2_run8/` |
|
| 81 |
+
| Quantization | float16 |
|
| 82 |
+
| Inference backend | faster-whisper |
|
ASR/model.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 3087284237
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d9c466b737a94599b153a4e396dc51e321283e911b8ef59d28e687ff72564874
|
| 3 |
size 3087284237
|
LLM/README.md
CHANGED
|
@@ -12,10 +12,13 @@ tags:
|
|
| 12 |
- military
|
| 13 |
- lora
|
| 14 |
- unsloth
|
|
|
|
| 15 |
base_model: unsloth/Qwen3-1.7B
|
| 16 |
---
|
| 17 |
|
| 18 |
-
# Qwen3-1.7B — ATC Display Text Formatter
|
|
|
|
|
|
|
| 19 |
|
| 20 |
Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC display text. Designed to work downstream of the companion Whisper ASR model.
|
| 21 |
|
|
@@ -27,6 +30,17 @@ Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC di
|
|
| 27 |
| Avg character edit distance | 0.0 |
|
| 28 |
| Best eval loss | 0.0005 |
|
| 29 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
## Model Details
|
| 31 |
|
| 32 |
| Key | Value |
|
|
|
|
| 12 |
- military
|
| 13 |
- lora
|
| 14 |
- unsloth
|
| 15 |
+
- legacy
|
| 16 |
base_model: unsloth/Qwen3-1.7B
|
| 17 |
---
|
| 18 |
|
| 19 |
+
# Qwen3-1.7B — ATC Display Text Formatter (Legacy)
|
| 20 |
+
|
| 21 |
+
> **Status: Legacy.** This model has been superseded by a deterministic rule-based formatter (23 rules, <1ms, 0 VRAM) that achieves equivalent accuracy on all production ATC patterns. The rule-based formatter is now used exclusively in the ASTRA pipeline. This model is retained for reference and potential future use with novel/unseen patterns.
|
| 22 |
|
| 23 |
Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC display text. Designed to work downstream of the companion Whisper ASR model.
|
| 24 |
|
|
|
|
| 30 |
| Avg character edit distance | 0.0 |
|
| 31 |
| Best eval loss | 0.0005 |
|
| 32 |
|
| 33 |
+
## Why Legacy?
|
| 34 |
+
|
| 35 |
+
The rule-based formatter now handles all production patterns:
|
| 36 |
+
- **Speed**: <1ms vs ~250ms per inference
|
| 37 |
+
- **VRAM**: 0 GB vs ~3.3 GB
|
| 38 |
+
- **Determinism**: 100% reproducible output, no sampling variance
|
| 39 |
+
- **Auditability**: Each of the 23 rules is individually testable
|
| 40 |
+
- **Coverage**: Handles all callsigns, locations, numeric patterns, and ATC abbreviations seen in training data
|
| 41 |
+
|
| 42 |
+
The LLM remains useful if novel patterns emerge that the rule-based system cannot handle.
|
| 43 |
+
|
| 44 |
## Model Details
|
| 45 |
|
| 46 |
| Key | Value |
|
README.md
CHANGED
|
@@ -17,13 +17,18 @@ pipeline_tag: automatic-speech-recognition
|
|
| 17 |
|
| 18 |
# ASTRA ATC Models
|
| 19 |
|
| 20 |
-
Fine-tuned
|
|
|
|
|
|
|
| 21 |
|
| 22 |
```
|
| 23 |
-
Audio -->
|
| 24 |
-
|
|
|
|
| 25 |
```
|
| 26 |
|
|
|
|
|
|
|
| 27 |
## Models
|
| 28 |
|
| 29 |
### [ASR/](./ASR) — Whisper Large v3 (CTranslate2 float16)
|
|
@@ -32,72 +37,65 @@ Fine-tuned for Singapore military ATC speech. Uses CTranslate2 float16 format fo
|
|
| 32 |
|
| 33 |
| Metric | Value |
|
| 34 |
|--------|-------|
|
| 35 |
-
| WER | **0.
|
| 36 |
-
| Base model | `
|
| 37 |
| Size | 2.9 GB |
|
| 38 |
-
| Training
|
|
|
|
|
|
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
Converts normalized ASR output into structured ATC display text
|
| 43 |
|
| 44 |
| Metric | Value |
|
| 45 |
|--------|-------|
|
| 46 |
| Exact match | **100%** (161/161) |
|
| 47 |
| Base model | `unsloth/Qwen3-1.7B` |
|
| 48 |
| Size | 3.3 GB |
|
| 49 |
-
| Training data | 1,915 examples |
|
| 50 |
|
| 51 |
-
##
|
| 52 |
-
|
| 53 |
-
In production, the models are chained with **confidence-based routing**:
|
| 54 |
-
|
| 55 |
-
- **ASR confidence >= 90%** — rule-based formatter (23 deterministic rules, <1ms, 0 VRAM)
|
| 56 |
-
- **ASR confidence < 90%** — LLM formatter (handles noisy/ambiguous ASR output better)
|
| 57 |
|
| 58 |
```
|
| 59 |
-
Audio --> VAD (Silero) --> ASR (Whisper ct2) --> Post-processing
|
| 60 |
-
|
|
| 61 |
-
confidence >= 0.90?
|
| 62 |
-
/ \
|
| 63 |
-
yes no
|
| 64 |
-
| |
|
| 65 |
-
Rule formatter LLM formatter
|
| 66 |
-
| |
|
| 67 |
-
\ /
|
| 68 |
-
--> Display text
|
| 69 |
```
|
| 70 |
|
| 71 |
-
|
|
| 72 |
-
|-------|------|
|
| 73 |
-
|
|
| 74 |
-
| ASR
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
## Domain
|
| 77 |
|
| 78 |
Singapore military ATC covering:
|
| 79 |
- **Airbases**: Tengah (WSAT, runway 18/36), Paya Lebar (WSAP, runway 02/20)
|
| 80 |
-
- **Aircraft**: F-16C/D, F-15SG, C-130
|
| 81 |
-
- **Approaches**: ILS, GCA, PAR, TACAN, DVOR/DME, Visual Straight-in
|
| 82 |
-
- **
|
| 83 |
- **Categories**: departure, approach, handoff, maneuver, landing, emergency, ground, recovery, pilot reports, military-specific ops
|
| 84 |
|
| 85 |
## Training History
|
| 86 |
|
| 87 |
### ASR
|
| 88 |
|
| 89 |
-
| Run | WER | Key Change |
|
| 90 |
-
|-----|-----|------------|
|
| 91 |
-
| ct2_run5 | 0.48% | Initial fine-tune
|
| 92 |
-
| ct2_run6 | 0.40% |
|
| 93 |
-
|
|
|
|
|
| 94 |
|
| 95 |
-
|
|
|
|
|
|
|
| 96 |
|
| 97 |
| Run | Accuracy | Key Change |
|
| 98 |
|-----|----------|------------|
|
| 99 |
| llm_run3 | 98.1% (Qwen3-8B) | QLoRA 4-bit, 871 examples |
|
| 100 |
-
|
|
| 101 |
|
| 102 |
## Quick Start
|
| 103 |
|
|
@@ -111,33 +109,15 @@ segments, info = model.transcribe("audio.wav", language="en", beam_size=5)
|
|
| 111 |
text = " ".join(seg.text.strip() for seg in segments)
|
| 112 |
```
|
| 113 |
|
| 114 |
-
###
|
| 115 |
-
|
| 116 |
-
```python
|
| 117 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 118 |
-
|
| 119 |
-
model = AutoModelForCausalLM.from_pretrained("./LLM", torch_dtype="auto", device_map="auto")
|
| 120 |
-
tokenizer = AutoTokenizer.from_pretrained("./LLM")
|
| 121 |
-
|
| 122 |
-
messages = [
|
| 123 |
-
{"role": "system", "content": "Convert the following air traffic control transcript into structured display text."},
|
| 124 |
-
{"role": "user", "content": "camel climb flight level zero nine zero"},
|
| 125 |
-
]
|
| 126 |
-
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
|
| 127 |
-
inputs = tokenizer(text, return_tensors="pt").to(model.device)
|
| 128 |
-
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.3, top_p=0.9, top_k=30)
|
| 129 |
-
result = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
|
| 130 |
-
```
|
| 131 |
-
|
| 132 |
-
## Download
|
| 133 |
|
| 134 |
```bash
|
| 135 |
-
# Full repo
|
| 136 |
huggingface-cli download aether-raid/astra-atc-models --local-dir ./models
|
| 137 |
|
| 138 |
-
# ASR only
|
| 139 |
huggingface-cli download aether-raid/astra-atc-models --include "ASR/*" --local-dir ./models
|
| 140 |
|
| 141 |
-
# LLM only
|
| 142 |
huggingface-cli download aether-raid/astra-atc-models --include "LLM/*" --local-dir ./models
|
| 143 |
```
|
|
|
|
| 17 |
|
| 18 |
# ASTRA ATC Models
|
| 19 |
|
| 20 |
+
Fine-tuned models for Singapore military air traffic control, built for the [ASTRA](https://github.com/aether-raid) training simulator.
|
| 21 |
+
|
| 22 |
+
## Pipeline
|
| 23 |
|
| 24 |
```
|
| 25 |
+
Audio --> VAD (Silero) --> ASR (Whisper) --> Rule Formatter --> Display Text
|
| 26 |
+
"camel climb flight level zero nine zero"
|
| 27 |
+
"CAMEL climb FL090"
|
| 28 |
```
|
| 29 |
|
| 30 |
+
The production pipeline uses a **rule-based formatter** (23 deterministic rules, <1ms, 0 VRAM) instead of the LLM. The LLM is retained for reference.
|
| 31 |
+
|
| 32 |
## Models
|
| 33 |
|
| 34 |
### [ASR/](./ASR) — Whisper Large v3 (CTranslate2 float16)
|
|
|
|
| 37 |
|
| 38 |
| Metric | Value |
|
| 39 |
|--------|-------|
|
| 40 |
+
| WER | **0.66%** |
|
| 41 |
+
| Base model | `openai/whisper-large-v3` |
|
| 42 |
| Size | 2.9 GB |
|
| 43 |
+
| Training | Full fine-tune with enhanced VHF radio augmentation |
|
| 44 |
+
|
| 45 |
+
### [LLM/](./LLM) — Qwen3-1.7B Display Formatter (Legacy)
|
| 46 |
|
| 47 |
+
> **Legacy.** Superseded by a deterministic rule-based formatter. Retained for reference.
|
| 48 |
|
| 49 |
+
Converts normalized ASR output into structured ATC display text.
|
| 50 |
|
| 51 |
| Metric | Value |
|
| 52 |
|--------|-------|
|
| 53 |
| Exact match | **100%** (161/161) |
|
| 54 |
| Base model | `unsloth/Qwen3-1.7B` |
|
| 55 |
| Size | 3.3 GB |
|
|
|
|
| 56 |
|
| 57 |
+
## Architecture
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
|
| 59 |
```
|
| 60 |
+
Audio --> VAD (Silero) --> ASR (Whisper ct2) --> Post-processing --> Rule Formatter --> Display Text
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 61 |
```
|
| 62 |
|
| 63 |
+
| Component | Technology | Latency | VRAM |
|
| 64 |
+
|-----------|-----------|---------|------|
|
| 65 |
+
| VAD | Silero VAD (ONNX) | ~50ms | <100 MB |
|
| 66 |
+
| ASR | Whisper Large v3 (CTranslate2) | ~500ms-2s | ~2 GB |
|
| 67 |
+
| Formatter | 23 deterministic rules | <1ms | 0 MB |
|
| 68 |
+
|
| 69 |
+
Total VRAM: ~2 GB (ASR only).
|
| 70 |
|
| 71 |
## Domain
|
| 72 |
|
| 73 |
Singapore military ATC covering:
|
| 74 |
- **Airbases**: Tengah (WSAT, runway 18/36), Paya Lebar (WSAP, runway 02/20)
|
| 75 |
+
- **Aircraft**: F-16C/D, F-15SG, C-130, Hercules
|
| 76 |
+
- **Approaches**: ILS, GCA, PAR, TACAN, DVOR/DME, VOR/DME, Visual Straight-in
|
| 77 |
+
- **100+ callsigns**: CAMEL, NINJA, BEETLE, TAIPAN, MAVERICK, JAGUAR, LANCER, etc.
|
| 78 |
- **Categories**: departure, approach, handoff, maneuver, landing, emergency, ground, recovery, pilot reports, military-specific ops
|
| 79 |
|
| 80 |
## Training History
|
| 81 |
|
| 82 |
### ASR
|
| 83 |
|
| 84 |
+
| Run | WER | Base | Key Change |
|
| 85 |
+
|-----|-----|------|------------|
|
| 86 |
+
| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | Initial fine-tune |
|
| 87 |
+
| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | +augmentation, weight decay |
|
| 88 |
+
| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | Frozen encoder, +50 real recordings |
|
| 89 |
+
| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain from base, enhanced augmentation |
|
| 90 |
|
| 91 |
+
> ct2_run8 trains from the original Whisper base for better generalisation to real-world ATC audio.
|
| 92 |
+
|
| 93 |
+
### LLM (Legacy)
|
| 94 |
|
| 95 |
| Run | Accuracy | Key Change |
|
| 96 |
|-----|----------|------------|
|
| 97 |
| llm_run3 | 98.1% (Qwen3-8B) | QLoRA 4-bit, 871 examples |
|
| 98 |
+
| llm_run4 | 100% (Qwen3-1.7B) | bf16 LoRA, 1,915 examples with ASR noise augmentation |
|
| 99 |
|
| 100 |
## Quick Start
|
| 101 |
|
|
|
|
| 109 |
text = " ".join(seg.text.strip() for seg in segments)
|
| 110 |
```
|
| 111 |
|
| 112 |
+
### Download
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
|
| 114 |
```bash
|
| 115 |
+
# Full repo (ASR + LLM)
|
| 116 |
huggingface-cli download aether-raid/astra-atc-models --local-dir ./models
|
| 117 |
|
| 118 |
+
# ASR only (recommended)
|
| 119 |
huggingface-cli download aether-raid/astra-atc-models --include "ASR/*" --local-dir ./models
|
| 120 |
|
| 121 |
+
# LLM only (legacy)
|
| 122 |
huggingface-cli download aether-raid/astra-atc-models --include "LLM/*" --local-dir ./models
|
| 123 |
```
|