feat: update ASR model, mark LLM as legacy

Browse files

Files changed (6) hide show

ASR/README.md +55 -23
ASR/config.json +2 -1
ASR/hyperparameters.md +82 -0
ASR/model.bin +1 -1
LLM/README.md +15 -1
README.md +41 -61

ASR/README.md CHANGED Viewed

@@ -11,7 +11,7 @@ tags:
   - singapore
   - military
   - faster-whisper
-base_model: jacktol/whisper-large-v3-finetuned-for-ATC
 pipeline_tag: automatic-speech-recognition
 metrics:
   - wer
@@ -23,7 +23,7 @@ model-index:
         metrics:
           - name: WER
             type: wer
-            value: 0.24
 ---
 # Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
@@ -32,41 +32,63 @@ Fine-tuned Whisper Large v3 for Singapore Air Force air traffic control speech r
 ## Performance
-| Run | WER | Data | Key Change |
-|-----|-----|------|------------|
-| ct2_run5 | 0.48% | 6,680 synthetic | Baseline fine-tune |
-| ct2_run6 | 0.40% | 6,680 synthetic | +augmentation, weight decay |
-| **ct2_run7** | **0.24%** | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
 ## Model Details
 | Key | Value |
 |-----|-------|
-| Base model | `jacktol/whisper-large-v3-finetuned-for-ATC` |
 | Format | CTranslate2 float16 |
 | Size | 2.9 GB |
-| Best WER | 0.24% (epoch 1) |
 | Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
 ## Training
-- **Continued training** from ct2_run6 best checkpoint (WER 0.40%)
-- **Encoder frozen** — only decoder fine-tuned to preserve acoustic features
-- Learning rate: 2e-6 (4x lower than run6)
-- Optimizer: AdamW 8-bit
-- Effective batch size: 16
 - Mixed precision: fp16
-- Early stopping: patience 2
-### Dataset
-- 6,680 synthetic entries (1,670 phrases x 4 TTS voice variants)
-- 50 real human recordings (20x oversampled = 1,000 effective entries)
-- Total: 6,730 entries
 ### Augmentation
-Gaussian noise, time stretch, band-pass filter (300-3400 Hz VHF simulation), random clip, MP3 compression, SpecAugment, random silence padding.
 ## Usage
@@ -78,7 +100,17 @@ segments, info = model.transcribe(
     "audio.wav",
     language="en",
     beam_size=5,
-    hotwords="tengah paya lebar tacan sinjon pandan tuas murai seletar sembawang",
 )
 text = " ".join(seg.text.strip() for seg in segments)
 # "camel cleared i l s approach runway three six"
@@ -94,4 +126,4 @@ The model outputs **normalized spoken text** (lowercase, fully expanded):
 | "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
 | "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
-Use the companion LLM formatter to convert to display text (e.g., `CAMEL climb FL090`).

   - singapore
   - military
   - faster-whisper
+base_model: openai/whisper-large-v3
 pipeline_tag: automatic-speech-recognition
 metrics:
   - wer
         metrics:
           - name: WER
             type: wer
+            value: 0.66
 ---
 # Whisper Large v3 — Singapore Military ATC (CTranslate2 float16)
 ## Performance
+| Run | WER | Base | Data | Key Change |
+|-----|-----|------|------|------------|
+| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | Baseline fine-tune |
+| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,680 synthetic | +augmentation, weight decay |
+| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | 6,730 (synthetic + real) | +50 real recordings, frozen encoder |
+| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain | Fresh fine-tune from base, enhanced augmentation |
+> **Note:** ct2_run8 starts from the original `openai/whisper-large-v3` base instead of the pre-finetuned ATC model, and trains the full model (encoder + decoder). While the WER on the eval set is numerically higher than run7, run8 generalises better to real-world ATC audio due to training from a more general acoustic foundation with aggressive VHF radio simulation augmentation.
 ## Model Details
 | Key | Value |
 |-----|-------|
+| Base model | `openai/whisper-large-v3` |
 | Format | CTranslate2 float16 |
 | Size | 2.9 GB |
+| Architecture | Whisper Large v3 (32 encoder + 32 decoder layers, 20 attention heads, d_model=1280) |
+| Best WER | 0.66% (epoch 6) |
 | Domain | Singapore military ATC (Tengah WSAT, Paya Lebar WSAP) |
 ## Training
+- **Full fine-tune** from `openai/whisper-large-v3` (encoder + decoder)
+- Optimizer: AdamW 8-bit (bitsandbytes)
+- Learning rate: 1e-5 with linear schedule, 5% warmup
+- Effective batch size: 16 (1 per device x 16 gradient accumulation)
 - Mixed precision: fp16
+- Gradient checkpointing: enabled
+- Early stopping: patience 5 epochs (stopped at epoch 11, best at epoch 6)
+See [hyperparameters.md](./hyperparameters.md) for full training configuration.
 ### Augmentation
+- Gaussian noise (p=0.4, amplitude 0.001-0.015)
+- Time stretch (p=0.3, rate 0.9-1.1)
+- Random silence padding (p=0.5, 0-0.7s each end)
+- BandPassFilter (p=0.75, 300-3400 Hz VHF radio simulation)
+- Clipping (p=0.2, +/-0.8)
+- MP3 compression (p=0.3, 32-64 kbps)
+- SpecAugment: FrequencyMasking(27) + TimeMasking(100, p=0.05)
+### Results
+| Epoch | Eval loss | WER |
+|-------|-----------|-----|
+| 1.0 | 0.0496 | 3.46% |
+| 2.0 | 0.0288 | 1.84% |
+| 3.0 | 0.0239 | 0.82% |
+| 4.0 | 0.0245 | 1.55% |
+| 5.0 | 0.0195 | 0.92% |
+| **6.0** | 0.0231 | **0.66%** |
+| 7.0 | 0.0199 | 0.70% |
+| 8.0 | 0.0211 | 2.62% |
+| 9.0 | 0.0191 | 0.72% |
+| 10.0 | 0.0186 | 4.43% |
+| 11.0 | 0.0172 | 0.69% |
 ## Usage
     "audio.wav",
     language="en",
     beam_size=5,
+    hotwords=(
+        "tengah paya lebar tacan sinjon sultan shoal seletar tuas pandan murai "
+        "sembawang macritchie johor tekong batam hosba sijan changi nylon "
+        "arama bobag samko remes betba bidus legol envum sudpo dosno venpa "
+        "qnh rtb squawk mayday wilco affirm roger atis metar pirep blind "
+        "glidepath centreline talkdown sigmet cavok colour "
+        "downwind crosswind upwind abeam initials pitchout "
+        "mekong taipan kingcup scorpion scallop termite carlton snakefly "
+        "basking pelican cobra earlgrey bluebell maverick wolfman stinger "
+        "jaguar lancer niner decimal flight level runway"
+    ),
 )
 text = " ".join(seg.text.strip() for seg in segments)
 # "camel cleared i l s approach runway three six"
 | "Contact Tengah Approach one three zero decimal zero" | `contact tengah approach one three zero decimal zero` |
 | "Squawk seven seven zero zero" | `squawk seven seven zero zero` |
+A companion rule-based formatter (23 deterministic rules, <1ms, 0 VRAM) converts to display text (e.g., `CAMEL climb FL090`). See the [ASTRA simpilot](https://github.com/aether-raid) pipeline for the full integration.

ASR/config.json CHANGED Viewed

@@ -145,6 +145,7 @@
   ],
   "suppress_ids": [],
   "suppress_ids_begin": [
-    220
   ]
 }

   ],
   "suppress_ids": [],
   "suppress_ids_begin": [
+    220,
+    50257
   ]
 }

ASR/hyperparameters.md ADDED Viewed

	@@ -0,0 +1,82 @@

+# Hyperparameters — Whisper ATC Fine-tune
+## Model
+| Key | Value |
+|-----|-------|
+| Base model | `openai/whisper-large-v3` |
+| Architecture | Whisper Large v3 |
+| d_model | 1280 |
+| Encoder layers | 32 |
+| Decoder layers | 32 |
+| Encoder attention heads | 20 |
+| Decoder attention heads | 20 |
+| Mel bins | 128 |
+## Training
+| Key | Value |
+|-----|-------|
+| Optimizer | AdamW (bitsandbytes 8-bit) |
+| Learning rate | 1e-05 |
+| LR scheduler | Linear |
+| Warmup ratio | 0.05 |
+| Adam β₁ / β₂ / ε | 0.9 / 0.999 / 1e-8 |
+| Weight decay | 0.01 |
+| Per-device train batch size | 1 |
+| Per-device eval batch size | 8 |
+| Gradient accumulation steps | 16 |
+| Effective batch size | 16 |
+| Gradient checkpointing | Yes (use_reentrant=False) |
+| Mixed precision | fp16 |
+| Max grad norm | 1.0 |
+| Max epochs (configured) | 25 |
+| Early stop patience | 5 epochs |
+| Label smoothing | 0.0 |
+| Freeze encoder | No |
+| Seed | 42 |
+## Augmentation
+- Gaussian noise (p=0.4, amplitude 0.001–0.015)
+- Time stretch (p=0.3, rate 0.9–1.1)
+- Random silence padding (p=0.5, 0–0.7s each end)
+- BandPassFilter (p=0.75, 300–3400 Hz, VHF radio simulation)
+- Clip (p=0.2, ±0.8)
+- Mp3Compression (p=0.3, 32–64 kbps)
+- SpecAugment: FrequencyMasking(freq\_mask\_param=27) + TimeMasking(time\_mask\_param=100, p=0.05)
+## Early stopping
+| Key | Value |
+|-----|-------|
+| Metric | WER (lower is better) |
+| Stopped at | Step 6919 / Epoch 11 |
+| Patience | 5 epochs |
+## Results
+| Epoch | Eval loss | WER |
+|-------|-----------|-----|
+| 1.0 | 0.0496 | 3.46% |
+| 2.0 | 0.0288 | 1.84% |
+| 3.0 | 0.0239 | 0.82% |
+| 4.0 | 0.0245 | 1.55% |
+| 5.0 | 0.0195 | 0.92% |
+| 6.0 | 0.0231 | **0.66%** ← best |
+| 7.0 | 0.0199 | 0.70% |
+| 8.0 | 0.0211 | 2.62% |
+| 9.0 | 0.0191 | 0.72% |
+| 10.0 | 0.0186 | 4.43% |
+| 11.0 | 0.0172 | 0.69% |
+Best checkpoint: `training/output_run8/checkpoint-3774` (epoch 6, WER 0.66%)
+## Output
+| Key | Value |
+|-----|-------|
+| Best HF checkpoint | `training/output_run8/best/` |
+| CTranslate2 model | `training/saved_models/ct2_run8/` |
+| Quantization | float16 |
+| Inference backend | faster-whisper |

ASR/model.bin CHANGED Viewed

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b0be75c051de8f101137150567f68e66f79ed2f37f7fa3bd925576f74ff01fb3
 size 3087284237

 version https://git-lfs.github.com/spec/v1
+oid sha256:d9c466b737a94599b153a4e396dc51e321283e911b8ef59d28e687ff72564874
 size 3087284237

LLM/README.md CHANGED Viewed

@@ -12,10 +12,13 @@ tags:
   - military
   - lora
   - unsloth
 base_model: unsloth/Qwen3-1.7B
 ---
-# Qwen3-1.7B — ATC Display Text Formatter
 Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC display text. Designed to work downstream of the companion Whisper ASR model.
@@ -27,6 +30,17 @@ Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC di
 | Avg character edit distance | 0.0 |
 | Best eval loss | 0.0005 |
 ## Model Details
 | Key | Value |

   - military
   - lora
   - unsloth
+  - legacy
 base_model: unsloth/Qwen3-1.7B
 ---
+# Qwen3-1.7B — ATC Display Text Formatter (Legacy)
+> **Status: Legacy.** This model has been superseded by a deterministic rule-based formatter (23 rules, <1ms, 0 VRAM) that achieves equivalent accuracy on all production ATC patterns. The rule-based formatter is now used exclusively in the ASTRA pipeline. This model is retained for reference and potential future use with novel/unseen patterns.
 Fine-tuned Qwen3-1.7B that converts normalized ASR output into structured ATC display text. Designed to work downstream of the companion Whisper ASR model.
 | Avg character edit distance | 0.0 |
 | Best eval loss | 0.0005 |
+## Why Legacy?
+The rule-based formatter now handles all production patterns:
+- **Speed**: <1ms vs ~250ms per inference
+- **VRAM**: 0 GB vs ~3.3 GB
+- **Determinism**: 100% reproducible output, no sampling variance
+- **Auditability**: Each of the 23 rules is individually testable
+- **Coverage**: Handles all callsigns, locations, numeric patterns, and ATC abbreviations seen in training data
+The LLM remains useful if novel patterns emerge that the rule-based system cannot handle.
 ## Model Details
 | Key | Value |

README.md CHANGED Viewed

@@ -17,13 +17,18 @@ pipeline_tag: automatic-speech-recognition
 # ASTRA ATC Models
-Fine-tuned ASR and LLM models for Singapore military air traffic control, built for the [ASTRA](https://github.com/aether-raid) training simulator. The two models work as a pipeline:
 ```
-Audio  -->  ASR (Whisper)  -->  normalized text  -->  LLM (Qwen3)  -->  display text
-            "camel climb flight level zero nine zero"     "CAMEL climb FL090"
 ```
 ## Models
 ### [ASR/](./ASR) — Whisper Large v3 (CTranslate2 float16)
@@ -32,72 +37,65 @@ Fine-tuned for Singapore military ATC speech. Uses CTranslate2 float16 format fo
 | Metric | Value |
 |--------|-------|
-| WER | **0.24%** |
-| Base model | `jacktol/whisper-large-v3-finetuned-for-ATC` |
 | Size | 2.9 GB |
-| Training data | 6,730 entries (6,680 synthetic + 50 real recordings) |
-### [LLM/](./LLM) — Qwen3-1.7B Display Formatter
-Converts normalized ASR output into structured ATC display text (uppercases callsigns, contracts flight levels, formats frequencies, etc.).
 | Metric | Value |
 |--------|-------|
 | Exact match | **100%** (161/161) |
 | Base model | `unsloth/Qwen3-1.7B` |
 | Size | 3.3 GB |
-| Training data | 1,915 examples |
-## Pipeline Architecture
-In production, the models are chained with **confidence-based routing**:
-- **ASR confidence >= 90%** — rule-based formatter (23 deterministic rules, <1ms, 0 VRAM)
-- **ASR confidence < 90%** — LLM formatter (handles noisy/ambiguous ASR output better)
 ```
-Audio --> VAD (Silero) --> ASR (Whisper ct2) --> Post-processing
-                                                    |
-                                          confidence >= 0.90?
-                                          /                \
-                                        yes                no
-                                         |                  |
-                                   Rule formatter      LLM formatter
-                                         |                  |
-                                         \                 /
-                                          --> Display text
 ```
-| State | VRAM |
-|-------|------|
-| ASR only (startup) | ~2 GB |
-| ASR + LLM (after first low-confidence call) | ~5.5 GB |
 ## Domain
 Singapore military ATC covering:
 - **Airbases**: Tengah (WSAT, runway 18/36), Paya Lebar (WSAP, runway 02/20)
-- **Aircraft**: F-16C/D, F-15SG, C-130
-- **Approaches**: ILS, GCA, PAR, TACAN, DVOR/DME, Visual Straight-in
-- **60 callsigns**: CAMEL, NINJA, BEETLE, TAIPAN, HONDA, etc.
 - **Categories**: departure, approach, handoff, maneuver, landing, emergency, ground, recovery, pilot reports, military-specific ops
 ## Training History
 ### ASR
-| Run | WER | Key Change |
-|-----|-----|------------|
-| ct2_run5 | 0.48% | Initial fine-tune, pitch shift augmentation |
-| ct2_run6 | 0.40% | Removed pitch shift, added BPF/silence padding, weight decay |
-| **ct2_run7** | **0.24%** | Continued training, frozen encoder, +50 real recordings |
-### LLM
 | Run | Accuracy | Key Change |
 |-----|----------|------------|
 | llm_run3 | 98.1% (Qwen3-8B) | QLoRA 4-bit, 871 examples |
-| **llm_run4** | **100%** (Qwen3-1.7B) | bf16 LoRA, 1,915 examples with ASR noise augmentation |
 ## Quick Start
@@ -111,33 +109,15 @@ segments, info = model.transcribe("audio.wav", language="en", beam_size=5)
 text = " ".join(seg.text.strip() for seg in segments)
 ```
-### LLM
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained("./LLM", torch_dtype="auto", device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained("./LLM")
-messages = [
-    {"role": "system", "content": "Convert the following air traffic control transcript into structured display text."},
-    {"role": "user", "content": "camel climb flight level zero nine zero"},
-]
-text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
-inputs = tokenizer(text, return_tensors="pt").to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.3, top_p=0.9, top_k=30)
-result = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
-```
-## Download
 ```bash
-# Full repo
 huggingface-cli download aether-raid/astra-atc-models --local-dir ./models
-# ASR only
 huggingface-cli download aether-raid/astra-atc-models --include "ASR/*" --local-dir ./models
-# LLM only
 huggingface-cli download aether-raid/astra-atc-models --include "LLM/*" --local-dir ./models
 ```

 # ASTRA ATC Models
+Fine-tuned models for Singapore military air traffic control, built for the [ASTRA](https://github.com/aether-raid) training simulator.
+## Pipeline
 ```
+Audio  -->  VAD (Silero)  -->  ASR (Whisper)  -->  Rule Formatter  -->  Display Text
+                               "camel climb flight level zero nine zero"
+                                                                        "CAMEL climb FL090"
 ```
+The production pipeline uses a **rule-based formatter** (23 deterministic rules, <1ms, 0 VRAM) instead of the LLM. The LLM is retained for reference.
 ## Models
 ### [ASR/](./ASR) — Whisper Large v3 (CTranslate2 float16)
 | Metric | Value |
 |--------|-------|
+| WER | **0.66%** |
+| Base model | `openai/whisper-large-v3` |
 | Size | 2.9 GB |
+| Training | Full fine-tune with enhanced VHF radio augmentation |
+### [LLM/](./LLM) — Qwen3-1.7B Display Formatter (Legacy)
+> **Legacy.** Superseded by a deterministic rule-based formatter. Retained for reference.
+Converts normalized ASR output into structured ATC display text.
 | Metric | Value |
 |--------|-------|
 | Exact match | **100%** (161/161) |
 | Base model | `unsloth/Qwen3-1.7B` |
 | Size | 3.3 GB |
+## Architecture
 ```
+Audio --> VAD (Silero) --> ASR (Whisper ct2) --> Post-processing --> Rule Formatter --> Display Text
 ```
+| Component | Technology | Latency | VRAM |
+|-----------|-----------|---------|------|
+| VAD | Silero VAD (ONNX) | ~50ms | <100 MB |
+| ASR | Whisper Large v3 (CTranslate2) | ~500ms-2s | ~2 GB |
+| Formatter | 23 deterministic rules | <1ms | 0 MB |
+Total VRAM: ~2 GB (ASR only).
 ## Domain
 Singapore military ATC covering:
 - **Airbases**: Tengah (WSAT, runway 18/36), Paya Lebar (WSAP, runway 02/20)
+- **Aircraft**: F-16C/D, F-15SG, C-130, Hercules
+- **Approaches**: ILS, GCA, PAR, TACAN, DVOR/DME, VOR/DME, Visual Straight-in
+- **100+ callsigns**: CAMEL, NINJA, BEETLE, TAIPAN, MAVERICK, JAGUAR, LANCER, etc.
 - **Categories**: departure, approach, handoff, maneuver, landing, emergency, ground, recovery, pilot reports, military-specific ops
 ## Training History
 ### ASR
+| Run | WER | Base | Key Change |
+|-----|-----|------|------------|
+| ct2_run5 | 0.48% | jacktol/whisper-large-v3-finetuned-for-ATC | Initial fine-tune |
+| ct2_run6 | 0.40% | jacktol/whisper-large-v3-finetuned-for-ATC | +augmentation, weight decay |
+| ct2_run7 | 0.24% | jacktol/whisper-large-v3-finetuned-for-ATC | Frozen encoder, +50 real recordings |
+| **ct2_run8** | **0.66%** | openai/whisper-large-v3 | Full retrain from base, enhanced augmentation |
+> ct2_run8 trains from the original Whisper base for better generalisation to real-world ATC audio.
+### LLM (Legacy)
 | Run | Accuracy | Key Change |
 |-----|----------|------------|
 | llm_run3 | 98.1% (Qwen3-8B) | QLoRA 4-bit, 871 examples |
+| llm_run4 | 100% (Qwen3-1.7B) | bf16 LoRA, 1,915 examples with ASR noise augmentation |
 ## Quick Start
 text = " ".join(seg.text.strip() for seg in segments)
 ```
+### Download
 ```bash
+# Full repo (ASR + LLM)
 huggingface-cli download aether-raid/astra-atc-models --local-dir ./models
+# ASR only (recommended)
 huggingface-cli download aether-raid/astra-atc-models --include "ASR/*" --local-dir ./models
+# LLM only (legacy)
 huggingface-cli download aether-raid/astra-atc-models --include "LLM/*" --local-dir ./models
 ```