# Hyperparameters: Whisper ATC Fine-tune (Run 9)

## Model

| Key | Value |
|-----|-------|
| Base model | `openai/whisper-large-v3` |
| Architecture | Whisper Large v3 |
| `d_model` | 1280 |
| Encoder layers | 32 |
| Decoder layers | 32 |
| Encoder attention heads | 20 |
| Decoder attention heads | 20 |
| Mel bins | 128 |

## Training

| Key | Value |
|-----|-------|
| Optimizer | AdamW (bitsandbytes 8-bit) |
| Learning rate | 1e-05 |
| LR scheduler | Linear |
| Warmup ratio | 0.05 |
| Adam β₁ / β₂ / ε | 0.9 / 0.999 / 1e-8 |
| Weight decay | 0.01 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 8 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Gradient checkpointing | Yes (`use_reentrant=False`) |
| Mixed precision | fp16 |
| Max grad norm | 1.0 |
| Max epochs (configured) | 30 |
| Early stop patience | 7 epochs |
| Label smoothing | 0.0 |
| Freeze encoder | No |
| Seed | 42 |
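The batch-size and schedule rows fit together arithmetically. The sketch below works through them under several assumptions that this file does not confirm: a single training device, a schedule that warms up linearly from zero and then decays linearly to zero, a scheduler sized for all 30 configured epochs, and 1,115 optimizer steps per epoch (inferred from the best checkpoint's step count, 13380 at epoch 12). `linear_lr` is a hypothetical helper name.

```python
import math

# Values from the Training table.
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 16
BASE_LR = 1e-5
WARMUP_RATIO = 0.05
MAX_EPOCHS = 30

# Assumptions (not stated in the table): one device, and 1115 optimizer
# steps per epoch, inferred from checkpoint-13380 at epoch 12.
NUM_DEVICES = 1
STEPS_PER_EPOCH = 13380 // 12

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
total_steps = STEPS_PER_EPOCH * MAX_EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then linear decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - warmup_steps)


print(effective_batch, total_steps, warmup_steps)
print(linear_lr(warmup_steps), linear_lr(21185))
```

Under these assumptions the warmup covers roughly the first one and a half epochs, and the run stopped (step 21185) while the learning rate was still well above zero.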

## Data Sources

| Source | Role | Size |
|--------|------|------|
| axite_all.json | SG military ATC synthetic (4 voices + human) | ~15,716 |
| deepdml/conversations | Real Singapore Changi ATC VHF radio | ~1,443 |
| mnsc-part1-test | MNSC SG-accented read speech | ~3,000 |

## Augmentation

- Gaussian noise (p=0.4, amplitude 0.001–0.015)
- Time stretch (p=0.3, rate 0.9–1.1)
- Random silence padding (p=0.5, 0–0.7 s each end)
- BandPassFilter (p=0.75, 300–3400 Hz, VHF radio simulation)
- Clip (p=0.2, ±0.8)
- Mp3Compression (p=0.3, 32–64 kbps)
- SpecAugment: `FrequencyMasking(freq_mask_param=27)` + `TimeMasking(time_mask_param=100, p=0.05)`
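Each waveform transform in the list carries its own probability `p`, so any given clip receives a random subset of them. The sketch below illustrates that gating for three of the transforms using only the Python standard library. It is not the training pipeline: the function names are hypothetical, the 16 kHz sample rate is assumed (Whisper's input rate), "amplitude" is assumed to mean the noise standard deviation, and the band-pass, MP3, time-stretch, and SpecAugment steps are left out because they need a DSP library.

```python
import random

SAMPLE_RATE = 16_000  # assumption: Whisper's 16 kHz input rate


def add_gaussian_noise(wave, rng, lo=0.001, hi=0.015):
    # Assumption: "amplitude" is the noise standard deviation,
    # drawn uniformly from [lo, hi] once per clip.
    sigma = rng.uniform(lo, hi)
    return [x + rng.gauss(0.0, sigma) for x in wave]


def pad_silence(wave, rng, max_seconds=0.7):
    # Independent random amount of silence on each end.
    head = int(rng.uniform(0.0, max_seconds) * SAMPLE_RATE)
    tail = int(rng.uniform(0.0, max_seconds) * SAMPLE_RATE)
    return [0.0] * head + list(wave) + [0.0] * tail


def clip(wave, limit=0.8):
    # Hard-limit samples to [-limit, +limit].
    return [max(-limit, min(limit, x)) for x in wave]


# (probability, transform) pairs, using the p values from the list above.
# The relative order follows the list; the real ordering is an assumption.
PIPELINE = [
    (0.4, add_gaussian_noise),
    (0.5, pad_silence),
    (0.2, lambda wave, rng: clip(wave)),
]


def augment(wave, rng):
    """Apply each transform independently with its own probability."""
    for p, transform in PIPELINE:
        if rng.random() < p:
            wave = transform(wave, rng)
    return wave


rng = random.Random(42)
clean = [0.9, -0.9, 0.1, 0.0] * 4000  # one second of toy audio
augmented = augment(clean, rng)
print(len(clean), len(augmented))
```

Because the gates are independent, some clips pass through unchanged while others receive all three transforms.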

## Early Stopping

| Key | Value |
|-----|-------|
| Metric | WER (lower is better) |
| Stopped at | Step 21185 / Epoch 19 |
| Patience | 7 epochs |

## Results

| Epoch | Eval loss | WER |
|-------|-----------|-----|
| 1.0 | 0.0838 | 11.46% |
| 2.0 | 0.0550 | 4.28% |
| 3.0 | 0.0406 | 2.79% |
| 4.0 | 0.0417 | 6.58% |
| 5.0 | 0.0381 | 5.46% |
| 6.0 | 0.0372 | 3.27% |
| 7.0 | 0.0375 | 1.39% |
| 8.0 | 0.0381 | 5.52% |
| 9.0 | 0.0188 | 0.83% |
| 10.0 | 0.0202 | 0.84% |
| 11.0 | 0.0185 | 1.05% |
| 12.0 | 0.0189 | 0.82% ← best |
| 13.0 | 0.0189 | 0.95% |
| 14.0 | 0.0202 | 1.19% |
| 15.0 | 0.0206 | 0.91% |
| 16.0 | 0.0191 | 1.16% |
| 17.0 | 0.0169 | 1.12% |
| 18.0 | 0.0176 | 1.19% |
| 19.0 | 0.0185 | 1.19% |

Best checkpoint: `training/output_run9/checkpoint-13380` (epoch 12, WER 0.82%)
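The stopping point at epoch 19 follows from the WER column and the patience of 7 epochs. The sketch below replays a simple patience counter over the reported values, assuming one evaluation per epoch and that any strict decrease in WER counts as an improvement (a zero threshold). `replay_early_stopping` is a hypothetical helper, not the callback used in training.

```python
# WER (%) per epoch, copied from the Results table.
WER = [11.46, 4.28, 2.79, 6.58, 5.46, 3.27, 1.39, 5.52, 0.83, 0.84,
       1.05, 0.82, 0.95, 1.19, 0.91, 1.16, 1.12, 1.19, 1.19]
PATIENCE = 7


def replay_early_stopping(wer_by_epoch, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best_wer = float("inf")
    best_epoch = None
    stale = 0  # consecutive evaluations without improvement
    for epoch, wer in enumerate(wer_by_epoch, start=1):
        if wer < best_wer:  # assumption: any strict decrease counts
            best_wer, best_epoch, stale = wer, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = replay_early_stopping(WER, PATIENCE)
print(best_epoch, stop_epoch)  # 12 19

# Both reported step counts imply the same number of steps per epoch.
print(13380 / 12, 21185 / 19)  # 1115.0 1115.0
```

Note that eval loss reaches its minimum later (0.0169 at epoch 17), so selecting on WER rather than loss is what makes epoch 12 the best checkpoint.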

## Output

| Key | Value |
|-----|-------|
| Best HF checkpoint | `training/output_run9/best/` |
| CTranslate2 model | `training/saved_models/ct2_run9/` |
| Quantization | float16 |
| Inference backend | faster-whisper |
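For orientation, a minimal sketch of loading the converted model with faster-whisper might look like the following. It assumes the `faster-whisper` package is installed and a CUDA device is available; the `transcribe` wrapper, the English language setting, and the beam size are illustrative choices, not settings taken from this run.

```python
CT2_MODEL_DIR = "training/saved_models/ct2_run9/"  # from the Output table


def transcribe(audio_path: str, model_dir: str = CT2_MODEL_DIR) -> str:
    """Transcribe one file with the converted model and return the text."""
    # Imported lazily so this module loads even without faster-whisper.
    from faster_whisper import WhisperModel

    # Assumptions: a CUDA device is available; float16 matches the
    # quantization listed in the Output table.
    model = WhisperModel(model_dir, device="cuda", compute_type="float16")
    # Assumption: English ATC speech; beam_size=5 is an illustrative choice.
    segments, _info = model.transcribe(audio_path, language="en", beam_size=5)
    # `segments` is a generator; decoding happens as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)
```

Calling `transcribe("sample.wav")` (a hypothetical file name) would return the joined segment text. On a machine without a GPU, `device="cpu"` with `compute_type="int8"` is a common fallback, at the cost of diverging from the float16 setting above.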
