# Hyperparameters – Whisper ATC Fine-tune

## Model

| Key | Value |
|-----|-------|
| Base model | `openai/whisper-large-v3` |
| Architecture | Whisper Large v3 |
| d_model | 1280 |
| Encoder layers | 32 |
| Decoder layers | 32 |
| Encoder attention heads | 20 |
| Decoder attention heads | 20 |
| Mel bins | 128 |

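These values match the published `openai/whisper-large-v3` configuration and can be sanity-checked without downloading weights (a minimal sketch, assuming the `transformers` library is installed):

```python
from transformers import WhisperConfig

# Fetch the config only (no weights) and confirm the table above.
cfg = WhisperConfig.from_pretrained("openai/whisper-large-v3")
assert cfg.d_model == 1280
assert cfg.encoder_layers == cfg.decoder_layers == 32
assert cfg.encoder_attention_heads == cfg.decoder_attention_heads == 20
assert cfg.num_mel_bins == 128
```
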
## Training

| Key | Value |
|-----|-------|
| Optimizer | AdamW (bitsandbytes 8-bit) |
| Learning rate | 1e-5 |
| LR scheduler | Linear |
| Warmup ratio | 0.05 |
| Adam β₁ / β₂ / ε | 0.9 / 0.999 / 1e-8 |
| Weight decay | 0.01 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 8 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Gradient checkpointing | Yes (`use_reentrant=False`) |
| Mixed precision | fp16 |
| Max grad norm | 1.0 |
| Max epochs (configured) | 25 |
| Early stop patience | 5 epochs |
| Label smoothing | 0.0 |
| Freeze encoder | No |
| Seed | 42 |

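For reference, the table maps onto `Seq2SeqTrainingArguments` roughly as below. This is a sketch, not the original run script; `output_dir`, the per-epoch eval/save cadence, and the `"wer"` metric key are inferred from the sections that follow.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the run configuration matching the table above.
args = Seq2SeqTrainingArguments(
    output_dir="training/output_run8",
    optim="adamw_bnb_8bit",          # AdamW with bitsandbytes 8-bit states
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    weight_decay=0.01,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=16,  # effective batch size 1 x 16 = 16
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    fp16=True,
    max_grad_norm=1.0,
    num_train_epochs=25,
    label_smoothing_factor=0.0,
    seed=42,
    eval_strategy="epoch",           # assumption: one eval per epoch
    save_strategy="epoch",
    metric_for_best_model="wer",     # compute_metrics must return a "wer" key
    greater_is_better=False,
    load_best_model_at_end=True,     # needed for early stopping below
)
```
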
## Augmentation

- Gaussian noise (p=0.4, amplitude 0.001–0.015)
- Time stretch (p=0.3, rate 0.9–1.1)
- Random silence padding (p=0.5, 0–0.7 s each end)
- `BandPassFilter` (p=0.75, 300–3400 Hz, VHF radio simulation)
- Clip (p=0.2, ±0.8)
- `Mp3Compression` (p=0.3, 32–64 kbps)
- SpecAugment: `FrequencyMasking(freq_mask_param=27)` + `TimeMasking(time_mask_param=100, p=0.05)`

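The augmentation library isn't named here; a plausible reconstruction uses audiomentations for the waveform transforms and torchaudio for SpecAugment. In this sketch the band-pass is implemented directly with scipy so the 300–3400 Hz cutoffs hold exactly, and `vhf_bandpass` / `pad_random_silence` are hypothetical helpers, not identifiers from the repo.

```python
import numpy as np
import torchaudio.transforms as T
from audiomentations import AddGaussianNoise, Clip, Compose, Mp3Compression, TimeStretch
from scipy.signal import butter, sosfilt

# Waveform-level transforms, parameters as listed above.
waveform_augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.4),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    Clip(a_min=-0.8, a_max=0.8, p=0.2),
    Mp3Compression(min_bitrate=32, max_bitrate=64, p=0.3),
])

def vhf_bandpass(samples, sample_rate=16000, low=300.0, high=3400.0, p=0.75, rng=None):
    """Butterworth band-pass over the 300-3400 Hz VHF voice band (hypothetical helper)."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return samples
    sos = butter(4, [low, high], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, samples).astype(samples.dtype)

def pad_random_silence(samples, sample_rate=16000, max_seconds=0.7, p=0.5, rng=None):
    """Pad 0-0.7 s of silence onto each end with probability p (hypothetical helper)."""
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return samples
    left = int(rng.uniform(0.0, max_seconds) * sample_rate)
    right = int(rng.uniform(0.0, max_seconds) * sample_rate)
    return np.pad(samples, (left, right))

# SpecAugment on the (n_mels, frames) log-mel features, after feature extraction.
freq_mask = T.FrequencyMasking(freq_mask_param=27)
time_mask = T.TimeMasking(time_mask_param=100, p=0.05)
```

Waveform transforms would run before feature extraction, e.g. `wav = pad_random_silence(vhf_bandpass(waveform_augment(samples=wav, sample_rate=16000)))`; the two masking transforms apply to the log-mel tensor afterwards.
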
## Early stopping

| Key | Value |
|-----|-------|
| Metric | WER (lower is better) |
| Stopped at | Step 6919 / Epoch 11 |
| Patience | 5 epochs |

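Patience 5 with per-epoch evaluation is consistent with the results below: the best WER lands at epoch 6 and training halts after epoch 11. A minimal sketch, assuming the `transformers` callback alongside the arguments above:

```python
from transformers import EarlyStoppingCallback

# Halts training once eval WER fails to improve for 5 consecutive
# evaluations (one per epoch here). Relies on metric_for_best_model="wer"
# and greater_is_better=False in the training arguments.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
# Passed to the trainer via: Seq2SeqTrainer(..., callbacks=[early_stopping])
```
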
## Results

| Epoch | Eval loss | WER |
|-------|-----------|-----|
| 1.0 | 0.0496 | 3.46% |
| 2.0 | 0.0288 | 1.84% |
| 3.0 | 0.0239 | 0.82% |
| 4.0 | 0.0245 | 1.55% |
| 5.0 | 0.0195 | 0.92% |
| 6.0 | 0.0231 | **0.66%** (best) |
| 7.0 | 0.0199 | 0.70% |
| 8.0 | 0.0211 | 2.62% |
| 9.0 | 0.0191 | 0.72% |
| 10.0 | 0.0186 | 4.43% |
| 11.0 | 0.0172 | 0.69% |

Best checkpoint: `training/output_run8/checkpoint-3774` (epoch 6, WER 0.66%)

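WER is presumably computed the standard way, e.g. with the `evaluate` package (an assumption; the metric code isn't shown here):

```python
import evaluate

wer = evaluate.load("wer")

# Toy example: one substituted word out of six -> WER = 1/6, about 16.7%.
refs = ["cleared to land runway two seven"]
hyps = ["cleared to land runway two eight"]
print(f"{100 * wer.compute(predictions=hyps, references=refs):.2f}%")
```
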
## Output

| Key | Value |
|-----|-------|
| Best HF checkpoint | `training/output_run8/best/` |
| CTranslate2 model | `training/saved_models/ct2_run8/` |
| Quantization | float16 |
| Inference backend | faster-whisper |
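
Conversion and inference presumably follow the standard CTranslate2 / faster-whisper path; a sketch via the Python converter API (equivalent to the `ct2-transformers-converter` CLI; the `copy_files` list and the sample clip name are assumptions):

```python
from ctranslate2.converters import TransformersConverter
from faster_whisper import WhisperModel

# Convert the best HF checkpoint to CTranslate2 with float16 weights.
# copy_files carries the tokenizer/preprocessor configs that
# faster-whisper expects next to the converted model.
converter = TransformersConverter(
    "training/output_run8/best",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("training/saved_models/ct2_run8", quantization="float16")

# Inference with faster-whisper.
model = WhisperModel("training/saved_models/ct2_run8", compute_type="float16")
segments, info = model.transcribe("sample_atc_clip.wav")  # hypothetical clip
print(" ".join(segment.text for segment in segments))
```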