# Whisper American ATC - Fine-tuned for US Air Traffic Control

## Model Description
This model is a fine-tuned version of jlvdoorn/whisper-large-v3-atco2-asr adapted for American Air Traffic Control (ATC) radio communications. The base model was trained on European ATCO2 data; this version has been specialized for American ATC accents, phraseology, and radio characteristics.
- Developed by: Jeffrey Suu
- Model type: Automatic Speech Recognition (ASR)
- Language: English (American ATC)
- License: Apache 2.0
- Fine-tuned from: jlvdoorn/whisper-large-v3-atco2-asr
## Intended Use

### Direct Use
This model is designed for transcribing American Air Traffic Control radio communications, including:
- LiveATC recordings from US airports (IAH, JFK, SFO, etc.)
- Pilot-controller communications
- Ground control, tower, and approach frequencies
### Out-of-Scope Use
- Non-ATC aviation audio
- Non-American English accents
- General-purpose speech recognition
- Safety-critical real-time ATC systems without human oversight
## Training Details

### Training Data
- Source: LiveATC recordings from Houston IAH, New York JFK, San Francisco SFO
- Size: 55 original clips (~6 minutes), augmented to 275 samples (~31 minutes)
- Preprocessing:
  - Bandpass filtered (300-3400 Hz) to simulate ATC radio frequency response
  - Volume normalized
  - 5x data augmentation (time stretch, pitch shift, noise, gain)
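The bandpass and normalization steps above can be sketched as follows. This is an illustrative reconstruction, not the card's actual pipeline: the filter type and order are not specified, so a Butterworth design via SciPy is assumed here.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch of the 300-3400 Hz bandpass step; the Butterworth design and
# 4th order are assumptions (the card does not specify the filter used).
def atc_bandpass(audio, sr=16000, low=300.0, high=3400.0, order=4):
    nyq = sr / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, audio)  # zero-phase filtering, no group delay

def normalize(audio, peak=0.95):
    # Peak-normalize so clips have comparable volume
    return peak * audio / max(np.abs(audio).max(), 1e-9)

sr = 16000
t = np.arange(sr) / sr
# 100 Hz hum (out of band) plus a 1 kHz tone (in the ATC voice band)
audio = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
filtered = normalize(atc_bandpass(audio, sr))
```

After filtering, the out-of-band 100 Hz component is strongly attenuated while the 1 kHz tone passes through, mimicking the narrowband character of AM aviation radio.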
### Training Procedure
- Training regime: Full fine-tuning (fp32)
- Learning rate: 5e-6
- Batch size: 4 (effective: 16 with gradient accumulation)
- Epochs: 5
- Hardware: Google Colab Tesla T4 (15GB VRAM)
- Training time: ~25 minutes
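The hyperparameters above map roughly onto `transformers`' `Seq2SeqTrainingArguments`. The sketch below is a hypothetical reconstruction of the run configuration, not the author's actual script; `output_dir` and `predict_with_generate` are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the fine-tuning configuration described
# in the card; only the listed hyperparameters are from the source.
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-american-atc",   # assumed name
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # effective batch size 4 x 4 = 16
    num_train_epochs=5,
    fp16=False,                          # full fine-tuning in fp32
    predict_with_generate=True,          # needed to compute WER during eval
)
```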
## Evaluation

### Metrics
Word Error Rate (WER) on validation set:
| Model | WER |
|---|---|
| Base (European ATCO2) | 30.3% |
| This model (American ATC) | 13.7% |
| Improvement | -16.6 pp absolute (~55% relative) |
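The card does not name its scoring tool (standard choices are `jiwer` or the `evaluate` library), but as a minimal illustration, word error rate is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words:

```python
# Minimal WER sketch: word-level Levenshtein distance over reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: one substitution in a 5-word reference -> WER 0.2
print(wer("united three fifty contact tower",
          "united three fifteen contact tower"))  # -> 0.2
```

The sentences in the example are invented for illustration, not drawn from the validation set.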
### Key Improvements

- ✅ Correctly transcribes American number formatting (e.g., "1503" not "Fifteen Zero Three")
- ✅ Better handling of American accents and speech patterns
- ✅ Improved recognition of US-specific callsigns and airports
- ✅ Preserves numeric frequencies (e.g., "135.15" not "one three five one five")
## How to Use

```python
from transformers import pipeline

# Load the fine-tuned model (GPU recommended for large-v3)
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jeffreysuu/whisper-american-atc",
    device=0,           # set device=-1 to run on CPU
    chunk_length_s=30,  # split long LiveATC recordings into 30 s windows
)

# Transcribe an ATC recording
result = transcriber("path/to/atc_audio.wav")
print(result["text"])
```
## Limitations and Bias
- Limited training data: Fine-tuned on only 275 samples from 3 US airports
- Airport bias: Best performance on IAH, JFK, SFO; may vary on other airports
- Accent coverage: Primarily trained on American controllers; performance on non-American accents unknown
- Not production-ready: Requires human verification for safety-critical applications
## Technical Specifications

### Model Architecture
- Base: OpenAI Whisper Large v3 (1.5B parameters)
- Encoder: Audio → log-mel spectrogram → Transformer encoder
- Decoder: Transformer-based autoregressive text generation
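The log-mel front end the encoder consumes can be sketched from Whisper's published settings (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins). In practice the model uses `transformers`' `WhisperFeatureExtractor`; this NumPy version is purely illustrative:

```python
import numpy as np

# Illustrative log-mel front end with Whisper-style settings
# (16 kHz, n_fft=400, hop=160, 80 mel bins); not the library code.
def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Short-time power spectrum with a Hann window
    window = np.hanning(n_fft)
    frames = [
        np.abs(np.fft.rfft(window * audio[i:i + n_fft])) ** 2
        for i in range(0, len(audio) - n_fft + 1, hop)
    ]
    power = np.array(frames).T                       # (n_fft//2+1, n_frames)

    # Triangular mel filterbank: linear spacing on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_pts = np.linspace(mel(0), mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log10(np.maximum(fbank @ power, 1e-10))

features = log_mel_spectrogram(np.random.randn(16000))  # 1 s of noise
print(features.shape)  # (80, n_frames)
```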
### Compute Infrastructure
- Hardware: NVIDIA Tesla T4 (Google Colab)
- Software:
  - Hugging Face Transformers 4.57.3
  - Python 3.12
  - PyTorch
## Citation

If you use this model, please cite the original Whisper paper and the ATCO2 work:

```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}
```
## Contact

- Model Card Author: Jeffrey Suu
- GitHub: [Your GitHub]
- Email: [Your Email]
For issues or questions about this model, please open an issue on the model repository.