Whisper American ATC - Fine-tuned for US Air Traffic Control

Model Description

This model is a fine-tuned version of jlvdoorn/whisper-large-v3-atco2-asr adapted for American Air Traffic Control (ATC) radio communications. The base model was trained on European ATCO2 data; this version has been specialized for American ATC accents, phraseology, and radio characteristics.

Developed by: Jeffrey Suu
Model type: Automatic Speech Recognition (ASR)
Language: English (American ATC)
License: Apache 2.0
Finetuned from: jlvdoorn/whisper-large-v3-atco2-asr

Intended Use

Direct Use

This model is designed for transcribing American Air Traffic Control radio communications, including:

  • LiveATC recordings from US airports (IAH, JFK, SFO, etc.)
  • Pilot-controller communications
  • Ground control, tower, and approach frequencies

Out-of-Scope Use

  • Non-ATC aviation audio
  • Non-American English accents
  • General-purpose speech recognition
  • Safety-critical real-time ATC systems without human oversight

Training Details

Training Data

  • Source: LiveATC recordings from Houston IAH, New York JFK, San Francisco SFO
  • Size: 55 original clips (6 minutes), augmented to 275 samples (31 minutes)
  • Preprocessing:
    • Bandpass filtered (300-3400 Hz) to simulate ATC radio frequency response
    • Volume normalized
    • 5x data augmentation (time stretch, pitch shift, noise, gain)
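The bandpass step above can be sketched in a few lines. The actual preprocessing script is not published, so the function name below is hypothetical; this illustrative NumPy FFT version simply zeroes spectral bins outside the 300–3400 Hz voice band that narrowband ATC radios pass:

```python
import numpy as np

def bandpass_fft(audio: np.ndarray, sr: int, low: float = 300.0, high: float = 3400.0) -> np.ndarray:
    """Zero out frequency content outside low..high Hz to approximate
    the narrowband frequency response of ATC radio audio."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    mask = (freqs >= low) & (freqs <= high)
    return np.fft.irfft(spectrum * mask, n=len(audio))

# A 100 Hz tone (below the band) is almost entirely removed,
# while a 1 kHz tone (in band) passes through.
sr = 16000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 100 * t)
print(np.max(np.abs(bandpass_fft(low_tone, sr))) < 0.05)  # True
```

A real pipeline would more likely use an IIR filter (e.g. a Butterworth bandpass) to avoid the hard spectral cutoff, but the FFT mask conveys the idea.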

Training Procedure

  • Training regime: Full fine-tuning (fp32)
  • Learning rate: 5e-6
  • Batch size: 4 (effective: 16 with gradient accumulation)
  • Epochs: 5
  • Hardware: Google Colab Tesla T4 (15GB VRAM)
  • Training time: ~25 minutes
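The hyperparameters above map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the exact training script; the output directory name is illustrative:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate configuration matching the reported setup:
# batch size 4 with 4x gradient accumulation -> effective batch 16,
# learning rate 5e-6, 5 epochs, full-precision (fp32) fine-tuning.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-american-atc",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    num_train_epochs=5,
    fp16=False,                     # fp32 training as reported
    predict_with_generate=True,     # generate text during eval for WER
    eval_strategy="epoch",
)
```
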

Evaluation

Metrics

Word Error Rate (WER) on validation set:

Model                        WER
Base (European ATCO2)        30.3%
This model (American ATC)    13.7%
Improvement                  16.6 percentage points (absolute)
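WER counts word-level substitutions, insertions, and deletions against a reference transcript, divided by the reference length. A minimal pure-Python implementation (illustrative, not the evaluation script used here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of twelve -> WER of 1/12 (about 8.3%)
ref = "united four five two climb and maintain flight level three five zero"
hyp = "united four five two climb maintain flight level three five zero"
print(round(wer(ref, hyp), 3))  # 0.083
```

In practice libraries such as `jiwer` are used, often with text normalization applied first so that, e.g., numeric formatting differences are scored consistently.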

Key Improvements

✅ Correctly transcribes American number formatting (e.g., "1503" not "Fifteen Zero Three")
✅ Better handling of American accents and speech patterns
✅ Improved recognition of US-specific callsigns and airports
✅ Preserves numeric frequencies (e.g., "135.15" not "one three five one five")

How to Use

import torch
from transformers import pipeline

# Load the fine-tuned model (GPU if available, otherwise CPU)
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jeffreysuu/whisper-american-atc",
    device=0 if torch.cuda.is_available() else -1,
)

# Transcribe audio
result = transcriber("path/to/atc_audio.wav")
print(result["text"])
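Whisper expects 16 kHz mono input. The pipeline resamples automatically when given a file path, but if you feed raw arrays (e.g. decoded from a LiveATC stream) you may need to resample first. A minimal linear-interpolation sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Linearly resample mono audio to the 16 kHz rate Whisper expects."""
    target_sr = 16000
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    t_out = np.linspace(0.0, duration, n_out, endpoint=False)
    t_in = np.arange(len(audio)) / orig_sr
    return np.interp(t_out, t_in, audio)

# One second of 22.05 kHz audio becomes 16000 samples
audio = np.random.randn(22050)
print(len(resample_to_16k(audio, 22050)))  # 16000
```

Dedicated resamplers (e.g. `librosa.resample` or `torchaudio`) use proper anti-aliasing filters and are preferable for quality; linear interpolation is shown only for self-containedness.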

Limitations and Bias

  • Limited training data: Fine-tuned on only 275 samples from 3 US airports
  • Airport bias: Best performance on IAH, JFK, SFO; may vary on other airports
  • Accent coverage: Primarily trained on American controllers; performance on non-American accents unknown
  • Not production-ready: Requires human verification for safety-critical applications

Technical Specifications

Model Architecture

  • Base: OpenAI Whisper Large v3 (1.5B parameters)
  • Encoder: log-mel spectrogram of the audio → hidden representations
  • Decoder: Transformer-based autoregressive text generation

Compute Infrastructure

  • Hardware: NVIDIA Tesla T4 (Google Colab)
  • Software:
    • Hugging Face Transformers 4.57.3
    • Python 3.12
    • PyTorch

Citation

If you use this model, please cite the original Whisper paper and ATCO2 work:

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}

Contact

Model Card Author: Jeffrey Suu
GitHub: [Your GitHub]
Email: [Your Email]

For issues or questions about this model, please open an issue on the model repository.
