rasr-parakeet-v1

ATC ASR finetune of nvidia/parakeet-tdt-0.6b-v3 on a synthetic US-style ATC corpus (radiotalk-us-audio-tada-noisy) with a small real-ATC anchor (ATCO2 + ATCOSIM train splits). Trained as v1 of the rasr toolkit.

Headline

| Metric | This model | Prior public SOTA (jlvdoorn/whisper-large-v3-atco2-asr) |
|---|---|---|
| ATCO2 val WER | 0.125 | 0.157 |
| ATCO2 val CER | 0.078 | 0.088 |
| ATCO2 val numeric WER | 0.050 | 0.074 |

Roughly 20% relative WER reduction (0.157 to 0.125) over the previous public SOTA on the ATCO2 validation benchmark, with a smaller base model (0.6B params vs 1.55B).
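
The relative improvements can be re-derived from the rounded values in the headline table. The sketch below is only that arithmetic; the helper name `relative_reduction` is illustrative and not part of the rasr toolkit.

```python
# Re-derive relative error reductions from the rounded headline table values.
# `relative_reduction` is an illustrative helper, not part of the rasr toolkit.

def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline` (lower is better)."""
    return (baseline - new) / baseline

# metric -> (prior public SOTA, this model), copied from the table above
headline = {
    "WER": (0.157, 0.125),
    "CER": (0.088, 0.078),
    "numeric WER": (0.074, 0.050),
}

for metric, (baseline, ours) in headline.items():
    print(f"{metric}: {relative_reduction(baseline, ours):.1%} relative reduction")
```

With the rounded table values this gives about 20.4% for WER, 11.4% for CER and 32.4% for numeric WER. Figures computed from unrounded scores may differ slightly.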

Quick start

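import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v1")

# Input should be 16 kHz mono audio, at most 18 seconds long
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)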

Or via the rasr eval toolkit:

pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v1 \
  -d hf:jlvdoorn/atco2-asr:validation \
  --language en --batch-size 16

Architecture

  • Base: nvidia/parakeet-tdt-0.6b-v3 (FastConformer encoder + TDT decoder, 0.6B params)
  • Tokenizer: kept from base — SentencePiece BPE 8192 tokens, multilingual
  • Sample rate: 16 kHz mono
  • Max input duration: 18 seconds (longer inputs may degrade transcription quality; the limit relates to TDT joint-network memory)
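
For recordings longer than the 18-second limit, one option is to split the audio into overlapping windows before transcription. The following is a minimal sketch of the index arithmetic only. The helper name `chunk_bounds` and the 1-second overlap are assumptions for illustration, and it does not resample audio or merge the overlapping transcripts.

```python
# Split a long recording into windows no longer than the 18 s limit.
# The helper name and the default overlap are assumptions for illustration.

SAMPLE_RATE = 16_000  # Hz, from the model card
MAX_SECONDS = 18.0    # max input duration, from the model card


def chunk_bounds(num_samples: int, overlap_seconds: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices that cover the whole recording."""
    if num_samples <= 0:
        return []
    window = int(MAX_SECONDS * SAMPLE_RATE)
    hop = window - int(overlap_seconds * SAMPLE_RATE)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")
    bounds = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += hop
    return bounds


# A 40-second recording at 16 kHz yields three windows.
print(chunk_bounds(40 * SAMPLE_RATE))
```

Each window can then be written out as its own 16 kHz mono clip and passed to `model.transcribe`. How best to stitch the overlapping outputs back together is left open here.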

Training data

The bulk of the training data pairs transcripts generated by Llama 3.2 with audio synthesized via the Tada TTS pipeline, anchored by a smaller amount of real ATC audio. Specifically:

| Source | Type | Role |
|---|---|---|
| twangodev/radiotalk-us-audio-tada-noisy (200k subset) | Synthetic US ATC | Bulk training audio. Dialogue transcripts generated by Llama 3.2, audio synthesized by Tada (TTS) with VHF channel degradation pipeline. |
| jlvdoorn/atco2-asr (train split, ~446 clips) | Real European ATC | Real-data anchor; upweighted 10× to supply real-radio acoustic priors and European operator vocabulary. |
| jlvdoorn/atco2-asr-atcosim (train, ~10k clips) | Real EU ATC + simulator | Real-data anchor; upweighted 10×. |
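
To get a feel for what the 10× upweighting does to the mix, the approximate clip counts above can be turned into shares of the training manifest. This is a rough sketch under one assumption: that "10×" means each clip appears ten times in the concatenated manifest. The clip counts are the approximate figures from the table, and any overlap between the two real-data sources is ignored.

```python
# Approximate share of each source in the weighted training manifest.
# Assumption: "10x" means each clip is repeated 10 times when manifests are concatenated.

# source -> (approximate clip count, repeat weight)
sources = {
    "radiotalk (synthetic)": (200_000, 1),
    "ATCO2 train": (446, 10),
    "ATCO2+ATCOSIM train": (10_000, 10),
}

weighted = {name: clips * weight for name, (clips, weight) in sources.items()}
total = sum(weighted.values())

for name, count in weighted.items():
    print(f"{name}: {count:,} entries, {count / total:.1%} of the mix")
```

Under that assumption the synthetic corpus is still about two thirds of the entries, with the real-data anchors making up the remaining third.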

Llama 3.2 attribution

This model is "Built with Llama" under the Llama 3.2 Community License. Llama 3.2 was used to generate the ATC dialogue transcripts in the radiotalk-us-audio-tada-noisy dataset — those transcripts are the supervised targets the model learned to produce. The audio itself was synthesized by Tada (not Llama).

Training recipe

Full reproducible recipe: configs/train/rtx6kpro/parakeet-mixed.yaml.

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW, β=(0.9, 0.98), weight_decay=1e-3 |
| Learning rate | 1e-4 |
| Schedule | CosineAnnealing, warmup 5000 steps, min_lr=1e-6 |
| Batch size | 32 (effective) |
| Precision | bf16-mixed |
| Max steps | 50,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95-1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (radiotalk ×1, ATCO2 train ×10, ATCO2+ATCOSIM train ×10) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | ~12 hours |
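
The learning-rate schedule in the table can be visualized with a few lines of arithmetic. The sketch below assumes a linear warmup from zero to the peak rate followed by cosine decay to `min_lr`. That is the usual shape of a warmup-plus-cosine schedule, but the exact NeMo implementation may differ in detail, and `lr_at` is an illustrative name.

```python
import math

# Values from the hyperparameter table
PEAK_LR = 1e-4
MIN_LR = 1e-6
WARMUP_STEPS = 5_000
MAX_STEPS = 50_000


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 1e-4 at step 5,000, passes about 5.05e-5 halfway through the decay, and ends at 1e-6 at step 50,000.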

Strengths

  • Structurally robust ATC output. Position-call grammar (CTAF + towered), runway IDs, headings, and altitude readbacks are recovered cleanly.
  • Strong on numeric/safety-critical content. Per-utterance numeric WER 0.050 on ATCO2 val, versus 0.074 for the prior SOTA on the same axis (about a one-third relative reduction).
  • Stable on out-of-distribution audio. Zero runaway hallucinations observed on real US GA audio (TartanAviation KBTP), unlike LLM-decoder ASR models (e.g., Canary-Qwen, Granite Speech) which confabulate confidently on hard audio.
  • Small footprint. 0.6B params, fits in 4 GB VRAM at inference; ~10× faster than larger Whisper-based ATC finetunes.
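
For readers who want to check WER figures on their own clips, word error rate is the word-level edit distance between reference and hypothesis, divided by the reference length. The sketch below is plain WER only. The lowercase and whitespace normalization is an assumption, the example utterance is made up, and the card does not specify how the toolkit normalizes text or selects tokens for numeric WER.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate with lowercase + whitespace tokenization (an assumed normalization)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Made-up utterance with a single airport-name substitution
reference = "butler traffic cessna eight one niner charlie mike left downwind runway two six"
hypothesis = "bello traffic cessna eight one niner charlie mike left downwind runway two six"
print(f"WER = {wer(reference, hypothesis):.3f}")
```

A single substituted word in a 13-word call still costs one error, which is why substitution-type failures show up in WER even when the rest of the transmission is intact.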

Limitations

This model was trained on a US-style synthetic corpus plus a European real-data anchor. The combination produces specific biases users should be aware of:

  1. Operator substitution bias. The model has been observed substituting unfamiliar callsigns with familiar ones from its training distribution — e.g., emitting "Lufthansa" or "Delta" where the audio contained a less-common operator. Particularly noticeable on US general aviation (GA) traffic, where N-number tail callsigns (e.g., "Cessna Eight One Niner Charlie Mike") may be mis-substituted with major airline prefixes.

  2. Limited US GA airport name coverage. The model has not seen most small US GA airport names during training. On real US GA audio (e.g., TartanAviation KBTP recordings), it produces phonetically-similar substitutions for the airport name ("Bravo Traffic", "Bello Traffic") instead of the correct name ("Butler Traffic").

  3. European real-anchor contamination on US output. Training included real European ATCO2/ATCOSIM data to anchor the model to real-radio acoustics and to reach the reported result on ATCO2 val. This European prior is visible in US-style transcription: tokens such as "Swiss", "Bern Tower", and "Belfast Tower" occasionally appear where they should not.

  4. Sanity rate on real US GA audio: 77% (10% CLEAN + 67% PLAUSIBLE-MISHEARD across 69 TartanAviation KBTP clips). In the imperfect cases, the failure is overwhelmingly a word substitution in the correct slot (a plausible but wrong word in the right position), not garbling or hallucination.

  5. Evaluation distribution. This model is benchmarked against ATCO2 (European real ATC). It has not been evaluated against a US ATC benchmark — no fully public US ATC ASR test set with annotations currently exists.

Recommended usage

  • For European ATC (or audio matching ATCO2-style distribution): deploy as-is. Numbers above are the expected performance.
  • For US ATC: use with inference-time hot-word biasing against a known callsign + airport-name vocabulary specific to the deployment region. NeMo's TDT decoder supports hot-word biasing via change_decoding_strategy(). Most substitution failures collapse to correct output with appropriate biasing.
  • For safety-critical applications: always layer with confidence-based rejection. This model is intended as a research/development checkpoint, not as a safety-certified ATC transcription system.
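
Confidence-based rejection can be as simple as a threshold check on per-word scores. The sketch below is hypothetical throughout: the `Transcript` container, the confidence values and both thresholds are made up for illustration, and it does not show how to obtain confidences from NeMo. Thresholds would need tuning on held-out audio from the deployment region.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    word_confidences: list[float]  # hypothetical per-word scores in [0, 1]


def accept(t: Transcript, min_word: float = 0.5, min_mean: float = 0.8) -> bool:
    """Accept only if no single word is very uncertain and the average is high.

    Both thresholds are placeholder values, not recommendations.
    """
    if not t.word_confidences:
        return False
    mean = sum(t.word_confidences) / len(t.word_confidences)
    return min(t.word_confidences) >= min_word and mean >= min_mean


# Made-up examples: one confident readback, one with a doubtful airport name
clean = Transcript("runway two six cleared to land", [0.97, 0.95, 0.96, 0.93, 0.99, 0.94])
shaky = Transcript("bello traffic cessna", [0.31, 0.92, 0.90])

for t in (clean, shaky):
    print(t.text, "->", "accept" if accept(t) else "flag for review")
```

Checking the minimum as well as the mean matters for ATC: a single low-confidence callsign or airport name can be the safety-relevant error even when the rest of the utterance scores well.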

Citation

If you use this model, please cite the project and the underlying components:

@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}

And the base model:

@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}

And Llama 3.2 (training transcripts):

@misc{llama3.2,
  author = {{Meta AI}},
  title = {The Llama 3.2 Herd of Models},
  year = {2024},
  url = {https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}

License

Released under the Llama 3.2 Community License ("Built with Llama"). This is the binding upstream license because the training transcripts were generated by Llama 3.2, and the resulting model is treated as a derivative work of Llama Materials for licensing purposes.

In addition to the Llama 3.2 terms, this model also inherits attribution and use requirements from its other parents:

  • Parakeet-TDT-0.6B-v3 (CC-BY-4.0, NVIDIA) — base model
  • ATCO2 corpus (CC-BY-4.0) — real-data anchor (train split)
  • ATCOSIM corpus (research use; see source): real-data anchor (train split)
  • radiotalk-us-audio-tada-noisy (Llama 3.2 Community License — transcripts generated by Llama 3.2, audio synthesized via Tada) — synthetic training audio

To redistribute or deploy:

  1. Include a copy of the Llama 3.2 Community License.
  2. Display "Built with Llama" in your product / user interface / about page.
  3. Comply with the Llama 3.2 Acceptable Use Policy.
  4. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.
