rasr-parakeet-v2

ATC ASR model targeting US air traffic control. A continued finetune of twangodev/rasr-parakeet-v1 on real US ATC audio from the TartanAviation corpus (CMU AirLab). v2 of the rasr toolkit.

v1 was strong on European ATC but weak on US ATC, since its only real-audio anchor was European. v2 adds about 29 hours of real US tower audio to close that gap.

Headline

Metric | v2 | v1
liveATC WER (US) | 0.251 | 0.311
liveATC CER (US) | 0.174 | 0.207
ATCO2 val WER (EU) | 0.128 | 0.125
ATCO2 val CER (EU) | 0.084 | 0.078

About 19% relative WER reduction on US ATC (liveATC) over v1, with European performance roughly flat (ATCO2 val WER 0.125 to 0.128, CER 0.078 to 0.084). Both models are scored with rasr eval run on the same references, so the comparison is direct. One caveat: readback numeric WER regressed (see Limitations).
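The relative figures follow directly from the Headline numbers. A minimal arithmetic check, where the helper name `relative_reduction` is illustrative and not part of rasr:

```python
def relative_reduction(old: float, new: float) -> float:
    """Fractional reduction of `new` relative to `old` (positive = improvement)."""
    return (old - new) / old

# Values copied from the Headline table (v1 -> v2).
us_wer = relative_reduction(0.311, 0.251)
us_cer = relative_reduction(0.207, 0.174)
eu_wer = relative_reduction(0.125, 0.128)

print(f"US WER: {us_wer:.1%}")  # US WER: 19.3%
print(f"US CER: {us_cer:.1%}")  # US CER: 15.9%
print(f"EU WER: {eu_wer:.1%}")  # EU WER: -2.4%
```

A negative value means the metric got slightly worse, which is the small European drift noted above.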

Quick start

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v2")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)

Or via the rasr eval toolkit:

pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v2 \
  -d hf:SAadettin-BERber/liveATC_merged:train \
  --language en --batch-size 16

Architecture

  • Base: nvidia/parakeet-tdt-0.6b-v3 (FastConformer encoder, TDT decoder, 0.6B params)
  • Parent: twangodev/rasr-parakeet-v1 (this model is a continued finetune of v1)
  • Tokenizer: kept from base, SentencePiece BPE 8192 tokens
  • Sample rate: 16 kHz mono
  • Max input duration: 18 seconds (longer inputs may degrade accuracy and increase TDT joint-network memory use)
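One way to respect the 18-second limit is to cut long recordings into shorter spans before transcription. The sketch below only computes sample index ranges for 16 kHz audio; `chunk_spans` is a hypothetical helper, not part of NeMo or rasr, and fixed-length cuts can split a word, so splitting at silences would likely work better in practice.

```python
SAMPLE_RATE = 16_000  # Hz, from the Architecture list
MAX_SECONDS = 18.0    # max input duration, from the Architecture list

def chunk_spans(n_samples: int, sample_rate: int = SAMPLE_RATE,
                max_seconds: float = MAX_SECONDS) -> list[tuple[int, int]]:
    """Return (start, end) sample indices covering the clip in chunks <= max_seconds."""
    max_len = int(sample_rate * max_seconds)
    return [(start, min(start + max_len, n_samples))
            for start in range(0, n_samples, max_len)]

# A 40 s clip at 16 kHz splits into 18 s + 18 s + 4 s.
spans = chunk_spans(40 * SAMPLE_RATE)
print(spans)  # [(0, 288000), (288000, 576000), (576000, 640000)]
```

Each span can then be sliced out of the waveform and written to its own file for `model.transcribe`.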

Training data

v2 starts from v1's weights and continues training on the mix below. v1's own training (synthetic transcripts plus a European real anchor) is inherited in the weights; see the v1 card.

Source | Type | Role
TartanAviation, high inter-model agreement subset | Real US ATC (KAGC tower) | US training audio: 35,240 clips, 28.8 h
jlvdoorn/atco2-asr-atcosim (train) | Real EU ATC plus simulator | European rehearsal
jlvdoorn/atco2-asr (train) | Real EU ATC | European rehearsal

TartanAviation ships no transcripts. The training targets are pseudo labels from a weighted ROVER fusion of three ASR models (nvidia/parakeet-tdt-0.6b-v2, nvidia/canary-qwen-2.5b, jlvdoorn/whisper-large-v3-atco2-asr) plus ADS-B callsign snapping, filtered to the highest inter-model agreement (twangodev/tartanaviation-atc-labels). On a human-reviewed sample this subset is about 98% acceptable, but these are not human labels. Evaluation clips are held out by date.
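To illustrate what "filtered to the highest inter-model agreement" can mean, here is a simplified sketch that scores each clip by mean pairwise word-level similarity across three hypotheses and keeps clips above a cutoff. This is not the weighted ROVER fusion or the ADS-B callsign snapping used for the actual labels; the `THRESHOLD` value, the function names, and the example transcripts are all assumptions made for illustration.

```python
from itertools import combinations

def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word tokens (single-row dynamic program)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,            # deletion
                      row[j - 1] + 1,        # insertion
                      prev_diag + (r != h))  # substitution or match
            prev_diag, row[j] = row[j], cur
    return row[-1]

def agreement(hyps: list[str]) -> float:
    """Mean pairwise (1 - normalized word edit distance); needs >= 2 hypotheses."""
    scores = []
    for a, b in combinations(hyps, 2):
        wa, wb = a.lower().split(), b.lower().split()
        denom = max(len(wa), len(wb), 1)
        scores.append(1.0 - word_edit_distance(wa, wb) / denom)
    return sum(scores) / len(scores)

THRESHOLD = 0.9  # hypothetical cutoff; the card does not state the one used

clip_hyps = {
    "clip_a": ["cleared to land runway two eight"] * 3,
    "clip_b": ["cleared to land runway two eight",
               "cleared to land runway two nine",
               "clear land runway to eight"],
}
kept = [cid for cid, hyps in clip_hyps.items() if agreement(hyps) >= THRESHOLD]
print(kept)  # ['clip_a']
```

Clips where the three models disagree, such as `clip_b`, are dropped rather than corrected, which trades training volume for label quality.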

TartanAviation components used:

Llama 3.2 attribution

This model is "Built with Llama" under the Llama 3.2 Community License. v2 is a continued finetune of v1, whose supervised targets were ATC dialogue transcripts generated by Llama 3.2; v2 is therefore a derivative of those Llama-trained weights. The new US targets in v2 were not generated by Llama (they are the pseudo labels described above).

TartanAviation attribution

US training audio is from the TartanAviation dataset (CMU AirLab, CC-BY-4.0). Please credit CMU AirLab and cite:

@article{patrikar2024tartanaviation,
  title={TartanAviation: Image, Speech, and ADS-B Trajectory Datasets for Terminal Airspace Operations},
  author={Jay Patrikar and Joao Dantas and Brady Moon and Milad Hamidi and Sourish Ghosh and Nikhil Keetha and Ian Higgins and Atharva Chandak and Takashi Yoneyama and Sebastian Scherer},
  year={2024},
  eprint={2403.03372},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/pdf/2403.03372.pdf}
}

Training recipe

Continued finetune from v1 (rasr-parakeet-v1.nemo). Recipe: training_recipe.yaml (this repo).

Hyperparameter | Value
Initialize from | twangodev/rasr-parakeet-v1 (continued finetune, not from base)
Optimizer | AdamW, betas (0.9, 0.98), weight_decay 1e-3
Learning rate | 5e-5
Schedule | CosineAnnealing, warmup 300 steps, min_lr 1e-6
Batch size | 48
Precision | bf16-mixed
Max steps | 10,000
Augmentation | SpecAugment (default), speed perturb 0.95 to 1.05
Max audio duration | 18.0 s
Mixing | weighted manifest concat (TartanAviation x1, ATCO2 train x1, ATCO2+ATCOSIM train x1)
Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB)
Wall clock | about 1 hour
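For intuition about the schedule row, the sketch below evaluates a linear-warmup, cosine-decay curve using the recipe's values. It is a simplified approximation; NeMo's CosineAnnealing scheduler may differ in details such as how the warmup ramp is offset, so treat training_recipe.yaml as authoritative.

```python
import math

PEAK_LR, MIN_LR = 5e-5, 1e-6          # from the recipe table
WARMUP_STEPS, MAX_STEPS = 300, 10_000  # from the recipe table

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 150, 300, 5150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at step 300, passes the midpoint between peak and floor at step 5,150, and reaches min_lr at the final step.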

Strengths

  • Lower US ATC WER: about 19% relative WER and 16% relative CER reduction on liveATC vs v1.
  • Real US adaptation: held-out US in-domain WER drops from 0.13 to 0.05; the model learns real US tower phraseology instead of substituting European or airline priors.
  • European performance maintained: ATCO2 val WER 0.125 to 0.128, no catastrophic forgetting.
  • Small footprint: 0.6B params, fits in 4 GB VRAM, same speed and size as v1.

Limitations

  1. Numeric readback regression. Digit WER got worse than v1: US numeric 0.371 to 0.421, EU numeric 0.050 to 0.064 (rasr eval run). v1's strength was numeric accuracy, and v2 trades some of it for the overall gain. A second normalization pipeline showed numeric roughly flat rather than worse, so the magnitude is uncertain, but it did not improve. For digit-critical use, evaluate v1 as well and layer hot-word or grammar constraints on number fields.
  2. Narrow US base. The US audio is 28.8 h from a single general-aviation tower (KAGC). It generalized to liveATC well, but expect weaker results on US airspace very different from a GA tower (busy approach or center).
  3. US eval is relative. liveATC labels are auto-generated, so the 19% is a relative v1 to v2 delta on identical references, not an absolute benchmarked WER. No fully public human-labeled US ATC ASR benchmark currently exists.
  4. Pseudo-label training. US supervised targets are model-generated, not human labels, so residual systematic labeler errors (especially digits) may be inherited.
  5. Inherited v1 biases. Operator substitution and European-prior artifacts from v1 may persist on out-of-distribution audio.

Recommended usage

  • US ATC: v2 is the better default for overall transcription. Use inference-time hot-word biasing against a known callsign and airport-name vocabulary (NeMo TDT supports it via change_decoding_strategy()).
  • Numeric or readback-critical content: compare against v1, and consider grammar or format constraints on digit fields.
  • European ATC: v1 and v2 are about equivalent; either works.
  • Safety-critical applications: this is a research and development checkpoint, not a safety-certified ATC transcription system. Always layer confidence-based rejection.
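For digit-critical content, one lightweight way to "compare against v1" is to extract the digit sequence from each model's transcript and flag utterances where they differ. The sketch below is an assumption-laden illustration: `DIGIT_WORDS`, `digits_of`, and `needs_review` are hypothetical names, the word list covers only single spoken digits (no "hundred" or "thousand" handling), and the transcripts are made up.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
    "four": "4", "five": "5", "fife": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9", "niner": "9",
}

def digits_of(text: str) -> str:
    """Concatenate digits from spoken digit words and literal digit characters."""
    out = []
    for token in text.lower().split():
        token = token.strip(".,")
        if token in DIGIT_WORDS:
            out.append(DIGIT_WORDS[token])
        else:
            out.extend(ch for ch in token if ch.isdigit())
    return "".join(out)

def needs_review(v1_text: str, v2_text: str) -> bool:
    """Flag an utterance when the two models disagree on its digit sequence."""
    return digits_of(v1_text) != digits_of(v2_text)

print(digits_of("N123AB runway two eight left"))                           # 12328
print(needs_review("climb one two thousand", "climb one three thousand"))  # True
print(needs_review("contact tower 118.7",
                   "contact tower one one eight point seven"))             # False
```

Flagged utterances can be routed to a human or rejected outright; agreement between the two models does not prove the digits are correct.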

Citation

@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}

Base model:

@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}

US training audio (TartanAviation) and Llama 3.2 (inherited from v1): see the attribution sections above.

License

Released under the Llama 3.2 Community License ("Built with Llama"). This is the binding upstream license: v2 is a continued finetune of v1, whose training transcripts were generated by Llama 3.2, so the weights are treated as a derivative work of Llama Materials.

This model also inherits attribution and use requirements from its other parents:

  • Parakeet-TDT-0.6B-v3 (CC-BY-4.0, NVIDIA), base architecture
  • TartanAviation (CC-BY-4.0, CMU AirLab), real US training audio; credit CMU AirLab and cite the paper above
  • ATCO2 corpus (CC-BY-4.0), European real anchor (train split)
  • ATCOSIM corpus (research use; see source)
  • radiotalk-us-audio-tada-noisy (Llama 3.2 Community License), inherited from v1

To redistribute or deploy:

  1. Include a copy of the Llama 3.2 Community License.
  2. Display "Built with Llama" in your product, user interface, or about page.
  3. Provide CC-BY-4.0 attribution to CMU AirLab (TartanAviation), NVIDIA (Parakeet), and the ATCO2 authors.
  4. Comply with the Llama 3.2 Acceptable Use Policy.
  5. If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.
