rasr-parakeet-v2

ATC ASR model targeting US air traffic control. A continued finetune of twangodev/rasr-parakeet-v1 on real US ATC audio from the TartanAviation corpus (CMU AirLab). v2 of the rasr toolkit.

v1 was strong on European ATC but weak on US ATC, since its only real-audio anchor was European. v2 adds about 29 hours of real US tower audio to close that gap.

Headline

Metric	v2	v1
liveATC WER (US)	0.251	0.311
liveATC CER (US)	0.174	0.207
ATCO2 val WER (EU)	0.128	0.125
ATCO2 val CER (EU)	0.084	0.078

About 19% relative WER reduction on US ATC (liveATC) over v1, with European performance held roughly flat (ATCO2 val WER moves by 0.003 absolute). Both columns come from rasr eval run on the same references, so the comparison is direct. One caveat: readback numeric WER regressed (see Limitations).
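The relative figures can be reproduced from the table alone. A minimal sketch of the arithmetic; the helper name `relative_reduction` is ours for illustration and is not part of rasr:

```python
# Relative error reduction from v1 to v2, computed from the headline table.
# Positive values mean v2 is better; negative values mean v2 regressed.
def relative_reduction(v1: float, v2: float) -> float:
    return (v1 - v2) / v1


# (v1, v2) pairs copied from the headline table.
headline = {
    "liveATC WER (US)": (0.311, 0.251),
    "liveATC CER (US)": (0.207, 0.174),
    "ATCO2 val WER (EU)": (0.125, 0.128),
    "ATCO2 val CER (EU)": (0.078, 0.084),
}

reductions = {
    name: round(100 * relative_reduction(v1, v2), 1)
    for name, (v1, v2) in headline.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:+.1f}% relative")
```

The European rows come out slightly negative in relative terms, which corresponds to absolute changes of 0.003 WER and 0.006 CER.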

Quick start

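import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and restore it.
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v2")
# transcribe() takes a list of audio paths; use 16 kHz mono clips of at most 18 s.
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)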

Or via the rasr eval toolkit:

pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v2 \
  -d hf:SAadettin-BERber/liveATC_merged:train \
  --language en --batch-size 16

Architecture

Base: nvidia/parakeet-tdt-0.6b-v3 (FastConformer encoder, TDT decoder, 0.6B params)
Parent: twangodev/rasr-parakeet-v1 (this model is a continued finetune of v1)
Tokenizer: kept from base, SentencePiece BPE 8192 tokens
Sample rate: 16 kHz mono
Max input duration: 18 seconds, matching the training cap (longer inputs may degrade accuracy; TDT joint memory also grows with input length)
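For recordings longer than the 18 second limit, one option is to split the audio before transcription. The sketch below only computes window boundaries in samples at the stated 16 kHz rate. The function `chunk_bounds` is a hypothetical helper, not part of rasr or NeMo, and fixed-size windows can cut a transmission mid-word, so silence-based or VAD-based splitting may work better in practice.

```python
SAMPLE_RATE_HZ = 16_000  # from the Architecture list
MAX_SECONDS = 18.0       # from the Architecture list


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE_HZ, max_seconds=MAX_SECONDS):
    """Return (start, end) sample indices that cover a clip in
    consecutive windows no longer than max_seconds."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    window = int(sample_rate * max_seconds)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 40 second clip at 16 kHz is covered by two full windows and one remainder.
bounds = chunk_bounds(40 * SAMPLE_RATE_HZ)
print(bounds)
```

Each `(start, end)` pair can then be used to slice the waveform and write a separate clip for `transcribe()`.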

Training data

v2 starts from v1's weights and continues training on the mix below. v1's own training (synthetic transcripts plus a European real anchor) is inherited in the weights; see the v1 card.

Source	Type	Role
TartanAviation, high inter-model agreement subset	Real US ATC (KAGC tower)	US training audio: 35,240 clips, 28.8 h
`jlvdoorn/atco2-asr-atcosim` (train)	Real EU ATC plus simulator	European rehearsal
`jlvdoorn/atco2-asr` (train)	Real EU ATC	European rehearsal

TartanAviation ships no transcripts. The training targets are pseudo labels from a weighted ROVER fusion of three ASR models (nvidia/parakeet-tdt-0.6b-v2, nvidia/canary-qwen-2.5b, jlvdoorn/whisper-large-v3-atco2-asr) plus ADS-B callsign snapping, filtered to the highest inter-model agreement (twangodev/tartanaviation-atc-labels). On a human-reviewed sample this subset is about 98% acceptable, but these are not human labels. Evaluation clips are held out by date.
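To make the idea of filtering by inter-model agreement concrete, here is a simplified sketch. It is not ROVER and not the pipeline used to build twangodev/tartanaviation-atc-labels: it scores a clip by the mean pairwise word-level similarity of the hypotheses and keeps clips above a threshold. The function names, the example transcripts, and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def pairwise_agreement(hypotheses):
    """Mean over model pairs of 1 - (edit distance / longer length).
    Expects at least two hypotheses."""
    scores = []
    for a, b in combinations(hypotheses, 2):
        wa, wb = a.lower().split(), b.lower().split()
        denom = max(len(wa), len(wb), 1)
        scores.append(1.0 - word_edit_distance(wa, wb) / denom)
    return sum(scores) / len(scores)


# Hypothetical outputs from three ASR models for one clip.
hyps = [
    "november one two three cleared to land",
    "november one two three cleared to land",
    "november one two tree cleared to land",
]
AGREEMENT_THRESHOLD = 0.9  # hypothetical cutoff, not the value used for v2
keep = pairwise_agreement(hyps) >= AGREEMENT_THRESHOLD
```

A high score only says the models agree with each other. If all three share a systematic error, the clip still passes, which is the risk noted under Limitations.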

TartanAviation components used:

twangodev/tartanaviation-atc-adsb-utterances: segmented US ATC audio (16 kHz, with per-clip ADS-B tracks).
twangodev/tartanaviation-atc-labels: the pseudo-label transcripts and agreement scores used as supervised targets.
twangodev/tartanaviation-atc-adsb: ADS-B trajectory data, used for the callsign snapping in the labels.

Llama 3.2 attribution

This model is "Built with Llama" under the Llama 3.2 Community License. v2 is a continued finetune of v1, whose supervised targets were ATC dialogue transcripts generated by Llama 3.2; v2 is therefore a derivative of those Llama-trained weights. The new US targets in v2 were not generated by Llama (they are the pseudo labels described above).

TartanAviation attribution

US training audio is from the TartanAviation dataset (CMU AirLab, CC-BY-4.0). Please credit CMU AirLab and cite:

@article{patrikar2024tartanaviation,
  title={TartanAviation: Image, Speech, and ADS-B Trajectory Datasets for Terminal Airspace Operations},
  author={Jay Patrikar and Joao Dantas and Brady Moon and Milad Hamidi and Sourish Ghosh and Nikhil Keetha and Ian Higgins and Atharva Chandak and Takashi Yoneyama and Sebastian Scherer},
  year={2024},
  eprint={2403.03372},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/pdf/2403.03372.pdf}
}

Training recipe

Continued finetune from v1 (rasr-parakeet-v1.nemo). Recipe: training_recipe.yaml (this repo).

Hyperparameter	Value
Initialize from	`twangodev/rasr-parakeet-v1` (continued finetune, not from base)
Optimizer	AdamW, betas (0.9, 0.98), weight_decay 1e-3
Learning rate	5e-5
Schedule	CosineAnnealing, warmup 300 steps, min_lr 1e-6
Batch size	48
Precision	bf16-mixed
Max steps	10,000
Augmentation	SpecAugment (default), speed perturb 0.95 to 1.05
Max audio duration	18.0 s
Mixing	weighted manifest concat (TartanAviation x1, ATCO2 train x1, ATCO2+ATCOSIM train x1)
Hardware	NVIDIA RTX PRO 6000 Blackwell (96 GB)
Wall clock	about 1 hour
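The learning rate schedule in the table can be sketched as a linear warmup followed by cosine decay. This is a generic formulation using the table's values; NeMo's CosineAnnealing implementation may differ in details such as how the warmup steps are counted, so treat the function below as an approximation rather than the exact training curve.

```python
import math


def lr_at(step, peak=5e-5, min_lr=1e-6, warmup=300, max_steps=10_000):
    """Approximate learning rate at a given optimizer step:
    linear warmup to peak, then cosine decay to min_lr."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (max_steps - warmup))
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

Under this formulation the rate peaks at 5e-5 at step 300 and reaches the 1e-6 floor at step 10,000.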

Strengths

Lower US ATC WER: about 19% relative WER and 16% relative CER reduction on liveATC vs v1.
Real US adaptation: held-out US in-domain WER drops from 0.13 to 0.05; the model learns real US tower phraseology instead of substituting European or airline priors.
European performance maintained: ATCO2 val WER 0.125 to 0.128, no catastrophic forgetting.
Small footprint: 0.6B params, fits in 4 GB VRAM, same speed and size as v1.

Limitations

Numeric readback regression. Digit WER is worse than in v1: US numeric 0.371 to 0.421, EU numeric 0.050 to 0.064 (rasr eval run). Numeric accuracy was v1's strength, and v2 trades some of it for the overall gain. A second normalization pipeline showed numeric WER roughly flat rather than worse, so the size of the regression is uncertain, but numeric accuracy did not improve. For digit-critical use, evaluate v1 as well and layer hot-word or grammar constraints on number fields.
Narrow US base. The US audio is 28.8 h from a single general-aviation tower (KAGC). It generalized to liveATC well, but expect weaker results on US airspace very different from a GA tower (busy approach or center).
US eval is relative. liveATC labels are auto-generated, so the 19% is a relative v1 to v2 delta on identical references, not an absolute benchmarked WER. No fully public human-labeled US ATC ASR benchmark currently exists.
Pseudo-label training. US supervised targets are model-generated, not human labels, so residual systematic labeler errors (especially digits) may be inherited.
Inherited v1 biases. Operator substitution and European-prior artifacts from v1 may persist on out-of-distribution audio.

Recommended usage

US ATC: v2 is the better default for overall transcription. Use inference-time hot-word biasing against a known callsign and airport-name vocabulary (NeMo TDT supports it via change_decoding_strategy()).
Numeric or readback-critical content: compare against v1, and consider grammar or format constraints on digit fields.
European ATC: v1 and v2 are about equivalent; either works.
Safety-critical applications: this is a research and development checkpoint, not a safety-certified ATC transcription system. Always layer confidence-based rejection.
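Confidence-based rejection can be as simple as a threshold gate in front of any downstream consumer. The sketch below assumes a per-utterance confidence score in [0, 1] is already available; how to obtain one from NeMo is not covered here, and the `Hypothesis` class, the example values, and the 0.85 threshold are hypothetical. The threshold should be tuned on held-out data for the target airspace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    text: str
    confidence: float  # hypothetical per-utterance score in [0, 1]


def accept_or_flag(hyp: Hypothesis, threshold: float = 0.85) -> Optional[str]:
    """Return the transcript if confidence clears the threshold,
    otherwise None so the clip can be routed to human review."""
    return hyp.text if hyp.confidence >= threshold else None


# Hypothetical transcripts and scores, for illustration only.
results = [
    Hypothesis("cessna one two three cleared to land runway two eight", 0.93),
    Hypothesis("november four five contact departure", 0.41),
]
accepted = [accept_or_flag(h) for h in results]
```

Rejected clips return `None` rather than a best guess, which keeps low-confidence digits out of automated pipelines.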

Citation

@software{rasr,
  author = {Ding, James},
  title = {rasr: ATC ASR finetuning toolkit},
  url = {https://github.com/twangodev/rasr},
  year = {2026}
}

Base model:

@misc{parakeet-tdt,
  author = {NVIDIA},
  title = {Parakeet-TDT-0.6B-v3},
  url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}

US training audio (TartanAviation) and Llama 3.2 (inherited from v1): see the attribution sections above.

License

Released under the Llama 3.2 Community License ("Built with Llama"). This is the binding upstream license: v2 is a continued finetune of v1, whose training transcripts were generated by Llama 3.2, so the weights are treated as a derivative work of Llama Materials.

This model also inherits attribution and use requirements from its other parents:

Parakeet-TDT-0.6B-v3 (CC-BY-4.0, NVIDIA), base architecture
TartanAviation (CC-BY-4.0, CMU AirLab), real US training audio; credit CMU AirLab and cite the paper above
ATCO2 corpus (CC-BY-4.0), European real anchor (train split)
ATCOSIM corpus (research use; see source)
radiotalk-us-audio-tada-noisy (Llama 3.2 Community License), inherited from v1

To redistribute or deploy:

Include a copy of the Llama 3.2 Community License.
Display "Built with Llama" in your product, user interface, or about page.
Provide CC-BY-4.0 attribution to CMU AirLab (TartanAviation), NVIDIA (Parakeet), and the ATCO2 authors.
Comply with the Llama 3.2 Acceptable Use Policy.
If your service exceeds 700M monthly active users, request a separate commercial license from Meta.

This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.
