rasr-parakeet-v2
ATC ASR model targeting US air traffic control. A continued finetune of twangodev/rasr-parakeet-v1 on real US ATC audio from the TartanAviation corpus (CMU AirLab). v2 of the rasr toolkit.
v1 was strong on European ATC but weak on US ATC, since its only real-audio anchor was European. v2 adds about 29 hours of real US tower audio to close that gap.
Headline
| Metric | v2 | v1 |
|---|---|---|
| liveATC WER (US) | 0.251 | 0.311 |
| liveATC CER (US) | 0.174 | 0.207 |
| ATCO2 val WER (EU) | 0.128 | 0.125 |
| ATCO2 val CER (EU) | 0.084 | 0.078 |
About 19% relative WER reduction on US ATC (liveATC) over v1, with European WER roughly flat (0.125 to 0.128; CER rose slightly, 0.078 to 0.084). Both columns were scored with rasr eval run on the same references, so the comparison is direct. One caveat: numeric readback WER regressed (see Limitations).
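The relative figures quoted here follow directly from the Headline table. A minimal sketch of that arithmetic, with the table values copied in by hand (the names `scores` and `relative_reduction` are illustrative only):

```python
# Absolute error rates copied from the Headline table, as (v1, v2) pairs.
scores = {
    "liveATC WER (US)": (0.311, 0.251),
    "liveATC CER (US)": (0.207, 0.174),
    "ATCO2 val WER (EU)": (0.125, 0.128),
    "ATCO2 val CER (EU)": (0.078, 0.084),
}


def relative_reduction(v1: float, v2: float) -> float:
    """Fraction of v1's error removed by v2 (negative means v2 is worse)."""
    return (v1 - v2) / v1


reductions = {name: relative_reduction(v1, v2) for name, (v1, v2) in scores.items()}
for name, value in reductions.items():
    print(f"{name}: {value:+.1%}")
```

The two liveATC rows come out at roughly 19% and 16%, matching the numbers in this card. The two ATCO2 rows come out slightly negative (about -2% and -8%), which is the small European drift described above.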
Quick start
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("twangodev/rasr-parakeet-v2")
result = model.transcribe(["atc_clip.wav"])
print(result[0].text)
Or via the rasr eval toolkit:
pip install rasr
rasr eval run \
  -m nemo:hf://twangodev/rasr-parakeet-v2 \
  -d hf:SAadettin-BERber/liveATC_merged:train \
  --language en --batch-size 16
Architecture
- Base: nvidia/parakeet-tdt-0.6b-v3 (FastConformer encoder, TDT decoder, 0.6B params)
- Parent: twangodev/rasr-parakeet-v1 (this model is a continued finetune of v1)
- Tokenizer: kept from base, SentencePiece BPE, 8192 tokens
- Sample rate: 16 kHz mono
- Max input duration: 18 seconds, matching the training cap (longer inputs may degrade accuracy and increase TDT joint memory use)
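One way to respect the 18 second limit on longer recordings is to split them into windows before transcription. The sketch below only computes sample boundaries at 16 kHz; `chunk_bounds` and its parameters are hypothetical helpers, not part of NeMo or rasr, and fixed-length windows are an assumption rather than a method this card prescribes.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000,
                 max_seconds: float = 18.0, overlap_seconds: float = 0.0):
    """Return (start, end) sample indices for windows of at most max_seconds."""
    window = int(max_seconds * sample_rate)
    step = window - int(overlap_seconds * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and longer than the overlap")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the clip
        start += step
    return bounds


# A 40 s clip at 16 kHz: two full 18 s windows plus a 4 s remainder.
print(chunk_bounds(40 * 16_000))
```

Fixed windows can cut through a word at the boundary. Splitting on silence, or using a small overlap and merging the transcripts, is likely to work better on real ATC audio.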
Training data
v2 starts from v1's weights and continues training on the mix below. v1's own training (synthetic transcripts plus a European real anchor) is inherited in the weights; see the v1 card.
| Source | Type | Role |
|---|---|---|
| TartanAviation, high inter-model agreement subset | Real US ATC (KAGC tower) | US training audio: 35,240 clips, 28.8 h |
| jlvdoorn/atco2-asr-atcosim (train) | Real EU ATC plus simulator | European rehearsal |
| jlvdoorn/atco2-asr (train) | Real EU ATC | European rehearsal |
TartanAviation ships no transcripts. The training targets are pseudo labels from a weighted ROVER fusion of three ASR models (nvidia/parakeet-tdt-0.6b-v2, nvidia/canary-qwen-2.5b, jlvdoorn/whisper-large-v3-atco2-asr) plus ADS-B callsign snapping, filtered to the highest inter-model agreement (twangodev/tartanaviation-atc-labels). On a human-reviewed sample this subset is about 98% acceptable, but these are not human labels. Evaluation clips are held out by date.
TartanAviation components used:
- twangodev/tartanaviation-atc-adsb-utterances: segmented US ATC audio (16 kHz, with per-clip ADS-B tracks).
- twangodev/tartanaviation-atc-labels: the pseudo-label transcripts and agreement scores used as supervised targets.
- twangodev/tartanaviation-atc-adsb: ADS-B trajectory data, used for the callsign snapping in the labels.
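To illustrate the idea of filtering pseudo labels by inter-model agreement, here is a minimal sketch. It is not the weighted ROVER fusion or the scoring used to build twangodev/tartanaviation-atc-labels: the `agreement` metric (mean pairwise word-level similarity), the 0.9 threshold, and the sample transcripts are all assumptions made for this example.

```python
from itertools import combinations


def edit_distance(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance using a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def agreement(hypotheses: list[str]) -> float:
    """Mean pairwise word agreement in [0, 1]; 1.0 means all hypotheses match."""
    scores = []
    for x, y in combinations(hypotheses, 2):
        xs, ys = x.split(), y.split()
        longest = max(len(xs), len(ys), 1)
        scores.append(1.0 - edit_distance(xs, ys) / longest)
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical outputs from three ASR models for two clips.
clips = {
    "clip_a": ["cessna one two three cleared to land"] * 3,
    "clip_b": [
        "cessna one two three cleared to land",
        "cessna one two tree clear to land",
        "sesna one two three cleared land",
    ],
}

THRESHOLD = 0.9  # illustrative cut-off, not the value used for this model
kept = [clip_id for clip_id, hyps in clips.items() if agreement(hyps) >= THRESHOLD]
print(kept)
```

Only the clip where all three hypotheses match survives the filter. A real pipeline would also need text normalization before comparison, since spelling variants of digits and callsigns otherwise count as disagreement.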
Llama 3.2 attribution
This model is "Built with Llama" under the Llama 3.2 Community License. v2 is a continued finetune of v1, whose supervised targets were ATC dialogue transcripts generated by Llama 3.2; v2 is therefore a derivative of those Llama-trained weights. The new US targets in v2 were not generated by Llama (they are the pseudo labels described above).
TartanAviation attribution
US training audio is from the TartanAviation dataset (CMU AirLab, CC-BY-4.0). Please credit CMU AirLab and cite:
@article{patrikar2024tartanaviation,
title={TartanAviation: Image, Speech, and ADS-B Trajectory Datasets for Terminal Airspace Operations},
author={Jay Patrikar and Joao Dantas and Brady Moon and Milad Hamidi and Sourish Ghosh and Nikhil Keetha and Ian Higgins and Atharva Chandak and Takashi Yoneyama and Sebastian Scherer},
year={2024},
eprint={2403.03372},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/pdf/2403.03372.pdf}
}
Training recipe
Continued finetune from v1 (rasr-parakeet-v1.nemo). Recipe: training_recipe.yaml (this repo).
| Hyperparameter | Value |
|---|---|
| Initialize from | twangodev/rasr-parakeet-v1 (continued finetune, not from base) |
| Optimizer | AdamW, betas (0.9, 0.98), weight_decay 1e-3 |
| Learning rate | 5e-5 |
| Schedule | CosineAnnealing, warmup 300 steps, min_lr 1e-6 |
| Batch size | 48 |
| Precision | bf16-mixed |
| Max steps | 10,000 |
| Augmentation | SpecAugment (default), speed perturb 0.95 to 1.05 |
| Max audio duration | 18.0 s |
| Mixing | weighted manifest concat (TartanAviation x1, ATCO2 train x1, ATCO2+ATCOSIM train x1) |
| Hardware | NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| Wall clock | about 1 hour |
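The learning-rate schedule in the table can be sketched as a function of the step. This is the textbook form (linear warmup, then cosine decay to the floor at the final step) with the table's values as defaults. NeMo's CosineAnnealing implementation may differ in warmup details, so treat `lr_at` as an approximation rather than the exact training curve.

```python
import math


def lr_at(step: int, base_lr: float = 5e-5, min_lr: float = 1e-6,
          warmup_steps: int = 300, max_steps: int = 10_000) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup_steps
    decay_steps = max(1, max_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 5e-5 at step 300, passes the midpoint between peak and floor halfway through the decay, and ends at 1e-6 at step 10,000.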
Strengths
- Lower US ATC WER: about 19% relative WER and 16% relative CER reduction on liveATC vs v1.
- Real US adaptation: held-out US in-domain WER drops from 0.13 to 0.05; the model learns real US tower phraseology instead of substituting European or airline priors.
- European performance maintained: ATCO2 val WER 0.125 to 0.128, no catastrophic forgetting.
- Small footprint: 0.6B params, fits in 4 GB VRAM, same speed and size as v1.
Limitations
- Numeric readback regression. Digit WER got worse than v1: US numeric 0.371 to 0.421, EU numeric 0.050 to 0.064 (rasr eval run). v1's strength was numeric accuracy, and v2 trades some of it for the overall gain. A second normalization pipeline showed numeric roughly flat rather than worse, so the magnitude is uncertain, but it did not improve. For digit-critical use, evaluate v1 as well and layer hot-word or grammar constraints on number fields.
- Narrow US base. The US audio is 28.8 h from a single general-aviation tower (KAGC). It generalized to liveATC well, but expect weaker results on US airspace very different from a GA tower (busy approach or center).
- US eval is relative. liveATC labels are auto-generated, so the 19% is a relative v1 to v2 delta on identical references, not an absolute benchmarked WER. No fully public human-labeled US ATC ASR benchmark currently exists.
- Pseudo-label training. US supervised targets are model-generated, not human labels, so residual systematic labeler errors (especially digits) may be inherited.
- Inherited v1 biases. Operator substitution and European-prior artifacts from v1 may persist on out-of-distribution audio.
Recommended usage
- US ATC: v2 is the better default for overall transcription. Use inference-time hot-word biasing against a known callsign and airport-name vocabulary (NeMo TDT supports it via change_decoding_strategy()).
- Numeric or readback-critical content: compare against v1, and consider grammar or format constraints on digit fields.
- European ATC: v1 and v2 are about equivalent; either works.
- Safety-critical applications: this is a research and development checkpoint, not a safety-certified ATC transcription system. Always layer confidence-based rejection.
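Confidence-based rejection, as recommended in the last item, can be as simple as a gate in front of any downstream consumer. The sketch below assumes per-word confidences are already available as floats in [0, 1]; how they are obtained from the model is not covered here, and the `gate` function, the `Decision` type, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    accepted: bool
    reason: str


def gate(text: str, word_confidences: list[float],
         min_word: float = 0.5, min_mean: float = 0.8) -> Decision:
    """Reject a transcript if any word, or the average, is below threshold."""
    if not text.strip() or not word_confidences:
        return Decision(False, "empty transcript or no confidences")

    lowest = min(word_confidences)
    mean = sum(word_confidences) / len(word_confidences)
    if lowest < min_word:
        return Decision(False, f"word confidence {lowest:.2f} below {min_word:.2f}")
    if mean < min_mean:
        return Decision(False, f"mean confidence {mean:.2f} below {min_mean:.2f}")
    return Decision(True, "ok")


# Illustrative inputs; the confidence values are made up for this example.
print(gate("cleared to land", [0.95, 0.90, 0.97]))
print(gate("climb flight level three two zero", [0.9, 0.9, 0.9, 0.4, 0.9, 0.9]))
```

The per-word floor matters for ATC because a single low-confidence digit can change the meaning of an instruction even when the average looks healthy. Rejected transcripts would typically be routed to a human or flagged rather than silently dropped.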
Citation
@software{rasr,
author = {Ding, James},
title = {rasr: ATC ASR finetuning toolkit},
url = {https://github.com/twangodev/rasr},
year = {2026}
}
Base model:
@misc{parakeet-tdt,
author = {NVIDIA},
title = {Parakeet-TDT-0.6B-v3},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}
US training audio (TartanAviation) and Llama 3.2 (inherited from v1): see the attribution sections above.
License
Released under the Llama 3.2 Community License ("Built with Llama"). This is the binding upstream license: v2 is a continued finetune of v1, whose training transcripts were generated by Llama 3.2, so the weights are treated as a derivative work of Llama Materials.
This model also inherits attribution and use requirements from its other parents:
- Parakeet-TDT-0.6B-v3 (CC-BY-4.0, NVIDIA), base architecture
- TartanAviation (CC-BY-4.0, CMU AirLab), real US training audio; credit CMU AirLab and cite the paper above
- ATCO2 corpus (CC-BY-4.0), European real anchor (train split)
- ATCOSIM corpus (research use; see source)
- radiotalk-us-audio-tada-noisy (Llama 3.2 Community License), inherited from v1
To redistribute or deploy:
- Include a copy of the Llama 3.2 Community License.
- Display "Built with Llama" in your product, user interface, or about page.
- Provide CC-BY-4.0 attribution to CMU AirLab (TartanAviation), NVIDIA (Parakeet), and the ATCO2 authors.
- Comply with the Llama 3.2 Acceptable Use Policy.
- If your service exceeds 700M monthly active users, request a separate commercial license from Meta.
This is not legal advice. If you are deploying this model commercially or at scale, consult a lawyer regarding the interaction of the upstream licenses.