yuriyvnv
posted an update Apr 16
🎙️ Parakeet-TDT Fine-Tuning: 4 New ASR Models

Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian. These are among the first community fine-tunes of this architecture for these languages.

📊 Results on Common Voice 17 test sets (WER before → after fine-tuning, relative change in parentheses):

🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)
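The figures in parentheses are relative changes: the difference in WER divided by the base model's WER. A minimal sketch of that arithmetic, using the numbers listed above (the helper name `relative_reduction` is illustrative only):

```python
# WER before and after fine-tuning, copied from the results above (in %).
results = {
    "Slovenian": (50.49, 11.56),
    "Portuguese": (15.86, 10.71),
    "Estonian": (27.15, 21.03),
    "Dutch": (5.99, 5.33),
}


def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative WER change in percent (negative means improvement)."""
    return (finetuned - baseline) / baseline * 100.0


for language, (baseline, finetuned) in results.items():
    change = relative_reduction(baseline, finetuned)
    absolute = finetuned - baseline
    print(f"{language}: {change:+.0f}% relative, {absolute:+.2f} points absolute")
```

Printing the absolute difference next to the relative one helps put the results in context: for Dutch, the relative change of about 11% corresponds to only 0.66 WER points, because the base model was already strong.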

All models output cased text with punctuation.

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)
output = model.transcribe(["audio.wav"])
print(output[0].text)

🔗 Models:
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian

🏗️ Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.
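The cosine-annealing part of that setup can be sketched in a few lines. This is an illustration, not the code used to train these models. It assumes linear warmup followed by cosine decay, with the peak learning rate (5e-5) and warmup share (10%) taken from the recipe the author gives in the replies. The function name, the `min_lr` floor of zero, and the step count are hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-5,
               warmup_ratio: float = 0.10, min_lr: float = 0.0) -> float:
    """Learning rate with linear warmup followed by cosine annealing."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


total = 1000  # hypothetical number of optimizer steps
for s in (0, 50, 99, 100, 550, 999):
    print(f"step {s:4d}: lr = {lr_at_step(s, total):.2e}")
```

In PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio `lr / peak_lr` instead of the absolute value, since `LambdaLR` expects a multiplicative factor.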

@hf-audio @NVIDIADev

#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice

Amazing, impressive work!

I'm interested in fine-tuning Parakeet for low-resource African languages. I could not find an NVIDIA fine-tuning guide yet. Would you be open to sharing any pointers on your training setup, workflow, or resources you found helpful?

Thanks! I just made the repo public: github.com/yuriyvnv/TTS-Augmented-ASR

This is the codebase behind a paper I wrote on Estonian and Slovenian, so you'll find the full pipeline there: not just the Parakeet fine-tuning scripts, but also the synthetic data generation (LLM text diversification + OpenAI TTS synthesis) that powers the augmentation. Everything was trained on a single NVIDIA H100.

A few things worth knowing for African languages:

Parakeet v3 is only pretrained on 25 languages, so you'd be doing cross-lingual transfer to a language the base model has never seen. The base won't recognize the language zero-shot, but fine-tuning still works. Just expect a much rougher starting point than what you saw in my models.
Always evaluate zero-shot first. I had one language (Polish) where fine-tuning actually made things worse, possibly because of domain mismatch or a learning rate that was too low (I'm still analyzing why this happened).
The standard recipe worked across everything I tried: AdamW, lr=5e-5, cosine annealing, 10% warmup, bf16, batch size 32-64, early stopping on val_wer. In my experience, larger batch sizes train better, especially for Parakeet models, and the model is compact enough to make them practical.
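The advice to evaluate zero-shot first amounts to scoring the base model and the fine-tuned model on the same test set and comparing WER. A minimal, dependency-free sketch follows. The transcripts are made-up placeholders, and `edit_distance` and `word_error_rate` are illustrative helpers, not functions from the repo. Text normalization (casing, punctuation) is left out and would need to match whatever convention your benchmark uses.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


# Placeholder data: in practice the references come from the test set and the
# outputs from model.transcribe(...) on the base and fine-tuned checkpoints.
references = ["the cat sat on the mat", "hello world"]
base_outputs = ["the cat sat on mat", "hello word"]
finetuned_outputs = ["the cat sat on the mat", "hello world again"]

base_wer = word_error_rate(references, base_outputs)
finetuned_wer = word_error_rate(references, finetuned_outputs)
print(f"zero-shot WER:  {base_wer:.2%}")
print(f"fine-tuned WER: {finetuned_wer:.2%}")
if finetuned_wer >= base_wer:
    print("Fine-tuning did not help on this test set; check data and learning rate.")
```

Running the same comparison before any training gives the baseline that later checkpoints have to beat, which is how a regression like the Polish case would show up.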
Happy to help if you hit anything weird.