@yuriyvnv on Hugging Face: "🎙️Parakeet-TDT Fine Tuning: 4 New ASR Models Four fine-tuned versions of…"

posted an update Apr 16

🎙️Parakeet-TDT Fine Tuning: 4 New ASR Models

Four fine-tuned versions of NVIDIA's Parakeet-TDT-0.6B-v3 for Dutch, Portuguese, Estonian, and Slovenian, among the first community fine-tunes of this architecture for these languages.

📊 Results on Common Voice 17 test sets:

🇸🇮 Slovenian: 50.49% → 11.56% WER (-77%)
🇵🇹 Portuguese: 15.86% → 10.71% WER (-32%)
🇪🇪 Estonian: 27.15% → 21.03% WER (-23%)
🇳🇱 Dutch: 5.99% → 5.33% WER (-11%)
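
The percentages in parentheses are relative WER changes, not absolute differences. A minimal sketch of that arithmetic, using only the figures listed above (the function name `relative_wer_change` is made up for illustration):

```python
# (baseline WER %, fine-tuned WER %) as reported in the post
results = {
    "Slovenian": (50.49, 11.56),
    "Portuguese": (15.86, 10.71),
    "Estonian": (27.15, 21.03),
    "Dutch": (5.99, 5.33),
}


def relative_wer_change(baseline: float, finetuned: float) -> float:
    """Relative change in percent; a negative value means the WER went down."""
    return (finetuned - baseline) / baseline * 100.0


for language, (baseline, finetuned) in results.items():
    change = relative_wer_change(baseline, finetuned)
    print(f"{language}: {baseline:.2f}% -> {finetuned:.2f}% WER ({change:+.0f}%)")
```

Slovenian, for example, drops by 38.93 absolute points, which is about 77% of its 50.49% baseline.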

All models output cased text with punctuation.

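import nemo.collections.asr as nemo_asr

# Download the fine-tuned checkpoint from the Hugging Face Hub
model = nemo_asr.models.ASRModel.from_pretrained(
    "yuriyvnv/parakeet-tdt-0.6b-dutch"
)
# transcribe() takes a list of audio file paths and returns one result per file
output = model.transcribe(["audio.wav"])
print(output[0].text)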

🔗 Models:
🇳🇱 yuriyvnv/parakeet-tdt-0.6b-dutch
🇵🇹 yuriyvnv/parakeet-tdt-0.6b-portuguese
🇪🇪 yuriyvnv/parakeet-tdt-0.6b-estonian
🇸🇮 yuriyvnv/parakeet-tdt-0.6b-slovenian

🏗️ Training: Common Voice 17 + synthetic speech (OpenAI TTS), filtered with WAVe (yuriyvnv/WAVe-1B-Multimodal-PT) for quality. AdamW + cosine annealing, bf16-mixed precision, early stopping on val WER. Timestamps and long-form audio supported.
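
For readers unfamiliar with the metric used for the results and for early stopping, here is a minimal sketch of word error rate as it is usually defined: word-level edit distance divided by the number of reference words. This is a plain-Python illustration, not the evaluation code behind the numbers above. The post does not say which WER implementation or text normalization was used, and because the models output cased, punctuated text, scores depend on whether casing and punctuation are normalized first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```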

@hf-audio @NVIDIADev

#asr #speech #parakeet #nvidia #nemo #multilingual #fine-tuning #commonvoice

manassehzw

Apr 18

Amazing, impressive work!

I’m interested in fine-tuning Parakeet for low-resource African languages. I could not find any NVIDIA fine-tuning guide yet. Would you be open to sharing any pointers on your training setup, workflow, or resources you found helpful?

yuriyvnv

Apr 18

Thanks! Just pushed the repo public: github.com/yuriyvnv/TTS-Augmented-ASR

This is the codebase behind a paper I wrote on Estonian and Slovenian, so you'll find the full pipeline there: not just the Parakeet fine-tuning scripts, but also the synthetic data generation (LLM text diversification + OpenAI TTS synthesis) that powers the augmentation. Everything was trained on a single NVIDIA H100.

A few things worth knowing for African languages:

Parakeet v3 is pretrained on only 25 languages, so you'd be doing cross-lingual transfer to a language the base model has never seen. The base won't recognize the language zero-shot, but fine-tuning still works. Just expect a much rougher starting point than what you saw in my models.
Always evaluate zero-shot first. I had one language (Polish) where fine-tuning actually made things worse, possibly because of domain mismatch or because the learning rate was too low (I'm still analyzing why this happened).
The standard recipe worked across everything I tried: AdamW, lr=5e-5, cosine annealing, 10% warmup, bf16, batch size 32-64, early stopping on val_wer. The larger the batch size, the better the gradient flow during training, especially for Parakeet models, since the model is compact.
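
To make the learning-rate part of that recipe concrete, here is a minimal, framework-free sketch of a schedule with 10% warmup followed by cosine annealing from a peak of 5e-5. Linear warmup, decay to a minimum of zero, and per-step updates are assumptions for illustration. The function name `lr_at_step` is hypothetical, and the actual scheduler in the repo may differ.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-5,       # lr from the recipe above
    warmup_ratio: float = 0.10,  # 10% warmup from the recipe above
    min_lr: float = 0.0,         # assumption: decay all the way to zero
) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Assumed linear warmup from near zero up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing from peak_lr down to min_lr over the remaining steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Inspect a few points of a hypothetical 1000-step run
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

With these assumptions, the rate climbs for the first 100 steps, peaks at 5e-5, passes through half the peak at the midpoint of the decay phase, and reaches the minimum at the final step.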
Happy to help if you hit anything weird.
