Parakeet-CTC-0.6B Unified Vietnamese–English CS

Figure 1: Word Error Rate (WER) comparison across different models and CSPs on multiple Vietnamese test sets. WER does not include punctuation and capitalization errors. Details about each test set are given below.

Description:

This model [1] is trained on public automatic speech recognition (ASR) datasets totaling more than 2,000 hours of Vietnamese (vi) speech. Qwen3 was used to generate punctuation and capitalization (PnC) for the training transcripts. The accompanying n-gram language model is trained on Vietnamese Wikipedia text and further enhanced with a Vietnamese dictionary and lists of proper names and place names.
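
As an illustration of the PnC step, below is a minimal sketch of prompting an LLM to restore punctuation and capitalization in a raw transcript. The checkpoint ID, prompt wording, and generation settings are illustrative assumptions, not the exact pipeline used for training.

from transformers import pipeline

# Hypothetical PnC restoration via Hugging Face transformers; the checkpoint
# "Qwen/Qwen3-8B" and the prompt are assumptions for illustration only.
pnc = pipeline("text-generation", model="Qwen/Qwen3-8B")

transcript = "xin chào các bạn hôm nay chúng ta học tiếng việt"
prompt = (
    "Restore punctuation and capitalization in the following Vietnamese "
    "transcript without changing any words:\n" + transcript
)
print(pnc(prompt, max_new_tokens=64)[0]["generated_text"])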

This model is ready for commercial/non-commercial use.

Deployment Geography:

Global

Use Case:

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

Model Architecture:

Architecture Type: Parakeet-CTC (also known as FastConformer-CTC) [1], [2], an optimized version of the Conformer model [3] with 8x depthwise-separable convolutional downsampling, trained with CTC loss.

Network Architecture: Parakeet-CTC-0.6B

This model was developed based on the FastConformer encoder architecture.

Number of model parameters: 600 million.
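
As a quick sanity check of these figures, you can inspect the loaded checkpoint directly; a minimal sketch, assuming the config exposes the standard FastConformer subsampling_factor key:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-0.6b-Vietnamese")
print(asr_model.cfg.encoder.subsampling_factor)              # expected: 8 (8x downsampling)
print(sum(p.numel() for p in asr_model.parameters()) / 1e6)  # expected: ~600 (million parameters)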

Input(s):

Input Type(s): Audio

Input Format(s): .wav, .mp3, .flac, .ogg, .m4a

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: The maximum audio length (in seconds) depends on available GPU memory; no pre-processing is needed; mono-channel audio is required.
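
Since mono audio is required, multi-channel recordings should be downmixed first. A minimal sketch using soundfile (the library choice and file names are assumptions; any audio tool works):

import soundfile as sf

audio, sr = sf.read("input.wav")   # hypothetical multi-channel input
if audio.ndim > 1:                 # more than one channel (e.g. stereo)
    audio = audio.mean(axis=1)     # downmix to mono
sf.write("input_mono.wav", audio, sr)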

Output(s):

Output Type(s): Text

Output Format(s): String

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: There is no maximum character length; the model does not output special characters.

How to use this model:

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you've installed the latest PyTorch version.

pip install -U nemo_toolkit['asr']

The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-0.6b-Vietnamese")
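
Alternatively, if you have downloaded the .nemo checkpoint locally, it can be restored from disk (the file name below is a placeholder):

asr_model = nemo_asr.models.ASRModel.restore_from("parakeet-ctc-0.6b-vietnamese.nemo")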

Transcribing using Python

Simply do:

output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
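
transcribe() also accepts a list of files together with a batch size; a short sketch (file names are placeholders):

files = ["audio1.wav", "audio2.wav", "audio3.wav"]
outputs = asr_model.transcribe(files, batch_size=4)  # batch inference over all files
for path, out in zip(files, outputs):
    print(f"{path}: {out.text}")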

Transcribing with timestamps

To transcribe with timestamps:

output = asr_model.transcribe(['path_to_audios'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps

for stamp in segment_timestamps:
    print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")

Inference with n-gram language model

You can train and fine-tune an n-gram language model with KenLM and NeMo by following this tutorial.

To enhance accuracy using the n-gram language model:

from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig()
decoding_cfg.strategy = "flashlight"
decoding_cfg.beam.search_type = "flashlight"
decoding_cfg.beam.kenlm_path = 'path_to_model'                     # KenLM binary built for this model's tokenizer
decoding_cfg.beam.flashlight_cfg.lexicon_path = 'path_to_lexicon'  # lexicon mapping words to token sequences
decoding_cfg.beam.beam_size = 64
decoding_cfg.beam.beam_alpha = 0.3                                 # language model weight
decoding_cfg.beam.beam_beta = 0.5                                  # word insertion bonus
decoding_cfg.beam.flashlight_cfg.beam_size_token = 32
decoding_cfg.beam.flashlight_cfg.beam_threshold = 20.0
    
asr_model.change_decoding_strategy(decoding_cfg)

output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
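
To revert to the default greedy CTC decoding afterwards, pass a fresh config through the same API ("greedy_batch" is NeMo's default batched greedy strategy):

decoding_cfg = CTCDecodingConfig()
decoding_cfg.strategy = "greedy_batch"   # batched greedy decoding, no external LM
asr_model.change_decoding_strategy(decoding_cfg)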

Software Integration:

Runtime Engine(s):

  • NeMo - 2.6

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Volta

Preferred/Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

Parakeet-CTC-0.6B-unified ASR Vietnamese_1.1

Training and Evaluation Datasets:

Total size: ~2,000 hours
Total number of datasets: 10

Training was conducted using this example script and CTC configuration.

The tokenizer was constructed from the training set transcripts using this script.

Training Dataset:

  • 2000 hours for training, including:
    • Common Voice Corpus 20.0
    • VietMed-L (16h)
    • LSVSC
    • FLEURS
    • InfoRe 2
    • FOSD (FPT)
    • VAIS
    • VLSP 2020
    • VSV-1100
    • MS-SNSD

FLEURS transcriptions preserve the original punctuation and capitalization. For the remaining datasets, punctuation and capitalization were generated with the Qwen3 model.

Data Modality

  • Other: Speech

Audio Training Data Size (If Applicable)

  • Less than 10,000 Hours

Data Collection Method by dataset

  • Hybrid: Automated, Human

Labeling Method by dataset

  • Hybrid: Synthetic, Human

Evaluation Dataset:

  • Gigaspeech 2
  • VLSP 2021 Task 2
  • ViMD
  • VIVOS
  • Common Voice Corpus 20.0
  • FOSD
  • LSVSC
  • FLEURS
  • VLSP 2021 Task 1
  • VietMed

Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Human

Performance:

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
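
For reference, a minimal sketch of how WER is computed, using NeMo's built-in helper (the example strings are hypothetical):

from nemo.collections.asr.metrics.wer import word_error_rate

refs = ["xin chào các bạn"]   # ground-truth transcript
hyps = ["xin chào cá bạn"]    # hypothesis with one substituted word
print(word_error_rate(hypotheses=hyps, references=refs))  # 0.25 = 1 error / 4 words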

Blind test set (WER, %):

  Test set            WER
  Average             9.30
  Gigaspeech 2        11.23
  VLSP 2021 Task 2    8.99
  ViMD                11.02
  VIVOS               5.96

In-domain test set (WER, %):

  Test set            WER
  Average             9.73
  MCV-Vi-20           8.58
  FOSD                9.67
  LSVSC               5.15
  FLEURS              6.86
  VLSP 2021 Task 1    13.60
  VietMed             14.52

The VietMed benchmark test set has been corrected for transcript-audio misalignment.

Inference:

Acceleration Engine: NVIDIA NeMo

Test Hardware:

  • NVIDIA A10
  • NVIDIA A100
  • NVIDIA A30
  • NVIDIA H100
  • NVIDIA Jetson Orin
  • NVIDIA L4
  • NVIDIA L40
  • NVIDIA Turing T4
  • NVIDIA Volta V100
  • NVIDIA Blackwell GPU

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.

License/Terms of Use:

GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

Discover more from NVIDIA:

For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at developer.nvidia.com. Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.

Explore more from NVIDIA:

What is Nemotron?
NVIDIA Developer Nemotron
NVIDIA Riva Speech
NeMo Documentation

Reference(s):

[1] https://arxiv.org/abs/2305.05084
[2] https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html
[3] https://arxiv.org/abs/2005.08100
