Parakeet-CTC-0.6B Unified Vietnamese–English CS
Figure 1: ASR WER comparison across different models and CSPs. These figures do not include punctuation and capitalization errors. Details about each test set are given below.
Description:
This model [1] is trained on public automatic speech recognition (ASR) datasets totaling more than 2,000 hours of Vietnamese (vi) speech. Punctuation and capitalization (PnC) for the training transcripts were generated with Qwen3. The n-gram language model is trained on Vietnamese Wikipedia and further enhanced with a Vietnamese dictionary and lists of proper names and place names.
This model is ready for commercial/non-commercial use.
Deployment Geography:
Global
Use Case:
This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.
Model Architecture:
Architecture Type: Parakeet-CTC (also known as FastConformer-CTC) [1], [2] which is an optimized version of Conformer model [3] with 8x depthwise-separable convolutional downsampling with CTC loss.
Network Architecture: Parakeet-CTC-0.6B
This model was developed based on FastConformer encoder architecture.
Number of model parameters: 600 million model parameters.
Input(s):
Input Type(s): Audio
Input Format(s): .wav, .mp3, .flac, .ogg, .m4a
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The maximum audio length (in seconds) depends on available GPU memory; no pre-processing is needed; mono-channel audio is required.
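Since mono-channel audio is required, stereo recordings should be down-mixed before transcription. A minimal sketch using only the standard library (the `to_mono` helper is illustrative, not part of NeMo):

```python
def to_mono(interleaved, channels=2):
    """Down-mix interleaved multi-channel PCM samples to mono by averaging.

    interleaved: flat list of integer samples [L0, R0, L1, R1, ...]
    """
    frames = [interleaved[i:i + channels] for i in range(0, len(interleaved), channels)]
    return [sum(f) // channels for f in frames]

stereo = [100, 300, -50, 50, 0, 0]   # three stereo frames
print(to_mono(stereo))               # [200, 0, 0]
```

In practice you would read the samples with the standard-library `wave` module (or a library such as soundfile), down-mix, and write the mono result back out before passing the path to `transcribe()`.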
Output(s)
Output Type(s): Text
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: There is no maximum character length; special characters are not handled.
How to use this model:
To train, fine-tune, or play with the model you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install -U "nemo_toolkit[asr]"
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-0.6b-Vietnamese")
Transcribing using Python
Simply do:
output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
Transcribing with timestamps
To transcribe with timestamps:
output = asr_model.transcribe(['path_to_audios'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps
for stamp in segment_timestamps:
print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
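The segment-level timestamps above map directly onto subtitle generation, one of the use cases listed earlier. A minimal sketch that renders the `segment_timestamps` list as SRT-formatted text (the helper names are illustrative):

```python
def srt_time(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    # segments: list of dicts with 'start', 'end' (seconds) and 'segment' (text),
    # as returned in output[0].timestamp['segment']
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['segment']}\n")
    return "\n".join(blocks)

print(to_srt([{'start': 0.0, 'end': 1.5, 'segment': 'xin chào'}]))
```

Writing the returned string to a `.srt` file yields subtitles playable in common video players.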
Inference with n-gram language model
You can train and fine-tune an n-gram language model with KenLM and NeMo by following this tutorial.
To enhance accuracy using an n-gram language model:
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig()
decoding_cfg.strategy = "flashlight"
decoding_cfg.beam.search_type = "flashlight"
decoding_cfg.beam.kenlm_path = 'path_to_model'
decoding_cfg.beam.flashlight_cfg.lexicon_path = 'path_to_lexicon'
decoding_cfg.beam.beam_size = 64
decoding_cfg.beam.beam_alpha = 0.3
decoding_cfg.beam.beam_beta = 0.5
decoding_cfg.beam.flashlight_cfg.beam_size_token = 32
decoding_cfg.beam.flashlight_cfg.beam_threshold = 20.0
asr_model.change_decoding_strategy(decoding_cfg)
output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
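For intuition about the two weights above: `beam_alpha` scales the n-gram LM log-probability and `beam_beta` adds a per-word bonus during beam search. Schematically (this is an illustrative scoring function, not NeMo's internal implementation):

```python
def beam_score(acoustic_logprob, lm_logprob, num_words, alpha=0.3, beta=0.5):
    # Higher alpha trusts the language model more; higher beta favors
    # hypotheses with more words, counteracting the LM's length penalty.
    return acoustic_logprob + alpha * lm_logprob + beta * num_words

# A hypothesis with a better LM score can overtake one with a better acoustic score:
a = beam_score(-10.0, -4.0, 3)   # -10.0 + 0.3 * -4.0 + 0.5 * 3 = -9.7
b = beam_score(-9.5, -8.0, 3)    # -9.5 + 0.3 * -8.0 + 0.5 * 3 = -10.4
print(a > b)  # True
```

Tuning `beam_alpha` and `beam_beta` on a held-out set is the usual way to pick values for a new domain.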
Software Integration:
Runtime Engine(s):
- NeMo - 2.6
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta
Preferred/Supported Operating System(s):
Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
Parakeet-CTC-0.6B-unified ASR Vietnamese_1.1
Training and Evaluation Datasets:
Total size: ~2,000 hours
Total number of datasets: 10
Training was conducted using this example script and CTC configuration.
The tokenizer was constructed from the training set transcripts using this script.
Training Dataset:
- 2000 hours for training, including:
- Common Voice Corpus 20.0
- VietMed-L (16h)
- LSVSC
- FLEURS
- InfoRe 2
- FOSD (FPT)
- VAIS
- VLSP 2020
- VSV-1100
- MS-SNSD
FLEURS transcriptions preserve their original punctuation and capitalization. For the remaining datasets, punctuation and capitalization were generated with the Qwen3 model.
Data Modality
- Other: Speech
Audio Training Data Size (If Applicable)
- Less than 10,000 Hours
Data Collection Method by dataset
- Hybrid: Automated, Human
Labeling Method by dataset
- Hybrid: Synthetic, Human
Evaluation Dataset:
- Gigaspeech 2
- VLSP 2021 Task 2
- ViMD
- VIVOS
- Common Voice Corpus 20.0
- FOSD
- LSVSC
- FLEURS
- VLSP 2021 Task 1
- VietMed
Data Collection Method by dataset:
- Human
Labeling Method by dataset:
- Human
Performance:
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
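WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Since the reported numbers exclude punctuation and capitalization errors, both strings are normalized before scoring. A minimal sketch (the helper names are illustrative; toolkits such as NeMo ship their own WER implementations):

```python
import string

def normalize(text):
    # Drop punctuation and case, matching the PnC-free scoring described above.
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Xin chào Việt Nam", "xin chao Việt Nam"))  # 0.25 (one substitution)
```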
Blind test set
| AVG WER | Gigaspeech2 | VLSP 2021 Task 2 | ViMD | VIVOS |
|---|---|---|---|---|
| 9.30 | 11.23 | 8.99 | 11.02 | 5.96 |
In-domain test set
| AVG WER | MCV-Vi-20 | FOSD | LSVSC | FLEURS | VLSP 2021 Task 1 | VietMed |
|---|---|---|---|---|---|---|
| 9.73 | 8.58 | 9.67 | 5.15 | 6.86 | 13.60 | 14.52 |
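The AVG WER columns in both tables are the unweighted means of the per-dataset scores, which can be verified directly:

```python
blind = {"Gigaspeech2": 11.23, "VLSP 2021 Task 2": 8.99, "ViMD": 11.02, "VIVOS": 5.96}
in_domain = {"MCV-Vi-20": 8.58, "FOSD": 9.67, "LSVSC": 5.15,
             "FLEURS": 6.86, "VLSP 2021 Task 1": 13.60, "VietMed": 14.52}

for name, scores in [("blind", blind), ("in-domain", in_domain)]:
    avg = sum(scores.values()) / len(scores)
    print(f"{name} AVG WER: {avg:.2f}")
# blind AVG WER: 9.30
# in-domain AVG WER: 9.73
```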
The VietMed benchmark test set has been corrected for transcript-audio misalignment.
Inference:
Acceleration Engine: NVIDIA NeMo
Test Hardware:
- NVIDIA A10
- NVIDIA A100
- NVIDIA A30
- NVIDIA H100
- NVIDIA Jetson Orin
- NVIDIA L4
- NVIDIA L40
- NVIDIA Turing T4
- NVIDIA Volta V100
- NVIDIA Blackwell GPU
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
Discover more from NVIDIA:
For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at developer.nvidia.com.
Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.
Explore more from NVIDIA:
What is Nemotron?
NVIDIA Developer Nemotron
NVIDIA Riva Speech
NeMo Documentation
References:
[1] https://arxiv.org/abs/2305.05084
[2] https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html
[3] https://arxiv.org/abs/2005.08100