Parakeet-CTC-0.6B Unified Vietnamese–English CS
Figure 1: ASR WER comparison across different models and CSPs. These figures do not include punctuation and capitalization errors. Details about each test set are given below.
Description:
This model [1] is trained on public automatic speech recognition (ASR) datasets totaling more than 2,000 hours of Vietnamese (vi) speech. Punctuation and capitalization (PnC) for the training transcripts were generated with Qwen3. The n-gram language model is trained on Vietnamese Wikipedia and further enhanced with a Vietnamese dictionary and lists of proper names and place names.
This model is ready for commercial/non-commercial use.
Deployment Geography:
Global
Use Case:
This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.
Model Architecture:
Architecture Type: Parakeet-CTC (also known as FastConformer-CTC) [1], [2] which is an optimized version of Conformer model [3] with 8x depthwise-separable convolutional downsampling with CTC loss.
Network Architecture: Parakeet-CTC-0.6B
This model was developed based on FastConformer encoder architecture.
Number of model parameters: 600 million model parameters.
Input(s):
Input Type(s): Audio
Input Format(s): .wav, .mp3, .flac, .ogg, .m4a
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The maximum audio length (in seconds) depends on available GPU memory; no pre-processing is needed; mono-channel audio is required.
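Since mono-channel audio is required, stereo recordings should be down-mixed before transcription. A minimal sketch using only the standard library (the `to_mono` helper is illustrative, not part of NeMo):

```python
def to_mono(interleaved, channels=2):
    """Down-mix interleaved multi-channel PCM samples to mono by averaging.

    interleaved: flat list of integer samples [L0, R0, L1, R1, ...]
    """
    frames = [interleaved[i:i + channels] for i in range(0, len(interleaved), channels)]
    return [sum(f) // channels for f in frames]

stereo = [100, 300, -50, 50, 0, 0]   # three stereo frames
print(to_mono(stereo))               # [200, 0, 0]
```

In practice you would read the samples with the standard-library `wave` module (or a library such as soundfile), down-mix, and write the mono result back out before passing the path to `transcribe()`.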
Output(s)
Output Type(s): Text
Output Format(s): String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: There is no maximum character length; special characters are not handled.
How to use this model:
To train, fine-tune, or play with the model you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install -U "nemo_toolkit[asr]"
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-0.6b-Vietnamese")
Transcribing using Python
Simply do:
output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
Transcribing with timestamps
To transcribe with timestamps:
output = asr_model.transcribe(['path_to_audios'], timestamps=True)
# by default, timestamps are enabled for char, word and segment level
word_timestamps = output[0].timestamp['word'] # word level timestamps for first sample
segment_timestamps = output[0].timestamp['segment'] # segment level timestamps
char_timestamps = output[0].timestamp['char'] # char level timestamps
for stamp in segment_timestamps:
print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}")
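The segment-level timestamps above map directly onto subtitle generation, one of the use cases listed earlier. A minimal sketch that renders the `segment_timestamps` list as SRT-formatted text (the helper names are illustrative):

```python
def srt_time(seconds):
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    # segments: list of dicts with 'start', 'end' (seconds) and 'segment' (text),
    # as returned in output[0].timestamp['segment']
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['segment']}\n")
    return "\n".join(blocks)

print(to_srt([{'start': 0.0, 'end': 1.5, 'segment': 'xin chào'}]))
```

Writing the returned string to a `.srt` file yields subtitles playable in common video players.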
Inference with n-gram language model
You can train and fine-tune an n-gram language model with KenLM and NeMo by following this tutorial.
To enhance accuracy using an n-gram language model:
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecodingConfig

decoding_cfg = CTCDecodingConfig()
decoding_cfg.strategy = "flashlight"
decoding_cfg.beam.search_type = "flashlight"
decoding_cfg.beam.kenlm_path = 'path_to_model'
decoding_cfg.beam.flashlight_cfg.lexicon_path = 'path_to_lexicon'
decoding_cfg.beam.beam_size = 64
decoding_cfg.beam.beam_alpha = 0.3
decoding_cfg.beam.beam_beta = 0.5
decoding_cfg.beam.flashlight_cfg.beam_size_token = 32
decoding_cfg.beam.flashlight_cfg.beam_threshold = 20.0
asr_model.change_decoding_strategy(decoding_cfg)
output = asr_model.transcribe(['path_to_audios'])
print(output[0].text)
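For intuition about the two weights above: `beam_alpha` scales the n-gram LM log-probability and `beam_beta` adds a per-word bonus during beam search. Schematically (this is an illustrative scoring function, not NeMo's internal implementation):

```python
def beam_score(acoustic_logprob, lm_logprob, num_words, alpha=0.3, beta=0.5):
    # Higher alpha trusts the language model more; higher beta favors
    # hypotheses with more words, counteracting the LM's length penalty.
    return acoustic_logprob + alpha * lm_logprob + beta * num_words

# A hypothesis with a better LM score can overtake one with a better acoustic score:
a = beam_score(-10.0, -4.0, 3)   # -10.0 + 0.3 * -4.0 + 0.5 * 3 = -9.7
b = beam_score(-9.5, -8.0, 3)    # -9.5 + 0.3 * -8.0 + 0.5 * 3 = -10.4
print(a > b)  # True
```

Tuning `beam_alpha` and `beam_beta` on a held-out set is the usual way to pick values for a new domain.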
Software Integration:
Runtime Engine(s):
- NeMo - 2.6
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta
Preferred/Supported Operating System(s):
Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
Parakeet-CTC-0.6B-unified ASR Vietnamese_1.1
Training and Evaluation Datasets:
Total size: ~2,000 hours
Total number of datasets: 10
Training was conducted using this example script and CTC configuration.
The tokenizer was constructed from the training set transcripts using this script.
Training Dataset:
- 2000 hours for training, including:
- Common Voice Corpus 20.0
- VietMed-L (16h)
- LSVSC
- FLEURS
- InfoRe 2
- FOSD (FPT)
- VAIS
- VLSP 2020
- VSV-1100
- MS-SNSD
FLEURS transcriptions preserve their original punctuation and capitalization. For the remaining datasets, punctuation and capitalization were generated with the Qwen3 model.
Data Modality
- Other: Speech
Audio Training Data Size (If Applicable)
- Less than 10,000 Hours
Data Collection Method by dataset
- Hybrid: Automated, Human
Labeling Method by dataset
- Hybrid: Synthetic, Human
Evaluation Dataset:
- Gigaspeech 2
- VLSP 2021 Task 2
- ViMD
- VIVOS
- Common Voice Corpus 20.0
- FOSD
- LSVSC
- FLEURS
- VLSP 2021 Task 1
- VietMed
Data Collection Method by dataset:
- Human
Labeling Method by dataset:
- Human
Performance:
The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Given that this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
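WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. Since the reported numbers exclude punctuation and capitalization errors, both strings are normalized before scoring. A minimal sketch (the helper names are illustrative; toolkits such as NeMo ship their own WER implementations):

```python
import string

def normalize(text):
    # Drop punctuation and case, matching the PnC-free scoring described above.
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

def wer(reference, hypothesis):
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Xin chào Việt Nam", "xin chao Việt Nam"))  # 0.25 (one substitution)
```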
Blind test set
| AVG WER | Gigaspeech2 | VLSP 2021 Task 2 | ViMD | VIVOS |
|---|---|---|---|---|
| 9.30 | 11.23 | 8.99 | 11.02 | 5.96 |
In-domain test set
| AVG WER | MCV-Vi-20 | FOSD | LSVSC | FLEURS | VLSP 2021 Task 1 | VietMed |
|---|---|---|---|---|---|---|
| 9.73 | 8.58 | 9.67 | 5.15 | 6.86 | 13.60 | 14.52 |
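The AVG WER columns in both tables are the unweighted means of the per-dataset scores, which can be verified directly:

```python
blind = {"Gigaspeech2": 11.23, "VLSP 2021 Task 2": 8.99, "ViMD": 11.02, "VIVOS": 5.96}
in_domain = {"MCV-Vi-20": 8.58, "FOSD": 9.67, "LSVSC": 5.15,
             "FLEURS": 6.86, "VLSP 2021 Task 1": 13.60, "VietMed": 14.52}

for name, scores in [("blind", blind), ("in-domain", in_domain)]:
    avg = sum(scores.values()) / len(scores)
    print(f"{name} AVG WER: {avg:.2f}")
# blind AVG WER: 9.30
# in-domain AVG WER: 9.73
```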
The VietMed benchmark test set has been corrected for transcript-audio misalignment.
Inference:
Acceleration Engine: NVIDIA NeMo
Test Hardware:
- NVIDIA A10
- NVIDIA A100
- NVIDIA A30
- NVIDIA H100
- NVIDIA Jetson Orin
- NVIDIA L4
- NVIDIA L40
- NVIDIA Turing T4
- NVIDIA Volta V100
- NVIDIA Blackwell GPU
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
Discover more from NVIDIA:
For documentation, deployment guides, enterprise-ready APIs, and the latest open models—including Nemotron and other cutting-edge speech, translation, and generative AI—visit the NVIDIA Developer Portal at developer.nvidia.com.
Join the community to access tools, support, and resources to accelerate your development with NVIDIA’s NeMo, Riva, NIM, and foundation models.
Explore more from NVIDIA:
What is Nemotron?
NVIDIA Developer Nemotron
NVIDIA Riva Speech
NeMo Documentation
References:
[1] https://arxiv.org/abs/2305.05084
[2] https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html
[3] https://arxiv.org/abs/2005.08100