
# Comparison Tables: Vietnamese ASR

## Table 1: Model Architecture Comparison

| Model | Year | Architecture | Base Model | Params | Training Data | Training Method |
|---|---|---|---|---|---|---|
| PhoWhisper-large | 2024 | Encoder-Decoder | Whisper large-v3 | ~1.5B | 844h Vietnamese | Full fine-tuning |
| PhoWhisper-base | 2024 | Encoder-Decoder | Whisper base | ~74M | 844h Vietnamese | Full fine-tuning |
| VietASR | 2025 | Self-supervised | Custom | - | 50h labeled + large unlabeled | Pre-train + fine-tune |
| Conformer-CTC (NeMo) | 2024 | Conformer-CTC | From scratch | 121M | 10k+ hours | Supervised |
| wav2vec2-vi-vlsp2020 | 2020 | wav2vec2 + CTC | wav2vec2-base | ~95M | VLSP 2020 | Fine-tuning |
| wav2vec2-viet-250h | - | wav2vec2 + CTC | wav2vec2-base | ~95M | 250h Vietnamese | Fine-tuning |
| XLSR-53-Viet | 2024 | XLSR-53 + CTC | XLSR-53 | ~315M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| w2v2-Viet | 2024 | wav2vec2 + CTC | wav2vec2-base | ~95M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| LoRA-Whisper (Phung) | 2024 | Encoder-Decoder | Whisper small/base/tiny | Varies | Military + general Vi | LoRA fine-tuning |
| Fast Conformer Vi | 2024 | Conformer CTC+RNNT | Fast Conformer | - | Vietnamese | CTC + RNNT |
| Moonshine (Vi) | 2025 | Custom small | Custom | Tiny | Vietnamese | Specialized training |
| Whisper large-v3 | 2023 | Encoder-Decoder | - | ~1.5B | 680k hours multilingual | Weakly supervised |

## Table 2: PhoWhisper Full Benchmark (ICLR 2024)

| Model | Params | Common Voice Vi WER | VIVOS WER | VLSP T1 WER | VLSP T2 WER |
|---|---|---|---|---|---|
| PhoWhisper-tiny | 39M | 19.05% | 10.41% | 20.74% | 49.85% |
| PhoWhisper-base | 74M | 16.19% | 8.46% | 19.70% | 43.01% |
| PhoWhisper-small | 244M | 11.08% | 6.33% | 15.93% | 32.96% |
| PhoWhisper-medium | 769M | 8.27% | 4.97% | 14.12% | 26.85% |
| PhoWhisper-large | 1.55B | 8.14% | 4.67% | 13.75% | 26.68% |
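The scores in this and the following tables are word error rates (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch of the metric itself, assuming whitespace-tokenized, already-normalized text (the helper name is illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length.

    Assumes a non-empty, whitespace-tokenizable reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling single-row DP table: d[j] = edit distance between the
    # first i reference words and the first j hypothesis words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(
                d[j] + 1,                               # deletion
                d[j - 1] + 1,                           # insertion
                prev + (ref[i - 1] != hyp[j - 1]),      # substitution / match
            )
            prev = cur
    return d[len(hyp)] / len(ref)
```

Note that published numbers also depend on the normalization applied before scoring (casing, punctuation, number formats), which is why the same model can report slightly different WER across papers.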

## Table 3: Cross-Model Benchmark Comparison

### VIVOS (Read Speech, ~15h)

| Model | WER (%) | CER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 4.67 | - | SOTA |
| PhoWhisper-small | 6.33 | - | |
| wav2vec2-viet-250h (no LM) | 10.77 | - | Fine-tuned |
| wav2vec2-viet-250h (+ 4-gram LM) | 6.15 | - | LM decoding |
| Conformer-CTC + LM | 9.15 | 10.2 | With n-gram LM |
| Conformer-CTC (no LM) | 10.71 | 12.21 | Without LM |
| PhoWhisper-tiny | 10.41 | - | Smallest Whisper |
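The gap between the `no LM` and `+ 4-gram LM` rows comes from decode-time fusion with an external language model: candidate transcripts are rescored by combining the acoustic score with an n-gram LM score and a length bonus. A toy n-best rescoring sketch of the idea (the `alpha`/`beta` weights and the two-entry toy LM are illustrative, not the values used in the cited systems):

```python
def rescore(nbest, lm_logprob, alpha=0.5, beta=0.1):
    """Pick the n-best hypothesis maximizing:
    acoustic log-prob + alpha * LM log-prob + beta * word count.
    The length bonus offsets the LM's bias toward short outputs."""
    def score(hyp, am_logprob):
        words = hyp.split()
        return am_logprob + alpha * lm_logprob(words) + beta * len(words)
    return max(nbest, key=lambda pair: score(*pair))[0]

# Toy "4-gram LM": a lookup of log-probs for known word sequences.
corpus = {("xin", "chao"): -1.0, ("xin", "trao"): -8.0}
def toy_lm(words):
    return corpus.get(tuple(words), -10.0)

nbest = [
    ("xin trao", -2.0),  # slightly better acoustic score
    ("xin chao", -2.5),  # much better LM score
]
best = rescore(nbest, toy_lm)  # the LM flips the decision to "xin chao"
```

In real CTC decoding the fusion happens inside beam search (e.g. with a KenLM n-gram model) rather than as a separate rescoring pass, but the scoring formula is the same shape.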

### Common Voice Vietnamese

| Model | WER (%) | Notes |
|---|---|---|
| PhoWhisper-large | 8.14 | SOTA |
| VietASR (68M, iter3) | 11.46 | 22x fewer params than Whisper-large |
| wav2vec2-viet-250h (+ 4-gram LM) | 11.52 | With LM decoding |
| PhoWhisper-small | 11.08 | |
| wav2vec2-viet-250h (no LM) | 18.34 | |
| PhoWhisper-tiny | 19.05 | |

### VLSP 2020

| Model | Task-1 WER (%) | Task-2 WER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 13.75 | 26.68 | Best on Task 2 |
| PhoWhisper-medium | 14.12 | 26.85 | |
| wav2vec2-viet-250h (+ LM) | 9.11 | 40.81 | Strong T1 with LM |
| wav2vec2-viet-250h (no LM) | 13.33 | 51.45 | |
| PhoWhisper-tiny | 20.74 | 49.85 | |

### GigaSpeech 2 (Real-world)

| Model | WER (%) | Params | Notes |
|---|---|---|---|
| VietASR (iter3) | 7.68 | 68M | SOTA; 22x fewer params |
| Azure Speech CLI | ~11.78 avg | - | Commercial |
| Whisper Large-v3 | ~16.44 avg | 1,542M | Zero-shot |

### ViMD (Multi-Dialect, 102.56h)

| Model | Northern WER (%) | Central WER (%) | Southern WER (%) | All WER (%) |
|---|---|---|---|---|
| wav2vec2-base-vi-vlsp2020 (FT) | 12.17 | 15.3 | 14.2 | 12.24 |
| wav2vec2-base-vietnamese-250h (FT) | 13.5 | 17.15 | 14.8 | 13.1 |
| PhoWhisper-base (FT) | 13.2 | 16.4 | 13.54 | 13.0 |
| wav2vec2-base-vietnamese-160h (no FT) | - | - | - | 31.74 |
| Whisper-base (no FT) | - | - | - | 31.38 |

### VietMed (Medical Domain)

| Model | WER (%) | Notes |
|---|---|---|
| XLSR-53 (baseline) | 51.8 | Standard XLSR-53 |
| XLSR-53-Viet | 29.6 | Pre-trained on Vietnamese |
| w2v2-Viet | ~35-40 | Estimated |

## Table 4: LoRA vs Full Fine-tuning Comparison

| Approach | Model | Trainable Params | WER Improvement | Notes |
|---|---|---|---|---|
| Full fine-tuning | PhoWhisper-large | 100% (~1.5B) | SOTA | Best performance |
| LoRA | Whisper small (Phung et al.) | ~5-10% | 20% WER reduction | Domain-specific |
| LoRA-Whisper | Whisper (multilingual) | 5% | Near-monolingual parity | Language-specific LoRA |
| LoRA (ASR-1) | Whisper large-v3 | ~5-10% | TBD | Current project |
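The trainable-parameter fractions above depend on the LoRA rank and on which weight matrices receive adapters. A back-of-the-envelope sketch for Whisper large-v3 (width 1280, 32 encoder and 32 decoder blocks; the choice of rank 16 and of adapting only the q/k/v/o attention projections is illustrative): this configuration lands near 1% trainable, and larger ranks or adapting the MLP blocks push toward the ~5-10% figures reported above.

```python
def lora_params(d_model: int, n_layers: int, rank: int, mats_per_layer: int) -> int:
    # Each adapted d_model x d_model matrix W gets two low-rank factors,
    # A (d_model x rank) and B (rank x d_model): 2 * d_model * rank weights.
    return n_layers * mats_per_layer * 2 * d_model * rank

# Whisper large-v3: 32 encoder blocks (4 attention projections each) and
# 32 decoder blocks (8 projections: self- plus cross-attention q/k/v/o).
encoder = lora_params(d_model=1280, n_layers=32, rank=16, mats_per_layer=4)
decoder = lora_params(d_model=1280, n_layers=32, rank=16, mats_per_layer=8)
trainable = encoder + decoder           # adapter weights only
fraction = trainable / 1_550_000_000    # vs ~1.55B total parameters, ~1%
```

Doubling the rank doubles the adapter size, so moving between the 5% and ~5-10% configurations in the table is mostly a matter of rank and target-module choice.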

## Table 5: Dataset Comparison

| Dataset | Total Hours | Labeled | Speakers | Domain | Dialects | Year | Venue |
|---|---|---|---|---|---|---|---|
| VIVOS | 15 | 15h | 65 | Read speech | Limited | 2016 | - |
| Common Voice vi | ~100+ | All | Many | Read speech | Mixed | Ongoing | Mozilla |
| VLSP 2020 T1 | ~100 | All | Many | Broadcast | Mixed | 2020 | VLSP |
| VLSP 2020 T2 | ~100 | All | Many | Conversational | Mixed | 2020 | VLSP |
| VietMed | 2216 | 16h | Many | Medical | All accents | 2024 | LREC-COLING |
| ViMD | 102.56 | All | Many | Mixed | 63 provinces | 2024 | EMNLP |
| GigaSpeech 2 | Large | Mixed | Many | Multi-domain | Mixed | 2024 | arXiv |
| VoxVietnam | Large | All | Many | Multi-genre | Mixed | 2024 | arXiv |
| VietLyrics | - | All | - | Music/lyrics | - | 2025 | arXiv |

## Table 6: Approach Evolution Timeline

| Period | Dominant Approach | Key Models | Typical WER |
|---|---|---|---|
| 2015-2018 | GMM-HMM / DNN-HMM | Kaldi-based systems | 20-30%+ |
| 2019-2020 | E2E + Language Models | VAIS ASR, VLSP entries | 15-25% |
| 2020-2022 | Self-supervised (wav2vec2) | wav2vec2-vi, XLSR fine-tuned | 10-20% |
| 2023-2024 | Whisper fine-tuning | PhoWhisper, LoRA-Whisper | 6-15% |
| 2024-2025 | Large-scale pre-training + minimal labels | VietASR, Conformer-CTC | 5-10% |
| 2025+ | Specialized (CS, AVSR, edge) | TSPC, ViCocktail, Moonshine | TBD |