# Comparison Tables: Vietnamese ASR

## Table 1: Model Architecture Comparison
| Model | Year | Architecture | Base Model | Params | Training Data | Training Method |
|---|---|---|---|---|---|---|
| PhoWhisper-large | 2024 | Encoder-Decoder | Whisper large-v3 | ~1.5B | 844h Vietnamese | Full fine-tuning |
| PhoWhisper-base | 2024 | Encoder-Decoder | Whisper base | ~74M | 844h Vietnamese | Full fine-tuning |
| VietASR | 2025 | Self-supervised | Custom | - | 50h labeled + large unlabeled | Pre-train + fine-tune |
| Conformer-CTC (NeMo) | 2024 | Conformer-CTC | From scratch | 121M | 10k+ hours | Supervised |
| wav2vec2-vi-vlsp2020 | 2020 | wav2vec2 + CTC | wav2vec2-base | ~95M | VLSP 2020 | Fine-tuning |
| wav2vec2-viet-250h | - | wav2vec2 + CTC | wav2vec2-base | ~95M | 250h Vietnamese | Fine-tuning |
| XLSR-53-Viet | 2024 | XLSR-53 + CTC | XLSR-53 | ~315M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| w2v2-Viet | 2024 | wav2vec2 + CTC | wav2vec2-base | ~95M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| LoRA-Whisper (Phung) | 2024 | Encoder-Decoder | Whisper small/base/tiny | Varies | Military + general Vi | LoRA fine-tuning |
| Fast Conformer Vi | 2024 | Conformer CTC+RNNT | Fast Conformer | - | Vietnamese | CTC + RNNT |
| Moonshine (Vi) | 2025 | Custom small | Custom | Tiny | Vietnamese | Specialized training |
| Whisper large-v3 | 2023 | Encoder-Decoder | - | ~1.5B | 680k hours multilingual | Supervised (weak) |
## Table 2: PhoWhisper Full Benchmark (ICLR 2024)
| Model | Params | Common Voice Vi WER | VIVOS WER | VLSP T1 WER | VLSP T2 WER |
|---|---|---|---|---|---|
| PhoWhisper-tiny | 39M | 19.05% | 10.41% | 20.74% | 49.85% |
| PhoWhisper-base | 74M | 16.19% | 8.46% | 19.70% | 43.01% |
| PhoWhisper-small | 244M | 11.08% | 6.33% | 15.93% | 32.96% |
| PhoWhisper-medium | 769M | 8.27% | 4.97% | 14.12% | 26.85% |
| PhoWhisper-large | 1.55B | 8.14% | 4.67% | 13.75% | 26.68% |
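The WER figures above are word-level edit distances (substitutions + deletions + insertions, divided by the number of reference words). Since written Vietnamese separates every syllable with a space, word-level WER effectively operates at the syllable level. A minimal self-contained sketch of the metric (the toy strings are illustrative, not drawn from any benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("toi di hoc", "toi di choi"))  # 1 substitution / 3 reference words
```

Published numbers typically come from a library such as `jiwer` after text normalization (casing, punctuation), so small discrepancies between papers often trace back to normalization choices rather than the model.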
## Table 3: Cross-Model Benchmark Comparison

### VIVOS (Read Speech, ~15h)
| Model | WER (%) | CER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 4.67 | - | SOTA |
| PhoWhisper-small | 6.33 | - | |
| wav2vec2-viet-250h (no LM) | 10.77 | - | Fine-tuned |
| wav2vec2-viet-250h (+ 4-gram LM) | 6.15 | - | LM decoding |
| Conformer-CTC + LM | 9.15 | 10.2 | With n-gram LM |
| Conformer-CTC (no LM) | 10.71 | 12.21 | Without LM |
| PhoWhisper-tiny | 10.41 | - | Smallest Whisper |
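The gap between the "(no LM)" and "(+ 4-gram LM)" rows comes entirely from decoding: without a language model, CTC output is typically decoded greedily (argmax per frame, collapse repeats, drop blanks), while the LM rows use beam search with n-gram shallow fusion (e.g., via pyctcdecode/KenLM). A sketch of the greedy step, with hypothetical token IDs standing in for a real vocabulary:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Greedy CTC decoding: given per-frame argmax token IDs, collapse
    consecutive repeats, then drop blank tokens."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# blank=0; a repeated token separated by a blank survives as two tokens
print(ctc_greedy_decode([0, 5, 5, 0, 3, 0, 3, 3]))  # [5, 3, 3]
```

Greedy decoding commits to the locally best token per frame; an external n-gram LM lets beam search recover from acoustically ambiguous frames, which is why the same acoustic model drops from 10.77% to 6.15% WER here.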
### Common Voice Vietnamese
| Model | WER (%) | Notes |
|---|---|---|
| PhoWhisper-large | 8.14 | SOTA |
| PhoWhisper-small | 11.08 | |
| VietASR (68M, iter3) | 11.46 | 22x fewer params than Whisper-large |
| wav2vec2-viet-250h (+ 4-gram LM) | 11.52 | With LM decoding |
| wav2vec2-viet-250h (no LM) | 18.34 | |
| PhoWhisper-tiny | 19.05 | |
### VLSP 2020
| Model | Task-1 WER (%) | Task-2 WER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 13.75 | 26.68 | Best Task-2 result |
| PhoWhisper-medium | 14.12 | 26.85 | |
| wav2vec2-viet-250h (+ LM) | 9.11 | 40.81 | Best Task-1 result (with LM) |
| wav2vec2-viet-250h (no LM) | 13.33 | 51.45 | |
| PhoWhisper-tiny | 20.74 | 49.85 | |
### GigaSpeech 2 (Real-world)
| Model | WER (%) | Params | Notes |
|---|---|---|---|
| VietASR (iter3) | 7.68 | 68M | SOTA; 22x fewer params than Whisper large-v3 |
| Azure Speech CLI | ~11.78 avg | - | Commercial |
| Whisper Large-v3 | ~16.44 avg | 1,542M | Zero-shot |
### ViMD (Multi-Dialect, 102.56h)
| Model | Northern WER (%) | Central WER (%) | Southern WER (%) | All WER (%) |
|---|---|---|---|---|
| wav2vec2-base-vi-vlsp2020 (FT) | 12.17 | 15.3 | 14.2 | 12.24 |
| wav2vec2-base-vietnamese-250h (FT) | 13.5 | 17.15 | 14.8 | 13.1 |
| PhoWhisper-base (FT) | 13.2 | 16.4 | 13.54 | 13.0 |
| wav2vec2-base-vietnamese-160h (no FT) | - | - | - | 31.74 |
| Whisper base (no FT) | - | - | - | 31.38 |
### VietMed (Medical Domain)
| Model | WER (%) | Notes |
|---|---|---|
| XLSR-53 (baseline) | 51.8 | Standard XLSR-53 |
| XLSR-53-Viet | 29.6 | Pre-trained on Vietnamese |
| w2v2-Viet | ~35-40 | Estimated |
## Table 4: LoRA vs Full Fine-tuning Comparison
| Approach | Model | Trainable Params | WER Improvement | Notes |
|---|---|---|---|---|
| Full fine-tuning | PhoWhisper-large | 100% (~1.5B) | SOTA | Best performance |
| LoRA | Whisper small (Phung et al.) | ~5-10% | 20% WER reduction | Domain-specific |
| LoRA-Whisper | Whisper (multilingual) | 5% | Near-monolingual parity | Language-specific LoRA |
| LoRA (ASR-1) | Whisper large-v3 | ~5-10% | TBD | Current project |
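The "Trainable Params" column can be sanity-checked with simple arithmetic: LoRA freezes each adapted d×k weight W and learns a low-rank update W + BA, with B of shape d×r and A of shape r×k, so the trainable fraction per adapted matrix is r(d+k)/(dk). A sketch with illustrative numbers (Whisper-small's model width is 768; the rank and the choice of adapted matrices below are assumptions, not values from the cited work):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Parameter counts for one frozen d x k weight with a rank-r
    LoRA update W + B @ A, where B is d x r and A is r x k."""
    full = d * k          # frozen base parameters
    lora = r * (d + k)    # trainable adapter parameters
    return full, lora

# Illustrative: one 768x768 attention projection (Whisper-small width), rank 16.
full, lora = lora_param_counts(768, 768, 16)
print(lora / full)  # 16*(768+768) / (768*768) = 1/24 ≈ 4.2% per adapted matrix
```

Adapting only the attention projections of every layer at a modest rank lands in the ~5-10% range quoted above, which is why LoRA rows report large WER gains at a small fraction of the optimizer state and memory cost of full fine-tuning.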
## Table 5: Dataset Comparison
| Dataset | Total Hours | Labeled | Speakers | Domain | Dialects | Year | Venue |
|---|---|---|---|---|---|---|---|
| VIVOS | 15 | 15h | 65 | Read speech | Limited | 2016 | - |
| Common Voice vi | ~100+ | All | Many | Read speech | Mixed | Ongoing | Mozilla |
| VLSP 2020 T1 | ~100 | All | Many | Broadcast | Mixed | 2020 | VLSP |
| VLSP 2020 T2 | ~100 | All | Many | Conversational | Mixed | 2020 | VLSP |
| VietMed | 2216 | 16h | Many | Medical | All accents | 2024 | LREC-COLING |
| ViMD | 102.56 | All | Many | Mixed | 63 provinces | 2024 | EMNLP |
| GigaSpeech 2 | Large | Mixed | Many | Multi-domain | Mixed | 2024 | arXiv |
| VoxVietnam | Large | All | Many | Multi-genre | Mixed | 2024 | arXiv |
| VietLyrics | - | All | - | Music/lyrics | - | 2025 | arXiv |
## Table 6: Approach Evolution Timeline
| Period | Dominant Approach | Key Models | Typical WER |
|---|---|---|---|
| 2015-2018 | GMM-HMM / DNN-HMM | Kaldi-based systems | 20-30%+ |
| 2019-2020 | E2E + Language Models | VAIS ASR, VLSP entries | 15-25% |
| 2020-2022 | Self-supervised (wav2vec2) | wav2vec2-vi, XLSR fine-tuned | 10-20% |
| 2023-2024 | Whisper fine-tuning | PhoWhisper, LoRA-Whisper | 6-15% |
| 2024-2025 | Large-scale pre-training + minimal labels | VietASR, Conformer-CTC | 5-10% |
| 2025+ | Specialized (CS, AVSR, edge) | TSPC, ViCocktail, Moonshine | TBD |