# Comparison Tables: Vietnamese ASR

## Table 1: Model Architecture Comparison
| Model | Year | Architecture | Base Model | Params | Training Data | Training Method |
|---|---|---|---|---|---|---|
| PhoWhisper-large | 2024 | Encoder-Decoder | Whisper large-v3 | ~1.5B | 844h Vietnamese | Full fine-tuning |
| PhoWhisper-base | 2024 | Encoder-Decoder | Whisper base | ~74M | 844h Vietnamese | Full fine-tuning |
| VietASR | 2025 | Self-supervised | Custom | - | 50h labeled + large unlabeled | Pre-train + fine-tune |
| Conformer-CTC (NeMo) | 2024 | Conformer-CTC | From scratch | 121M | 10k+ hours | Supervised |
| wav2vec2-vi-vlsp2020 | 2020 | wav2vec2 + CTC | wav2vec2-base | ~95M | VLSP 2020 | Fine-tuning |
| wav2vec2-viet-250h | - | wav2vec2 + CTC | wav2vec2-base | ~95M | 250h Vietnamese | Fine-tuning |
| XLSR-53-Viet | 2024 | XLSR-53 + CTC | XLSR-53 | ~315M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| w2v2-Viet | 2024 | wav2vec2 + CTC | wav2vec2-base | ~95M | 1000h+ unlabeled Vi | Pre-train + fine-tune |
| LoRA-Whisper (Phung) | 2024 | Encoder-Decoder | Whisper small/base/tiny | Varies | Military + general Vi | LoRA fine-tuning |
| Fast Conformer Vi | 2024 | Conformer CTC+RNNT | Fast Conformer | - | Vietnamese | CTC + RNNT |
| Moonshine (Vi) | 2025 | Custom small | Custom | Tiny | Vietnamese | Specialized training |
| Whisper large-v3 | 2023 | Encoder-Decoder | - | ~1.5B | 680k hours multilingual | Supervised (weak) |
## Table 2: PhoWhisper Full Benchmark (ICLR 2024)
| Model | Params | Common Voice Vi WER | VIVOS WER | VLSP T1 WER | VLSP T2 WER |
|---|---|---|---|---|---|
| PhoWhisper-tiny | 39M | 19.05% | 10.41% | 20.74% | 49.85% |
| PhoWhisper-base | 74M | 16.19% | 8.46% | 19.70% | 43.01% |
| PhoWhisper-small | 244M | 11.08% | 6.33% | 15.93% | 32.96% |
| PhoWhisper-medium | 769M | 8.27% | 4.97% | 14.12% | 26.85% |
| PhoWhisper-large | 1.55B | 8.14% | 4.67% | 13.75% | 26.68% |
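The WER figures above are word-level edit distances (substitutions + deletions + insertions, divided by the number of reference words). Since written Vietnamese separates every syllable with a space, word-level WER effectively operates at the syllable level. A minimal self-contained sketch of the metric (the toy strings are illustrative, not drawn from any benchmark):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("toi di hoc", "toi di choi"))  # 1 substitution / 3 reference words
```

Published numbers typically come from a library such as `jiwer` after text normalization (casing, punctuation), so small discrepancies between papers often trace back to normalization choices rather than the model.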
## Table 3: Cross-Model Benchmark Comparison

### VIVOS (Read Speech, ~15h)
| Model | WER (%) | CER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 4.67 | - | SOTA |
| PhoWhisper-small | 6.33 | - | |
| wav2vec2-viet-250h (no LM) | 10.77 | - | Fine-tuned |
| wav2vec2-viet-250h (+ 4-gram LM) | 6.15 | - | LM decoding |
| Conformer-CTC + LM | 9.15 | 10.2 | With n-gram LM |
| Conformer-CTC (no LM) | 10.71 | 12.21 | Without LM |
| PhoWhisper-tiny | 10.41 | - | Smallest Whisper |
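The gap between the "(no LM)" and "(+ 4-gram LM)" rows comes entirely from decoding: without a language model, CTC output is typically decoded greedily (argmax per frame, collapse repeats, drop blanks), while the LM rows use beam search with n-gram shallow fusion (e.g., via pyctcdecode/KenLM). A sketch of the greedy step, with hypothetical token IDs standing in for a real vocabulary:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Greedy CTC decoding: given per-frame argmax token IDs, collapse
    consecutive repeats, then drop blank tokens."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# blank=0; a repeated token separated by a blank survives as two tokens
print(ctc_greedy_decode([0, 5, 5, 0, 3, 0, 3, 3]))  # [5, 3, 3]
```

Greedy decoding commits to the locally best token per frame; an external n-gram LM lets beam search recover from acoustically ambiguous frames, which is why the same acoustic model drops from 10.77% to 6.15% WER here.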
### Common Voice Vietnamese
| Model | WER (%) | Notes |
|---|---|---|
| PhoWhisper-large | 8.14 | SOTA |
| PhoWhisper-small | 11.08 | |
| VietASR (68M, iter3) | 11.46 | 22x fewer params than Whisper-large |
| wav2vec2-viet-250h (+ 4-gram LM) | 11.52 | With LM decoding |
| wav2vec2-viet-250h (no LM) | 18.34 | |
| PhoWhisper-tiny | 19.05 | |
### VLSP 2020
| Model | Task-1 WER (%) | Task-2 WER (%) | Notes |
|---|---|---|---|
| PhoWhisper-large | 13.75 | 26.68 | Best Task-2 result |
| PhoWhisper-medium | 14.12 | 26.85 | |
| wav2vec2-viet-250h (+ LM) | 9.11 | 40.81 | Best Task-1 result (with LM) |
| wav2vec2-viet-250h (no LM) | 13.33 | 51.45 | |
| PhoWhisper-tiny | 20.74 | 49.85 | |
### GigaSpeech 2 (Real-world)
| Model | WER (%) | Params | Notes |
|---|---|---|---|
| VietASR (iter3) | 7.68 | 68M | SOTA; 22x fewer params than Whisper large-v3 |
| Azure Speech CLI | ~11.78 avg | - | Commercial |
| Whisper Large-v3 | ~16.44 avg | 1,542M | Zero-shot |
### ViMD (Multi-Dialect, 102.56h)
| Model | Northern WER (%) | Central WER (%) | Southern WER (%) | All WER (%) |
|---|---|---|---|---|
| wav2vec2-base-vi-vlsp2020 (FT) | 12.17 | 15.3 | 14.2 | 12.24 |
| wav2vec2-base-vietnamese-250h (FT) | 13.5 | 17.15 | 14.8 | 13.1 |
| PhoWhisper-base (FT) | 13.2 | 16.4 | 13.54 | 13.0 |
| wav2vec2-base-vietnamese-160h (no FT) | - | - | - | 31.74 |
| Whisper base (no FT) | - | - | - | 31.38 |
### VietMed (Medical Domain)
| Model | WER (%) | Notes |
|---|---|---|
| XLSR-53 (baseline) | 51.8 | Standard XLSR-53 |
| XLSR-53-Viet | 29.6 | Pre-trained on Vietnamese |
| w2v2-Viet | ~35-40 | Estimated |
## Table 4: LoRA vs Full Fine-tuning Comparison
| Approach | Model | Trainable Params | WER Improvement | Notes |
|---|---|---|---|---|
| Full fine-tuning | PhoWhisper-large | 100% (~1.5B) | SOTA | Best performance |
| LoRA | Whisper small (Phung et al.) | ~5-10% | 20% WER reduction | Domain-specific |
| LoRA-Whisper | Whisper (multilingual) | 5% | Near-monolingual parity | Language-specific LoRA |
| LoRA (ASR-1) | Whisper large-v3 | ~5-10% | TBD | Current project |
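The "Trainable Params" column can be sanity-checked with simple arithmetic: LoRA freezes each adapted d×k weight W and learns a low-rank update W + BA, with B of shape d×r and A of shape r×k, so the trainable fraction per adapted matrix is r(d+k)/(dk). A sketch with illustrative numbers (Whisper-small's model width is 768; the rank and the choice of adapted matrices below are assumptions, not values from the cited work):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Parameter counts for one frozen d x k weight with a rank-r
    LoRA update W + B @ A, where B is d x r and A is r x k."""
    full = d * k          # frozen base parameters
    lora = r * (d + k)    # trainable adapter parameters
    return full, lora

# Illustrative: one 768x768 attention projection (Whisper-small width), rank 16.
full, lora = lora_param_counts(768, 768, 16)
print(lora / full)  # 16*(768+768) / (768*768) = 1/24 ≈ 4.2% per adapted matrix
```

Adapting only the attention projections of every layer at a modest rank lands in the ~5-10% range quoted above, which is why LoRA rows report large WER gains at a small fraction of the optimizer state and memory cost of full fine-tuning.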
## Table 5: Dataset Comparison
| Dataset | Total Hours | Labeled | Speakers | Domain | Dialects | Year | Venue |
|---|---|---|---|---|---|---|---|
| VIVOS | 15 | 15h | 65 | Read speech | Limited | 2016 | - |
| Common Voice vi | ~100+ | All | Many | Read speech | Mixed | Ongoing | Mozilla |
| VLSP 2020 T1 | ~100 | All | Many | Broadcast | Mixed | 2020 | VLSP |
| VLSP 2020 T2 | ~100 | All | Many | Conversational | Mixed | 2020 | VLSP |
| VietMed | 2216 | 16h | Many | Medical | All accents | 2024 | LREC-COLING |
| ViMD | 102.56 | All | Many | Mixed | 63 provinces | 2024 | EMNLP |
| GigaSpeech 2 | Large | Mixed | Many | Multi-domain | Mixed | 2024 | arXiv |
| VoxVietnam | Large | All | Many | Multi-genre | Mixed | 2024 | arXiv |
| VietLyrics | - | All | - | Music/lyrics | - | 2025 | arXiv |
## Table 6: Approach Evolution Timeline
| Period | Dominant Approach | Key Models | Typical WER |
|---|---|---|---|
| 2015-2018 | GMM-HMM / DNN-HMM | Kaldi-based systems | 20-30%+ |
| 2019-2020 | E2E + Language Models | VAIS ASR, VLSP entries | 15-25% |
| 2020-2022 | Self-supervised (wav2vec2) | wav2vec2-vi, XLSR fine-tuned | 10-20% |
| 2023-2024 | Whisper fine-tuning | PhoWhisper, LoRA-Whisper | 6-15% |
| 2024-2025 | Large-scale pre-training + minimal labels | VietASR, Conformer-CTC | 5-10% |
| 2025+ | Specialized (CS, AVSR, edge) | TSPC, ViCocktail, Moonshine | TBD |