# Vietnamese Zipformer ASR

## Model description

- Model type: automatic speech recognition (ASR); streaming-capable transducer (Zipformer encoder + stateless decoder + joiner), exported to ONNX and quantized to INT8 for deployment.
- Languages: Vietnamese (primary); handles Vietnamese–English code-switched speech.
- Training framework: Icefall (k2 + Lhotse); pruned RNN-T loss with optional CTC; ScaledAdam optimizer with the Eden scheduler; mixed precision (FP16/BF16).
- Evaluation (results-4): WER on the VietCasual test sets (combined A+B), normalized transcripts, computed with jiwer.
## Model variants and performance
| Variant | Checkpoint | Overall WER (%) | Notes |
|---|---|---|---|
| exp7 | epoch-4-avg-4.int8 | 8.22 | Best WER in results-4 |
| exp7 | epoch-10-avg-10.int8 | 8.43 | |
| exp7 | epoch-15-avg-15.int8 | 8.43 | |
| exp7 | epoch-1-avg-1.int8 | 9.80 | Early epoch |
| zipformer-6000h | epoch-20-avg-10.int8 | 21.93 | ~6k-hour pretrain; different data/domain |
| viet_iter | epoch-12-avg-8.int8 | 25.79 | Different training setup |
- Recommended for VietCasual: exp7 with `epoch-4-avg-4.int8` (8.22% WER).
- Test set: VietCasual (`vietcasual_test_A_1000` + `vietcasual_test_B_1000`), casual Vietnamese with code-switched English; transcripts normalized (e.g. `soe_vinorm`, punctuation cleaning).
## Intended use
- Primary: Vietnamese ASR and Vietnamese–English code-switched speech (e.g. podcasts, interviews, casual dialogue).
- Deployment: On-device or server with sherpa-onnx (INT8 ONNX); supports streaming.
- Out-of-scope: purely read-aloud formal written text; other languages without adaptation; high-noise or heavily accented speech without further fine-tuning.
## Training and evaluation data

- Training (representative): Vietnamese cuts (e.g. `vi_cuts_train`), optionally VietCasualSpeech, VietCetera, VietSuccess; LibriSpeech or mixed data for pretraining/fine-tuning.
- Evaluation (this card): VietCasual test A+B; WER computed with jiwer on normalized reference vs. hypothesis.
## Evaluation setup

- Metric: word error rate (WER), case-sensitive, computed on normalized text.
- Tooling: sherpa-onnx transducer decoder; `run_wer_casual.py` (or equivalent); combined `wer_results.txt` / `wer_errors.txt` per checkpoint.
- Outputs: per-utterance lines of the form `dataset_tag:wav_name | reference | hypothesis | wer_pct`; overall WER reported at the top of `wer_results.txt`.
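For reference, the jiwer-style per-utterance metric can be sketched as plain Levenshtein distance over whitespace-split words of the normalized reference and hypothesis. This is a minimal stand-in, not the project's actual scoring code; the function name `wer_pct` is illustrative.

```python
def wer_pct(reference: str, hypothesis: str) -> float:
    """Word error rate (%) as edit distance over whitespace tokens.

    Minimal stand-in for the jiwer-based computation; assumes both
    strings are already normalized (e.g. soe_vinorm, punctuation cleaning).
    """
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                            # deletion
                d[i][j - 1] + 1,                            # insertion
            )
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer_pct("xin chao viet nam", "xin chao vietnam")` gives 50.0: one substitution plus one deletion over four reference words.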
## Limitations and bias
- Performance depends on domain and normalization; code-switched English segments can have higher error rates.
- INT8 quantization may slightly affect accuracy vs FP32.
- Not tested for robustness to strong accents, dialects, or very noisy environments in this card.
## How to use (inference)

Example layout for sherpa-onnx (e.g. under `exp7/epoch-4-avg-4.int8/`):

- `encoder-<name>.onnx`
- `decoder-<name>.onnx`
- `joiner-<name>.onnx`
- `tokens.txt`
Use the sherpa-onnx transducer API with the files above and 16 kHz (or model-native) audio. See the project scripts (e.g. `utils/WER/run_wer_casual.py`) for batch WER evaluation.
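The layout above plugs into the sherpa-onnx Python streaming API roughly as follows. This is a sketch: the `<name>` suffix, feature dimension, and decoding method are assumptions to adjust for your export, and `model_paths`/`transcribe` are illustrative helpers, not project code.

```python
import os

# Hypothetical layout: the exact ONNX file names under e.g.
# exp7/epoch-4-avg-4.int8/ depend on the export; adjust to match.
PARTS = ("encoder", "decoder", "joiner")

def model_paths(model_dir, name):
    """Build the file paths sherpa-onnx needs for one checkpoint."""
    paths = {part: os.path.join(model_dir, f"{part}-{name}.onnx") for part in PARTS}
    paths["tokens"] = os.path.join(model_dir, "tokens.txt")
    return paths

def transcribe(model_dir, name, samples, sample_rate=16000):
    """Decode one utterance with the sherpa-onnx streaming transducer API.

    `samples` is a 1-D float32 array of audio in [-1, 1] at `sample_rate`.
    """
    import sherpa_onnx  # pip install sherpa-onnx

    p = model_paths(model_dir, name)
    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=p["tokens"],
        encoder=p["encoder"],
        decoder=p["decoder"],
        joiner=p["joiner"],
        sample_rate=sample_rate,
        feature_dim=80,               # assumption: 80-dim fbank features
        decoding_method="greedy_search",
    )
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

For live streaming, feed audio chunks to `accept_waveform` as they arrive and call `decode_stream` whenever `is_ready` returns true, instead of waiting for `input_finished`.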
## Citation
If you use these models or results-4 metrics, please cite the training framework (e.g. Icefall/k2, Lhotse) and this card’s WER table and evaluation setup.
## Summary
| Field | Value |
|---|---|
| Best WER (results-4) | 8.22% (exp7, epoch-4-avg-4.int8) |
| Test set | VietCasual A+B |
| Model form | ONNX INT8 transducer (Zipformer) |
| Languages | Vietnamese, Vietnamese–English code-switch |