
Vietnamese Zipformer ASR

Model description

Model type: Automatic speech recognition (ASR) — streaming-capable transducer (Zipformer encoder + stateless decoder + joiner), exported as ONNX and quantized to INT8 for deployment.

Languages: Vietnamese (primary); handles Vietnamese–English code-switched speech.

Training framework: Icefall (k2 + Lhotse); pruned RNN-T loss; optional CTC; ScaledAdam + Eden scheduler; mixed precision (FP16/BF16).

Evaluation (results-4): WER on the combined VietCasual test sets (A + B), computed with jiwer on normalized transcripts.


Model variants and performance

| Variant | Checkpoint | Overall WER (%) | Notes |
| --- | --- | --- | --- |
| exp7 | epoch-4-avg-4.int8 | 8.22 | Best WER in results-4 |
| exp7 | epoch-10-avg-10.int8 | 8.43 | |
| exp7 | epoch-15-avg-15.int8 | 8.43 | |
| exp7 | epoch-1-avg-1.int8 | 9.80 | Early epoch |
| zipformer-6000h | epoch-20-avg-10.int8 | 21.93 | ~6k-hour pretrain; different data/domain |
| viet_iter | epoch-12-avg-8.int8 | 25.79 | Different training setup |
  • Recommended for VietCasual: exp7 with epoch-4-avg-4.int8 (8.22% WER).
  • Test set: VietCasual (vietcasual_test_A_1000 + vietcasual_test_B_1000), casual Vietnamese and code-switched English; transcripts normalized (e.g. soe_vinorm, punctuation cleaning).
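The recommendation above follows directly from the WER table; a minimal sketch of that selection (the dict literal below is transcribed by hand from the table, not read from any results file):

```python
# Pick the checkpoint with the lowest overall WER from the table above.
# Keys are (variant, checkpoint); values are WER percentages.
results = {
    ("exp7", "epoch-4-avg-4.int8"): 8.22,
    ("exp7", "epoch-10-avg-10.int8"): 8.43,
    ("exp7", "epoch-15-avg-15.int8"): 8.43,
    ("exp7", "epoch-1-avg-1.int8"): 9.80,
    ("zipformer-6000h", "epoch-20-avg-10.int8"): 21.93,
    ("viet_iter", "epoch-12-avg-8.int8"): 25.79,
}

# min over keys, ranked by their WER value
best = min(results, key=results.get)
print(best, results[best])  # ('exp7', 'epoch-4-avg-4.int8') 8.22
```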

Intended use

  • Primary: Vietnamese ASR and Vietnamese–English code-switched speech (e.g. podcasts, interviews, casual dialogue).
  • Deployment: On-device or server with sherpa-onnx (INT8 ONNX); supports streaming.
  • Out-of-scope: read-aloud renditions of formal written text; languages other than Vietnamese/English without adaptation; very noisy or heavily accented speech without further fine-tuning.

Training and evaluation data

  • Training (representative): Vietnamese cuts (e.g. vi_cuts_train), optionally VietCasualSpeech, VietCetera, VietSuccess; LibriSpeech or mixed data for pretraining/fine-tuning.
  • Evaluation (this card): VietCasual test A+B; WER computed with jiwer on normalized reference vs hypothesis.

Evaluation setup

  • Metric: Word error rate (WER), case-sensitive on normalized text.
  • Tooling: sherpa-onnx transducer decoder; run_wer_casual.py (or equivalent); combined wer_results.txt / wer_errors.txt per checkpoint.
  • Outputs: Per-utterance dataset_tag:wav_name | reference | hypothesis | wer_pct; overall WER reported at top of wer_results.txt.
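The metric and the per-utterance output format above can be sketched in plain Python. This is illustrative only: the real pipeline computes WER with jiwer inside run_wer_casual.py, and `wer`/`format_line` below are hypothetical helper names.

```python
# Word-level WER via Levenshtein edit distance, plus the per-utterance
# line format described above: dataset_tag:wav_name | ref | hyp | wer_pct.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return 100.0 * dp[-1][-1] / max(len(ref), 1)

def format_line(dataset_tag: str, wav_name: str,
                reference: str, hypothesis: str) -> str:
    pct = wer(reference, hypothesis)
    return f"{dataset_tag}:{wav_name} | {reference} | {hypothesis} | {pct:.2f}"
```

For example, `wer("xin chào các bạn", "xin chào bạn")` is 25.0: one deleted word out of four reference words.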

Limitations and bias

  • Performance depends on domain and normalization; code-switched English segments can have higher error rates.
  • INT8 quantization may slightly affect accuracy vs FP32.
  • Not tested for robustness to strong accents, dialects, or very noisy environments in this card.

How to use (inference)

Example layout for sherpa-onnx (e.g. under exp7/epoch-4-avg-4.int8/):

  • encoder-<name>.onnx
  • decoder-<name>.onnx
  • joiner-<name>.onnx
  • tokens.txt

Use the sherpa-onnx transducer API with these files and 16 kHz (or model-native) audio. See the project scripts (e.g. utils/WER/run_wer_casual.py) for batch WER evaluation.


Citation

If you use these models or results-4 metrics, please cite the training framework (e.g. Icefall/k2, Lhotse) and this card’s WER table and evaluation setup.


Summary

| Field | Value |
| --- | --- |
| Best WER (results-4) | 8.22% (exp7, epoch-4-avg-4.int8) |
| Test set | VietCasual A+B |
| Model form | ONNX INT8 transducer (Zipformer) |
| Languages | Vietnamese, Vietnamese–English code-switch |