# Vietnamese Zipformer ASR

## Model description

**Model type:** Automatic speech recognition (ASR) — streaming-capable transducer (Zipformer encoder + stateless decoder + joiner), exported to ONNX and quantized to INT8 for deployment.

**Languages:** Vietnamese (primary); handles Vietnamese–English code-switched speech.

**Training framework:** Icefall (k2 + Lhotse); pruned RNN-T loss; optional CTC; ScaledAdam optimizer with the Eden scheduler; mixed precision (FP16/BF16).

**Evaluation (results-4):** WER reported on the VietCasual test sets (combined A+B), computed with jiwer on normalized transcripts.
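To illustrate how the three transducer components listed above interact, here is a toy greedy-decoding sketch. Everything in it is a hypothetical stand-in (integer "frames", toy `decoder_fn`/`joiner_fn`), not the real Zipformer model:

```python
BLANK = 0  # transducer blank token id

def greedy_decode(encoder_out, decoder_fn, joiner_fn, max_symbols_per_frame=3):
    """Toy transducer greedy search.
    encoder_out: list of per-frame encoder outputs.
    decoder_fn(context) -> decoder output for the current label context.
    joiner_fn(enc, dec) -> list of logits over the vocabulary."""
    hyp = []  # emitted (non-blank) token ids
    for enc_vec in encoder_out:
        emitted = 0
        while emitted < max_symbols_per_frame:
            dec_vec = decoder_fn(hyp)
            logits = joiner_fn(enc_vec, dec_vec)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK:
                break  # advance to the next acoustic frame
            hyp.append(best)
            emitted += 1
    return hyp

# Tiny usage with hypothetical toy components: each nonzero "frame" value
# is favored as a token id; 0 favors blank.
def toy_decoder(ctx):
    return len(ctx)  # stateless: depends only on context length here

def toy_joiner(enc_vec, dec_vec):
    logits = [0.0, 0.0, 0.0]
    logits[enc_vec] += 1.0
    return logits

assert greedy_decode([1, 2, 0], toy_decoder, toy_joiner, max_symbols_per_frame=1) == [1, 2]
```

The stateless decoder is the key structural point: unlike an LSTM predictor, it conditions only on a short label context, which is what makes the exported ONNX graph simple and streaming-friendly.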

---

## Model variants and performance

| Variant | Checkpoint | Overall WER (%) | Notes |
|---------|------------|-----------------|-------|
| **exp7** | epoch-4-avg-4.int8 | **8.22** | Best WER in results-4 |
| **exp7** | epoch-10-avg-10.int8 | 8.43 | |
| **exp7** | epoch-15-avg-15.int8 | 8.43 | |
| **exp7** | epoch-1-avg-1.int8 | 9.80 | Early epoch |
| **zipformer-6000h** | epoch-20-avg-10.int8 | 21.93 | ~6k-hour pretrain; different data/domain |
| **viet_iter** | epoch-12-avg-8.int8 | 25.79 | Different training setup |

- **Recommended for VietCasual:** **exp7** with **epoch-4-avg-4.int8** (8.22% WER).
- Test set: VietCasual (vietcasual_test_A_1000 + vietcasual_test_B_1000), casual Vietnamese and code-switched English; transcripts normalized (e.g. soe_vinorm, punctuation cleaning).
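As an illustration of the kind of transcript normalization applied before scoring, here is a minimal sketch. It is NOT the actual soe_vinorm pipeline; `normalize_transcript` is a hypothetical helper showing the usual steps (Unicode NFC so Vietnamese diacritics compare consistently, lowercasing, punctuation removal, whitespace collapsing):

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative normalization, not the project's actual pipeline."""
    text = unicodedata.normalize("NFC", text)  # compose diacritics consistently
    text = text.lower()
    # Drop punctuation (Unicode category P*) but keep letters and digits.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

assert normalize_transcript("Xin chào,  thế giới!") == "xin chào thế giới"
```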

---

## Intended use

- **Primary:** Vietnamese ASR and Vietnamese–English code-switched speech (e.g. podcasts, interviews, casual dialogue).
- **Deployment:** On-device or server via sherpa-onnx (INT8 ONNX); supports streaming.
- **Out-of-scope:** read speech of formal written text; languages other than Vietnamese/English without adaptation; high-noise or heavily accented speech without further fine-tuning.

---

## Training and evaluation data

- **Training (representative):** Vietnamese cuts (e.g. vi_cuts_train), optionally VietCasualSpeech, VietCetera, VietSuccess; LibriSpeech or mixed data for pretraining/fine-tuning.
- **Evaluation (this card):** VietCasual test A+B; WER computed with jiwer on normalized reference vs. hypothesis.
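The WER referenced above can be sketched as a plain Levenshtein distance over words. This is a minimal stand-in for what jiwer computes, not the project's actual scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)

assert wer("a b c", "a x c") == 1 / 3  # one substitution over three words
```

Normalization matters here: the same substitution can appear or disappear depending on how punctuation and casing are cleaned, which is why the card reports WER on normalized text.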

---

## Evaluation setup

- **Metric:** Word error rate (WER), case-sensitive, computed on normalized text.
- **Tooling:** sherpa-onnx transducer decoder; `run_wer_casual.py` (or equivalent); combined `wer_results.txt` / `wer_errors.txt` per checkpoint.
- **Outputs:** Per-utterance lines of the form `dataset_tag:wav_name | reference | hypothesis | wer_pct`; overall WER reported at the top of `wer_results.txt`.
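A small parser for the per-utterance line format described above. The field layout is taken from this card; `parse_wer_line` is a hypothetical helper, and it assumes the text fields themselves contain no `|` characters:

```python
def parse_wer_line(line: str) -> dict:
    """Parse 'dataset_tag:wav_name | reference | hypothesis | wer_pct'."""
    tag_and_wav, reference, hypothesis, wer_pct = (f.strip() for f in line.split("|"))
    dataset_tag, _, wav_name = tag_and_wav.partition(":")
    return {
        "dataset_tag": dataset_tag,
        "wav_name": wav_name,
        "reference": reference,
        "hypothesis": hypothesis,
        "wer_pct": float(wer_pct),
    }

row = parse_wer_line("vietcasual_test_A_1000:utt1.wav | xin chào | xin chao | 25.0")
assert row["dataset_tag"] == "vietcasual_test_A_1000"
assert row["wer_pct"] == 25.0
```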

---

## Limitations and bias

- Performance depends on domain and normalization; code-switched English segments can have higher error rates.
- INT8 quantization may slightly reduce accuracy relative to FP32.
- Robustness to strong accents, dialects, and very noisy environments has not been tested for this card.

---

## How to use (inference)

Example layout for sherpa-onnx (e.g. under `exp7/epoch-4-avg-4.int8/`):

- `encoder-<name>.onnx`
- `decoder-<name>.onnx`
- `joiner-<name>.onnx`
- `tokens.txt`

Use the sherpa-onnx transducer API with the files above and 16 kHz (or model-native) audio. See the project scripts (e.g. `utils/WER/run_wer_casual.py`) for batch WER evaluation.
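The steps above can be sketched in Python, assuming the sherpa-onnx offline (non-streaming) transducer API; the file names, `num_threads` value, and both helper names are illustrative assumptions, not the project's actual scripts:

```python
import array
import wave

def load_wav_mono_f32(path: str):
    """Read a 16-bit PCM mono WAV and return (sample_rate, samples in [-1, 1))."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        pcm = array.array("h", w.readframes(w.getnframes()))
        return w.getframerate(), [s / 32768.0 for s in pcm]

def transcribe(wav_path: str, model_dir: str) -> str:
    """Sketch of offline decoding with sherpa-onnx (pip install sherpa-onnx).
    The import is local so the loader above works without it installed."""
    import sherpa_onnx
    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        # Replace with the actual encoder-<name>.onnx etc. file names.
        encoder=f"{model_dir}/encoder.onnx",
        decoder=f"{model_dir}/decoder.onnx",
        joiner=f"{model_dir}/joiner.onnx",
        tokens=f"{model_dir}/tokens.txt",
        num_threads=2,
    )
    sample_rate, samples = load_wav_mono_f32(wav_path)
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    return stream.result.text
```

For streaming use, sherpa-onnx provides an online recognizer with the same encoder/decoder/joiner files; the offline API shown here is the simpler starting point for batch WER runs.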

---

## Citation

If you use these models or the results-4 metrics, please cite the training framework (e.g. Icefall/k2, Lhotse) and this card's WER table and evaluation setup.

---

## Summary

| Field | Value |
|-------|-------|
| Best WER (results-4) | **8.22%** (exp7, epoch-4-avg-4.int8) |
| Test set | VietCasual A+B |
| Model form | ONNX INT8 transducer (Zipformer) |
| Languages | Vietnamese, Vietnamese–English code-switch |