# Vietnamese Zipformer ASR

## Model description

**Model type:** Automatic speech recognition (ASR) — streaming-capable transducer (Zipformer encoder + stateless decoder + joiner), exported to ONNX and quantized to INT8 for deployment.

**Languages:** Vietnamese (primary); handles Vietnamese–English code-switched speech.

**Training framework:** Icefall (k2 + Lhotse); pruned RNN-T loss; optional CTC; ScaledAdam optimizer with the Eden scheduler; mixed precision (FP16/BF16).

**Evaluation (results-4):** WER reported on the VietCasual test sets (combined A+B), computed with jiwer on normalized transcripts.
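To illustrate how the three transducer components listed above interact, here is a toy greedy-decoding sketch. Everything in it is a hypothetical stand-in (integer "frames", toy `decoder_fn`/`joiner_fn`), not the real Zipformer model:

```python
BLANK = 0  # transducer blank token id

def greedy_decode(encoder_out, decoder_fn, joiner_fn, max_symbols_per_frame=3):
    """Toy transducer greedy search.
    encoder_out: list of per-frame encoder outputs.
    decoder_fn(context) -> decoder output for the current label context.
    joiner_fn(enc, dec) -> list of logits over the vocabulary."""
    hyp = []  # emitted (non-blank) token ids
    for enc_vec in encoder_out:
        emitted = 0
        while emitted < max_symbols_per_frame:
            dec_vec = decoder_fn(hyp)
            logits = joiner_fn(enc_vec, dec_vec)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK:
                break  # advance to the next acoustic frame
            hyp.append(best)
            emitted += 1
    return hyp

# Tiny usage with hypothetical toy components: each nonzero "frame" value
# is favored as a token id; 0 favors blank.
def toy_decoder(ctx):
    return len(ctx)  # stateless: depends only on context length here

def toy_joiner(enc_vec, dec_vec):
    logits = [0.0, 0.0, 0.0]
    logits[enc_vec] += 1.0
    return logits

assert greedy_decode([1, 2, 0], toy_decoder, toy_joiner, max_symbols_per_frame=1) == [1, 2]
```

The stateless decoder is the key structural point: unlike an LSTM predictor, it conditions only on a short label context, which is what makes the exported ONNX graph simple and streaming-friendly.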

---

## Model variants and performance

| Variant | Checkpoint | Overall WER (%) | Notes |
|---------|------------|-----------------|-------|
| **exp7** | epoch-4-avg-4.int8 | **8.22** | Best WER in results-4 |
| **exp7** | epoch-10-avg-10.int8 | 8.43 | |
| **exp7** | epoch-15-avg-15.int8 | 8.43 | |
| **exp7** | epoch-1-avg-1.int8 | 9.80 | Early epoch |
| **zipformer-6000h** | epoch-20-avg-10.int8 | 21.93 | ~6k-hour pretrain; different data/domain |
| **viet_iter** | epoch-12-avg-8.int8 | 25.79 | Different training setup |

- **Recommended for VietCasual:** **exp7** with **epoch-4-avg-4.int8** (8.22% WER).
- Test set: VietCasual (vietcasual_test_A_1000 + vietcasual_test_B_1000), casual Vietnamese and code-switched English; transcripts normalized (e.g. soe_vinorm, punctuation cleaning).
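As an illustration of the kind of transcript normalization applied before scoring, here is a minimal sketch. It is NOT the actual soe_vinorm pipeline; `normalize_transcript` is a hypothetical helper showing the usual steps (Unicode NFC so Vietnamese diacritics compare consistently, lowercasing, punctuation removal, whitespace collapsing):

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative normalization, not the project's actual pipeline."""
    text = unicodedata.normalize("NFC", text)  # compose diacritics consistently
    text = text.lower()
    # Drop punctuation (Unicode category P*) but keep letters and digits.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", text).strip()

assert normalize_transcript("Xin chào,  thế giới!") == "xin chào thế giới"
```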

---

## Intended use

- **Primary:** Vietnamese ASR and Vietnamese–English code-switched speech (e.g. podcasts, interviews, casual dialogue).
- **Deployment:** On-device or server via sherpa-onnx (INT8 ONNX); supports streaming.
- **Out-of-scope:** read speech of formal written text; languages other than Vietnamese/English without adaptation; high-noise or heavily accented speech without further fine-tuning.

---

## Training and evaluation data

- **Training (representative):** Vietnamese cuts (e.g. vi_cuts_train), optionally VietCasualSpeech, VietCetera, VietSuccess; LibriSpeech or mixed data for pretraining/fine-tuning.
- **Evaluation (this card):** VietCasual test A+B; WER computed with jiwer on normalized reference vs. hypothesis.
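The WER referenced above can be sketched as a plain Levenshtein distance over words. This is a minimal stand-in for what jiwer computes, not the project's actual scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = cur
    return prev[-1] / max(len(ref), 1)

assert wer("a b c", "a x c") == 1 / 3  # one substitution over three words
```

Normalization matters here: the same substitution can appear or disappear depending on how punctuation and casing are cleaned, which is why the card reports WER on normalized text.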

---

## Evaluation setup

- **Metric:** Word error rate (WER), case-sensitive, computed on normalized text.
- **Tooling:** sherpa-onnx transducer decoder; `run_wer_casual.py` (or equivalent); combined `wer_results.txt` / `wer_errors.txt` per checkpoint.
- **Outputs:** Per-utterance lines of the form `dataset_tag:wav_name | reference | hypothesis | wer_pct`; overall WER reported at the top of `wer_results.txt`.
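A small parser for the per-utterance line format described above. The field layout is taken from this card; `parse_wer_line` is a hypothetical helper, and it assumes the text fields themselves contain no `|` characters:

```python
def parse_wer_line(line: str) -> dict:
    """Parse 'dataset_tag:wav_name | reference | hypothesis | wer_pct'."""
    tag_and_wav, reference, hypothesis, wer_pct = (f.strip() for f in line.split("|"))
    dataset_tag, _, wav_name = tag_and_wav.partition(":")
    return {
        "dataset_tag": dataset_tag,
        "wav_name": wav_name,
        "reference": reference,
        "hypothesis": hypothesis,
        "wer_pct": float(wer_pct),
    }

row = parse_wer_line("vietcasual_test_A_1000:utt1.wav | xin chào | xin chao | 25.0")
assert row["dataset_tag"] == "vietcasual_test_A_1000"
assert row["wer_pct"] == 25.0
```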

---

## Limitations and bias

- Performance depends on domain and normalization; code-switched English segments can have higher error rates.
- INT8 quantization may slightly reduce accuracy relative to FP32.
- Robustness to strong accents, dialects, and very noisy environments has not been tested for this card.

---

## How to use (inference)

Example layout for sherpa-onnx (e.g. under `exp7/epoch-4-avg-4.int8/`):

- `encoder-<name>.onnx`
- `decoder-<name>.onnx`
- `joiner-<name>.onnx`
- `tokens.txt`

Use the sherpa-onnx transducer API with the files above and 16 kHz (or model-native) audio. See the project scripts (e.g. `utils/WER/run_wer_casual.py`) for batch WER evaluation.
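The steps above can be sketched in Python, assuming the sherpa-onnx offline (non-streaming) transducer API; the file names, `num_threads` value, and both helper names are illustrative assumptions, not the project's actual scripts:

```python
import array
import wave

def load_wav_mono_f32(path: str):
    """Read a 16-bit PCM mono WAV and return (sample_rate, samples in [-1, 1))."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        pcm = array.array("h", w.readframes(w.getnframes()))
        return w.getframerate(), [s / 32768.0 for s in pcm]

def transcribe(wav_path: str, model_dir: str) -> str:
    """Sketch of offline decoding with sherpa-onnx (pip install sherpa-onnx).
    The import is local so the loader above works without it installed."""
    import sherpa_onnx
    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        # Replace with the actual encoder-<name>.onnx etc. file names.
        encoder=f"{model_dir}/encoder.onnx",
        decoder=f"{model_dir}/decoder.onnx",
        joiner=f"{model_dir}/joiner.onnx",
        tokens=f"{model_dir}/tokens.txt",
        num_threads=2,
    )
    sample_rate, samples = load_wav_mono_f32(wav_path)
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    return stream.result.text
```

For streaming use, sherpa-onnx provides an online recognizer with the same encoder/decoder/joiner files; the offline API shown here is the simpler starting point for batch WER runs.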

---

## Citation

If you use these models or the results-4 metrics, please cite the training framework (e.g. Icefall/k2, Lhotse) and this card's WER table and evaluation setup.

---

## Summary

| Field | Value |
|-------|-------|
| Best WER (results-4) | **8.22%** (exp7, epoch-4-avg-4.int8) |
| Test set | VietCasual A+B |
| Model form | ONNX INT8 transducer (Zipformer) |
| Languages | Vietnamese, Vietnamese–English code-switch |