Khmer Telephony ASR — v9 allsynth-lsall (epoch 7)

Streaming FastConformer–RNNT (BPE) acoustic model for Khmer call-center / telephony speech (8 kHz speech transcoded through G.711, supplied to the model as 16 kHz input). This release is fine-tuned on a large synthetic Khmer corpus combined with the full set of manually annotated telephony utterances; the telephony data is heavily oversampled so that the telephony domain dominates each epoch.

In-domain performance

Evaluated on the full ~7,600-utterance telephony set.

Metric	Value
CER (no-space)	4.24%
CER (raw, spaces counted)	4.36%
Utterances with CER < 10%	6 743 (88.6%)
Utterances with CER 10–30%	466 (6.1%)
Utterances with CER 30–50%	135 (1.8%)
Utterances with CER 50–80%	152 (2.0%)
Utterances with CER ≥ 80%	114 (1.5%)
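The per-utterance buckets in the table can be reproduced with a small helper. The sketch below is illustrative only: the function name `cer_bucket` and the handling of values that fall exactly on a boundary (assigned to the upper bucket) are assumptions, not taken from the evaluation code. It also re-derives the percentage column from the reported counts.

```python
# Upper edges of the CER buckets from the table, as fractions.
# ASSUMPTION: a value exactly on an edge (e.g. 0.10) falls in the upper bucket.
BUCKETS = [(0.10, "< 10%"), (0.30, "10–30%"), (0.50, "30–50%"), (0.80, "50–80%")]


def cer_bucket(cer: float) -> str:
    """Return the table bucket label for one utterance-level CER."""
    for upper, label in BUCKETS:
        if cer < upper:
            return label
    return "≥ 80%"


# Counts as reported in the table above.
REPORTED_COUNTS = {
    "< 10%": 6743,
    "10–30%": 466,
    "30–50%": 135,
    "50–80%": 152,
    "≥ 80%": 114,
}

total = sum(REPORTED_COUNTS.values())
shares = {label: round(100 * n / total, 1) for label, n in REPORTED_COUNTS.items()}
print(total)
print(shares)
```

The counts sum to 7 610 utterances, which matches the "~7,600" figure quoted for the evaluation set.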

Note: this run trained on the full telephony set, so the figures above overlap the training data and are not a held-out generalization estimate. They characterize in-domain fit, not out-of-sample accuracy.

WER is a poor metric for Khmer because the script does not mark word boundaries with whitespace; CER is the meaningful number.
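For readers who want to compute comparable numbers, here is a minimal sketch of the two CER variants (no-space and raw) using a plain Levenshtein distance. It is an assumption that the reported figures count Unicode code points; Khmer uses combining signs, so a grapheme-cluster-based count would give different values. The function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_spaces: bool = True) -> float:
    """Character error rate of one utterance.

    ignore_spaces=True  -> "no-space" CER (all whitespace removed first)
    ignore_spaces=False -> "raw" CER (spaces counted as characters)
    """
    if ignore_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference is empty; CER is undefined")
    return edit_distance(ref, hyp) / len(ref)


ref = "សួស្តី បង"
hyp = "សួស្តីបង"   # same characters, space dropped
print(cer(ref, hyp))                       # no-space CER
print(cer(ref, hyp, ignore_spaces=False))  # raw CER: one deleted space
```

A corpus-level CER is usually the total number of edits divided by the total reference length, not the mean of per-utterance values; the source does not say which aggregation was used.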

Model details

Architecture: FastConformer encoder (17 layers, d_model 512, 8 heads, causal downsampling, chunked-limited attention [[70,13],[70,6],[70,1],[70,0]] for streaming) + RNN-T decoder + BPE/Unigram tokenizer (vocab 2 048).
Init: from an in-house Khmer FastConformer-RNNT base checkpoint.
Fine-tuning data: a large synthetic Khmer telephony corpus mixed with the full set of ~4 000 manually annotated short telephony utterances, with the telephony portion oversampled (×15) so that it forms a large share of each epoch.
Augmentation (training only):
- G.711 transcode (prob 0.5)
- speed perturbation 0.9×–1.1× (prob 0.3)
- environmental noise mix, SNR 15–25 dB (prob 0.5)
- gain −3/+6 dBFS (prob 0.5)
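As a rough illustration of how the probabilities above combine, the sketch below samples an augmentation plan for one training utterance. It is not the training pipeline: the independent draws, the uniform sampling inside each range, and the reading of the gain setting as a range from −3 dB to +6 dB are all assumptions, and `sample_augmentation_plan` is a hypothetical name.

```python
import random


def sample_augmentation_plan(rng: random.Random) -> dict:
    """Decide which augmentations apply to one utterance.

    ASSUMPTIONS: each augmentation is drawn independently, and parameters
    are sampled uniformly within the ranges listed in the model card.
    """
    return {
        # G.711 transcode, applied with probability 0.5
        "g711_transcode": rng.random() < 0.5,
        # speed perturbation 0.9x-1.1x, probability 0.3 (1.0 = unchanged)
        "speed_factor": rng.uniform(0.9, 1.1) if rng.random() < 0.3 else 1.0,
        # noise mix at 15-25 dB SNR, probability 0.5 (None = no noise)
        "noise_snr_db": rng.uniform(15.0, 25.0) if rng.random() < 0.5 else None,
        # gain between -3 and +6 dB, probability 0.5 (0.0 = unchanged)
        "gain_db": rng.uniform(-3.0, 6.0) if rng.random() < 0.5 else 0.0,
    }


rng = random.Random(0)
for _ in range(3):
    print(sample_augmentation_plan(rng))
```

Under these assumptions, roughly one utterance in eleven (0.5 × 0.7 × 0.5 × 0.5 ≈ 0.09) passes through with no augmentation at all.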
Epoch: 7 (val_wer 0.1840 — see note on WER above).
Optimizer: AdamW, Noam-Annealing, peak LR 0.5, warmup 200 steps, weight decay 1e-3.
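The shape of a Noam schedule with these settings can be sketched as follows. This uses the standard Noam formula (linear warmup, then inverse-square-root decay, scaled by d_model^-0.5). Whether the 0.5 above is the literal peak learning rate or the multiplier fed into that formula is not stated; the sketch assumes the latter, so the printed values are illustrative only.

```python
def noam_lr(step: int, base_lr: float = 0.5, d_model: int = 512,
            warmup_steps: int = 200) -> float:
    """Standard Noam schedule.

    ASSUMPTION: base_lr=0.5 is a multiplier, not the effective peak rate.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    scale = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * scale


for s in (1, 100, 200, 400, 2000):
    print(s, noam_lr(s))
```

Under this reading the rate rises linearly for 200 steps, peaks at 0.5 / sqrt(512 × 200) = 0.5 / 320 ≈ 1.56e-3, and then decays with the inverse square root of the step count.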

Files

Path	What it is
`asr-khmer-telephony.nemo` / `epoch7.nemo`	Canonical NeMo bundle (encoder + decoder + joint + tokenizer + config). Load via `EncDecRNNTBPEModel.restore_from(...)`.
`checkpoints/*.ckpt`	Original Lightning checkpoint (~1.4 GB) for resume / surgery.
`conf/khmer_finetune_allsynth_lsall.yaml`	Full training config (architecture + augmentation + optim).
`tokenizer/{tokenizer.model,tokenizer.vocab,vocab.txt}`	SentencePiece tokenizer (also embedded in `.nemo`).

Usage

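from nemo.collections.asr.models import EncDecRNNTBPEModel

# Restore the bundled model (encoder + decoder + joint + tokenizer + config).
model = EncDecRNNTBPEModel.restore_from("asr-khmer-telephony.nemo")

# transcribe() takes a list of audio file paths; the return format varies by NeMo version.
transcripts = model.transcribe(["call_segment_8k.wav"])
print(transcripts)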

Input audio: 16 kHz mono PCM. 8 kHz telephony audio should be upsampled to 16 kHz, because the training augmentation simulates the G.711 distortion at 16 kHz.
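Before transcribing, it can help to check that a file already meets the input requirement. The sketch below uses only Python's standard `wave` module; the helper name `inspect_wav` is hypothetical. Note that `wave` reads linear PCM WAV files only, so a WAV that stores A-law or μ-law samples directly would need to be decoded with another tool first.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, per the model card


def inspect_wav(path: str) -> dict:
    """Report sample rate, channel count and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
        return {
            "sample_rate": rate,
            "channels": channels,
            "duration_s": wf.getnframes() / rate,
            # True only if the file can be passed to the model unchanged
            "ready": rate == TARGET_RATE and channels == 1,
        }
```

If `ready` is false, the file needs resampling and/or downmixing first. One common option is `ffmpeg -i in.wav -ar 16000 -ac 1 out.wav`, where the file names are placeholders.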

Notes

Streaming chunk sizes are encoded in att_context_size; all four latency modes ([70,13], [70,6], [70,1], [70,0]) are bundled in the same model.
This model is specialized for the telephony domain; the in-domain figures above do not indicate its accuracy on out-of-domain audio such as read speech.
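To relate the four att_context_size entries to latency, the right-hand value of each pair can be converted into look-ahead time. The sketch below assumes an 80 ms encoder frame (10 ms feature hop with 8× subsampling, the usual FastConformer setup); the frame duration is not stated in this card, so check it against the bundled config.

```python
# [left_context, right_context] pairs in encoder frames, from the model card.
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]

# ASSUMPTION: 10 ms feature hop x 8x subsampling = 80 ms per encoder frame.
ENCODER_FRAME_MS = 80


def lookahead_ms(att_context: list, frame_ms: int = ENCODER_FRAME_MS) -> int:
    """Look-ahead implied by the right context of one attention setting."""
    _left, right = att_context
    return right * frame_ms


for ctx in ATT_CONTEXT_SIZES:
    print(ctx, lookahead_ms(ctx), "ms look-ahead")
```

Under that assumption the four modes correspond to 1040 ms, 480 ms, 80 ms and 0 ms of look-ahead. Recent NeMo releases expose `model.encoder.set_default_att_context_size(...)` for choosing one of the bundled modes at inference time; confirm that the installed version provides it before relying on it.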

License

Apache-2.0 (model weights). Training data is internal.
