Khmer Telephony ASR — v9 allsynth-lsall (epoch 7)
Streaming FastConformer–RNNT (BPE) acoustic model for Khmer call-center / telephony speech: 8 kHz G.711-transcoded audio, upsampled and fed to the model at 16 kHz. This release is fine-tuned on a large synthetic Khmer corpus combined with the full set of manually annotated telephony utterances; the telephony utterances are heavily oversampled so that the telephony domain dominates each epoch.
In-domain performance
Evaluated on the full ~7,600-utterance telephony set.
| Metric | Value |
|---|---|
| CER (no-space) | 4.24% |
| CER (raw, spaces counted) | 4.36% |
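| Utterances with per-utterance CER < 10% | 6 743 (88.6%) |
| Utterances with per-utterance CER 10–30% | 466 (6.1%) |
| Utterances with per-utterance CER 30–50% | 135 (1.8%) |
| Utterances with per-utterance CER 50–80% | 152 (2.0%) |
| Utterances with per-utterance CER ≥ 80% | 114 (1.5%) |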
Note: this run trained on the full telephony set, so the figures above overlap the training data and are not a held-out generalization estimate. They characterize in-domain fit, not out-of-sample accuracy.
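The per-utterance buckets in the table above can be reproduced from a list of per-utterance CER values with a few lines of Python. This is a minimal sketch, not the evaluation script used for this release: the function name `bucket_counts` is hypothetical, and treating each upper bound as exclusive is an assumption (the card does not say which side of a boundary an utterance at exactly 10% falls on).

```python
from collections import Counter

# (lower bound inclusive, upper bound exclusive, label); CER expressed as a fraction.
# The boundary convention is an assumption, not taken from the model card.
BUCKETS = [
    (0.00, 0.10, "< 10%"),
    (0.10, 0.30, "10–30%"),
    (0.30, 0.50, "30–50%"),
    (0.50, 0.80, "50–80%"),
    (0.80, float("inf"), "≥ 80%"),
]


def bucket_counts(cers):
    """Count how many per-utterance CER values fall into each bucket."""
    counts = Counter()
    for value in cers:
        for low, high, label in BUCKETS:
            if low <= value < high:
                counts[label] += 1
                break
    # Return the buckets in table order, including empty ones.
    return {label: counts[label] for _, _, label in BUCKETS}


if __name__ == "__main__":
    # Hypothetical per-utterance CERs, for illustration only.
    print(bucket_counts([0.0, 0.05, 0.12, 0.45, 1.5]))
```

CER can exceed 100% when a hypothesis contains many insertions, which is why the last bucket is open-ended. The counts in the table sum to 7 610 utterances.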
WER on Khmer is a poor metric because the language has no whitespace word boundaries — CER is the meaningful number.
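To make the two CER variants concrete, here is a minimal sketch of character error rate computed as Levenshtein distance over reference length, with and without whitespace. It is an illustration under assumptions, not the scoring code behind the table: it counts Unicode code points (the card does not state whether code points or grapheme clusters were scored), and the names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp, strip_spaces=False):
    """Character error rate of hyp against ref.

    strip_spaces=True removes all whitespace first ("no-space" CER);
    strip_spaces=False counts spaces as characters ("raw" CER).
    """
    if strip_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference is empty after preprocessing")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = "ខ្ញុំ ទៅ ផ្សារ"   # reference with spaces
    hyp = "ខ្ញុំទៅផ្សារ"     # same characters, spaces dropped
    print("raw CER:     ", cer(ref, hyp))
    print("no-space CER:", cer(ref, hyp, strip_spaces=True))
```

In the usage example the hypothesis differs from the reference only in spacing, so the no-space CER is zero while the raw CER is not. This is the gap between the two rows of the table.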
Model details
- Architecture: FastConformer encoder (17 layers, d_model 512, 8 heads, causal downsampling, chunked-limited attention [[70,13],[70,6],[70,1],[70,0]] for streaming) + RNN-T decoder + BPE/Unigram tokenizer (vocab 2 048).
- Init: from an in-house Khmer FastConformer-RNNT base checkpoint.
- Fine-tuning data: a large synthetic Khmer telephony corpus mixed with the full set of ~4 000 manually-annotated short telephony utterances, the telephony portion oversampled (×15) so it forms a large share of each epoch.
- Augmentation (training only):
- G.711 transcode (prob 0.5)
- speed perturbation 0.9×–1.1× (prob 0.3)
- environmental noise mix, SNR 15–25 dB (prob 0.5)
- gain −3/+6 dBFS (prob 0.5)
- Epoch: 7 (val_wer 0.1840 — see note on WER above).
- Optimizer: AdamW, Noam-Annealing, peak LR 0.5, warmup 200 steps, weight decay 1e-3.
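Two of the augmentations listed above, the noise mix and the gain, can be sketched in plain Python to show what the probabilities and dB ranges mean. This is an illustrative sketch only: the actual pipeline is defined in the training config, the function names are hypothetical, reading "gain −3/+6 dBFS" as a uniformly sampled gain between −3 and +6 dB is an assumption, and so is the order in which the two steps are applied. G.711 transcoding and speed perturbation are left out.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of float samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that speech power / noise power equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        return list(speech)  # nothing meaningful to scale against
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def apply_gain(samples, gain_db):
    """Scale amplitude by gain_db decibels (amplitude ratio = 10^(dB/20))."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


def augment(speech, noise, rng):
    """Apply each augmentation independently with the probabilities from the list."""
    out = list(speech)
    if rng.random() < 0.5:  # environmental noise mix, SNR 15-25 dB
        out = mix_at_snr(out, noise, rng.uniform(15.0, 25.0))
    if rng.random() < 0.5:  # gain; the uniform -3..+6 dB range is an assumed reading
        out = apply_gain(out, rng.uniform(-3.0, 6.0))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [math.sin(0.1 * i) for i in range(1600)]      # stand-in for speech
    noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]  # stand-in for noise
    print(len(augment(speech, noise, rng)))
```

Because each step fires independently with probability 0.5, about a quarter of the utterances pass through these two steps unchanged.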
Files
| Path | What it is |
|---|---|
| asr-khmer-telephony.nemo / epoch7.nemo | Canonical NeMo bundle (encoder + decoder + joint + tokenizer + config). Load via EncDecRNNTBPEModel.restore_from(...). |
| checkpoints/*.ckpt | Original Lightning checkpoint (~1.4 GB) for resume / surgery. |
| conf/khmer_finetune_allsynth_lsall.yaml | Full training config (architecture + augmentation + optim). |
| tokenizer/{tokenizer.model,tokenizer.vocab,vocab.txt} | SentencePiece tokenizer (also embedded in .nemo). |
Usage
from nemo.collections.asr.models import EncDecRNNTBPEModel
model = EncDecRNNTBPEModel.restore_from("asr-khmer-telephony.nemo")
text = model.transcribe(["call_segment_8k.wav"])
print(text)
Input audio: 16 kHz mono PCM (8 kHz telephony should be upsampled to 16 kHz — the training augmentation simulates the G.711 distortion at 16 kHz).
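Since the model expects 16 kHz mono PCM, it can help to check files before passing them to transcribe. The sketch below uses only the Python standard library; `check_wav` is a hypothetical helper, not part of NeMo or of this repository. Python's `wave` module reads uncompressed PCM only, so a WAV file that still carries G.711 (A-law or mu-law) payload is reported as unreadable rather than decoded.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of problems; an empty list means the header matches the expected input."""
    try:
        with wave.open(path, "rb") as wf:
            channels = wf.getnchannels()
            rate = wf.getframerate()
    except wave.Error as exc:
        # Raised for non-PCM WAV files, e.g. ones that store G.711 directly.
        return [f"not readable as uncompressed PCM WAV: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(
            f"expected {expected_rate} Hz, found {rate} Hz; resample before transcribing"
        )
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py call_segment.wav [more.wav ...]
    for wav_path in sys.argv[1:]:
        issues = check_wav(wav_path)
        print(wav_path, "OK" if not issues else "; ".join(issues))
```

The check only inspects the header. It does not resample, so 8 kHz recordings still need to be converted to 16 kHz with an audio tool of your choice.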
Notes
- Streaming chunk sizes are encoded in att_context_size; 4 latency modes are bundled in the same model.
- This model is domain-specialized for telephony; out-of-domain read-speech accuracy is not representative.
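As a rough guide to what the four latency modes mean in milliseconds, the att_context_size pairs listed under Model details can be converted to lookahead and chunk durations. The sketch assumes each pair is [left, right] context in encoder frames, that one encoder frame covers 80 ms (10 ms feature hop with 8× downsampling, the usual FastConformer setup), and that a chunk spans right + 1 frames. None of these values is stated in this card, so check them against the bundled config before relying on the numbers.

```python
# Values taken from the architecture description in this card.
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]

# Assumptions (not stated in the card): 10 ms feature hop, 8x encoder downsampling.
FEATURE_HOP_MS = 10
DOWNSAMPLING = 8
ENCODER_FRAME_MS = FEATURE_HOP_MS * DOWNSAMPLING  # 80 ms per encoder frame


def latency_modes(att_context_sizes, frame_ms=ENCODER_FRAME_MS):
    """Convert [left, right] contexts in encoder frames to durations in milliseconds."""
    modes = []
    for left, right in att_context_sizes:
        modes.append({
            "att_context_size": [left, right],
            "left_context_ms": left * frame_ms,
            "lookahead_ms": right * frame_ms,
            "chunk_ms": (right + 1) * frame_ms,  # assumed: chunk = lookahead + current frame
        })
    return modes


if __name__ == "__main__":
    for mode in latency_modes(ATT_CONTEXT_SIZES):
        print(mode)
```

Under these assumptions the modes correspond to lookaheads of 1040, 480, 80 and 0 ms. In NeMo, multi-lookahead streaming encoders are typically switched with `model.encoder.set_default_att_context_size(...)`; whether this checkpoint supports that call unchanged has not been verified here.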
License
Apache-2.0 (model weights). Training data is internal.