FastConformer-RNNT-BPE-Streaming-Khmer
Cache-aware streaming FastConformer RNN-T model for Khmer ASR, trained with NVIDIA NeMo. This is the v9b checkpoint.
- Architecture: FastConformer encoder + RNN-T decoder & joint (no CTC head)
- Tokenizer: SentencePiece unigram BPE, vocab size 4096, trained on the v9 Khmer no-space training manifest filtered to 0.1–10.0 s
- Sample rate: 16 kHz
- Streaming: cache-aware, multi-latency
- Best checkpoint: val_wer = 0.1616 @ epoch 108 (no-space internal val set)
Files
- FastConformer-RNNT-BPE-Streaming-Khmer.nemo / epoch108.nemo: single-file NeMo bundle (weights + tokenizer + config). The two files have identical contents; the unversioned name is the canonical entry point. Load with NeMo.
- checkpoints/FastConformer-RNNT-BPE-Streaming-Khmer-v9b--val_wer=0.1616-epoch=108.ckpt: Lightning checkpoint, useful for resuming training.
- conf/khmer_finetune_v9b.yaml: training config for v9b.
- tokenizer/: standalone copy of the SentencePiece BPE tokenizer (tokenizer.model, tokenizer.vocab, vocab.txt).
Usage
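```python
import nemo.collections.asr as nemo_asr

# Download the model from the Hub and restore it from the .nemo bundle.
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "actableai/fastconformer-rnnt-bpe-streaming-khmer"
)

# Offline (non-streaming) transcription of a list of audio files.
transcripts = model.transcribe(["audio.wav"])
print(transcripts)
```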
For cache-aware streaming inference, see NeMo's
examples/asr/asr_cache_aware_streaming/ scripts.
Training
- Init: encoder weights warm-started from the v9 .nemo; decoder + joint reinitialized because the tokenizer changed (vocab 4096 BPE).
- Hardware: 4 × A100 40 GB, bf16.
- Loss: RNN-T (warp-rnnt, FastEmit λ = 0.002).
- Data window: 0.1–10.0 s, sampled from in-house Khmer manifests with whitespace removed from transcripts.
- Augmentation pipeline (each step has its own probability): speed perturbation (0.77–1.00) → noise injection (env, 5–20 dB SNR) → babble injection (Khmer, 0.4–1.5 s segments) → G.711 transcode → gain (−20 to 0 dBFS).
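The data window and whitespace removal listed above can be sketched as a small manifest filter. This is a minimal illustration, not the script used for training: it assumes the standard NeMo JSON-lines manifest fields (`audio_filepath`, `duration`, `text`), and it assumes the window bounds are inclusive, which the card does not state. The function names are hypothetical.

```python
import json

# Duration window from the card; inclusive bounds are an assumption.
MIN_DURATION_S = 0.1
MAX_DURATION_S = 10.0


def prepare_entry(entry):
    """Return a cleaned copy of a manifest entry, or None if it falls outside the window."""
    duration = float(entry["duration"])
    if not (MIN_DURATION_S <= duration <= MAX_DURATION_S):
        return None
    cleaned = dict(entry)
    # str.split() with no argument splits on any run of whitespace,
    # so joining the pieces removes all whitespace from the transcript.
    cleaned["text"] = "".join(entry["text"].split())
    return cleaned


def filter_manifest_lines(lines):
    """Filter JSON-lines manifest rows and return the kept rows as JSON strings."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cleaned = prepare_entry(json.loads(line))
        if cleaned is not None:
            # ensure_ascii=False keeps Khmer text readable in the output file.
            kept.append(json.dumps(cleaned, ensure_ascii=False))
    return kept
```

To process a file, read it line by line, pass the lines to `filter_manifest_lines`, and write the returned strings to a new manifest, one per line.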
Evaluation
Internal validation (no-space transcripts), top epochs:
| epoch | val_wer |
|---|---|
| 108 | 0.1616 (selected) |
| 112 | 0.1633 |
| 110 | 0.1644 |
| 106 | 0.1669 |
External test sets (numbers from epoch 106, the closest epoch with a clean external eval):
| set | n | CER (raw) | CER (no-space) | WER |
|---|---|---|---|---|
| FLEURS test (Khmer) | 771 | 0.296 | 0.274 | 0.999 |
| Label Studio Metfone (in-house calls) | 1820 | 0.394 | 0.372 | 0.839 |
The model is trained without spaces, so word-level WER on space-segmented references (FLEURS) is not meaningful; use CER (no-space) as the primary external metric.
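To make the difference between the three columns concrete, here is a minimal sketch of raw CER, no-space CER, and WER built on a plain Levenshtein distance. It is an illustration only: the exact normalization behind the numbers in the table is not documented here, so the whitespace handling below is an assumption, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp, strip_spaces=False):
    """Character error rate; optionally remove all whitespace from both sides first."""
    if strip_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference is empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference has no words")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# A space-segmented reference against a whitespace-free hypothesis.
reference = "ab cd ef"
hypothesis = "abcdef"
print(cer(reference, hypothesis))                     # raw CER: 0.25
print(cer(reference, hypothesis, strip_spaces=True))  # no-space CER: 0.0
print(wer(reference, hypothesis))                     # WER: 1.0
```

In this toy case every character is correct, yet WER is 1.0 because the single unsegmented hypothesis token matches none of the reference words. The same effect pushes WER toward 1.0 on space-segmented references such as FLEURS.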
Limitations
- Trained on a mix of in-house Khmer call data and synthetic speech. Quality varies on out-of-domain audio, code-switched speech, or non-call noise profiles.
- Output is whitespace-free; downstream consumers that need word-level segmentation must add a separate segmenter.
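As an illustration of what a separate segmenter might look like, the sketch below applies greedy longest-match segmentation against a word list. It is not part of this model or of NeMo: the function name and the three-word lexicon are hypothetical placeholders, and a production system would more likely use a trained Khmer word segmenter with a full dictionary.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right longest-match segmentation.

    At each position the longest lexicon entry that matches is taken.
    If nothing matches, a single code point is emitted as its own token,
    which can split a Khmer cluster; real segmenters handle this better.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
        else:
            # No lexicon entry starts here: fall back to one code point.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon for illustration only (hypothetical, far from complete).
words = ["ខ្ញុំ", "ទៅ", "សាលា"]
TOY_LEXICON = set(words)

# Simulate whitespace-free model output by concatenating the words.
model_output = "".join(words)
print(" ".join(segment_longest_match(model_output, TOY_LEXICON)))
```

Greedy longest-match is simple and fast but makes errors on ambiguous boundaries and out-of-vocabulary words, so treat it as a baseline rather than a recommended solution.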
License
Released under CC-BY-4.0.