Khmer ASR Encoder Benchmark and Pretrained Checkpoints
This repository contains pretrained checkpoints and benchmarking results used to identify the most suitable speech encoder backbone for a Khmer Automatic Speech Recognition (ASR) system.
The goal of this project was to compare how well different pretrained speech representations transfer to Khmer ASR before investing in large-scale fine-tuning.
Overview
Three widely used pretrained speech encoders were evaluated:
- Whisper Base (openai/whisper-base)
- WavLM Base
- Wav2Vec2 Base
To ensure a fair comparison, each encoder was:
- Initialized from publicly available pretrained weights.
- Frozen during training.
- Connected to a lightweight CTC classification head.
- Trained and evaluated using the same Khmer speech datasets.
The resulting validation CTC loss was used as the primary evaluation metric.
Training Methodology
Phase 1: Frozen Encoder Evaluation
For each backbone:
- The encoder parameters were frozen.
- A randomly initialized CTC head was attached.
- Only the CTC head was trained.
- Validation CTC loss was used to assess encoder quality.
This approach evaluates how useful the pretrained speech representations are for Khmer ASR without any encoder fine-tuning.
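The frozen-encoder setup described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the training code used for the benchmark: `StandInEncoder` is a hypothetical placeholder for a pretrained encoder, the CTC head is assumed to be a single linear layer, `[BLANK]` is assumed to sit at index 0, and the batch is random data. Only `hidden_size = 512` and the 108-token vocabulary come from this README.

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 512  # from the selected-backbone metadata
VOCAB_SIZE = 108   # from the Vocabulary section
BLANK_ID = 0       # assumption: [BLANK] is token index 0


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for a pretrained encoder: (B, T, 80) -> (B, T, hidden)."""

    def __init__(self, feat_dim=80, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_size)

    def forward(self, feats):
        return torch.tanh(self.proj(feats))


class FrozenEncoderCTC(nn.Module):
    """Frozen encoder followed by a trainable linear CTC head (assumed head shape)."""

    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # freeze every encoder weight
        self.encoder.eval()
        self.head = nn.Linear(hidden_size, vocab_size)  # randomly initialized

    def forward(self, feats):
        with torch.no_grad():  # no gradients flow into the encoder
            hidden = self.encoder(feats)
        return self.head(hidden).log_softmax(dim=-1)  # (B, T, V)


def ctc_step(model, optimizer, feats, feat_lens, targets, target_lens):
    """Run one optimization step on the CTC head and return the loss value."""
    criterion = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)
    log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (T, B, V)
    loss = criterion(log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = FrozenEncoderCTC(StandInEncoder(), HIDDEN_SIZE, VOCAB_SIZE)
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)  # head only

# Random stand-in batch: 2 utterances, 50 frames, 80-dim features.
feats = torch.randn(2, 50, 80)
feat_lens = torch.tensor([50, 40])
targets = torch.randint(1, VOCAB_SIZE, (2, 10))  # labels never use the blank id
target_lens = torch.tensor([10, 7])

loss = ctc_step(model, optimizer, feats, feat_lens, targets, target_lens)
print(f"CTC loss after one step: {loss:.4f}")
```

Passing only `model.head.parameters()` to the optimizer, together with `requires_grad = False` on the encoder, keeps the comparison focused on the quality of the pretrained representations rather than on how well each encoder adapts.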
Results
| Encoder | Validation CTC Loss |
|---|---|
| Whisper Base | 0.5663 |
| WavLM Base | 0.6031 |
| Wav2Vec2 Base | 0.7836 |
Ranking
🥇 Whisper Base: 0.5663
🥈 WavLM Base: 0.6031
🥉 Wav2Vec2 Base: 0.7836
Based on these experiments, Whisper Base achieved the lowest validation CTC loss, indicating the most transferable frozen representations for Khmer ASR among the three encoders, and was selected as the backbone for subsequent training stages.
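To put the differences in perspective, the short calculation below ranks the encoders and expresses each loss relative to the best one. The loss values are copied from the Results table; the relative-gap measure is only an illustrative way to read them, not a metric reported by the benchmark. With these numbers, WavLM Base is about 6.5% above Whisper Base and Wav2Vec2 Base about 38.4% above.

```python
# Validation CTC losses copied from the Results table.
val_ctc_loss = {
    "Whisper Base": 0.5663,
    "WavLM Base": 0.6031,
    "Wav2Vec2 Base": 0.7836,
}

# Rank encoders from lowest (best) to highest validation loss.
ranking = sorted(val_ctc_loss, key=val_ctc_loss.get)
best = ranking[0]

# Relative gap of each encoder to the best one, in percent.
relative_gap_pct = {
    name: 100.0 * (loss - val_ctc_loss[best]) / val_ctc_loss[best]
    for name, loss in val_ctc_loss.items()
}

for name in ranking:
    print(f"{name}: loss={val_ctc_loss[name]:.4f}, gap={relative_gap_pct[name]:.1f}%")
```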
Selected Backbone
{
"best_backbone_key": "whisper-base",
"best_backbone_id": "openai/whisper-base",
"hidden_size": 512,
"checkpoint_path": "./whisper-base_best.pt"
}
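As a rough illustration of how this metadata can be consumed, the sketch below parses it and estimates the size of the CTC head. It assumes the head is a single linear projection from `hidden_size` to the 108-token vocabulary; the README only describes the head as lightweight, so the real layer layout may differ. The helper `linear_head_param_count` is a hypothetical name introduced here.

```python
import json

# Metadata copied from the Selected Backbone section.
metadata = json.loads("""
{
  "best_backbone_key": "whisper-base",
  "best_backbone_id": "openai/whisper-base",
  "hidden_size": 512,
  "checkpoint_path": "./whisper-base_best.pt"
}
""")

VOCAB_SIZE = 108  # from the Vocabulary section


def linear_head_param_count(hidden_size: int, vocab_size: int) -> int:
    """Weights plus biases of one linear projection (an assumed head shape)."""
    return hidden_size * vocab_size + vocab_size


head_params = linear_head_param_count(metadata["hidden_size"], VOCAB_SIZE)
print(metadata["best_backbone_id"], head_params)
```

Under that assumption the head has 512 × 108 + 108 = 55,404 trainable parameters, which is small compared with the frozen encoder.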
Best Checkpoint
whisper-base_best.pt
Training Datasets
The benchmark combined the following Khmer speech datasets.
1. Khmer GRKPP Speech
Dataset:
seanghay/khmer_grkpp_speech
2. KM Speech Corpus
Dataset:
seanghay/km-speech-corpus
3. FLEURS + OpenSLR42 + MPWT
Dataset:
KrorngAI/fleurs_openslr42_mpwt
For balanced experimentation, up to 3,000 samples per dataset were used during backbone evaluation.
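The per-dataset cap can be sketched as below. The README does not say whether the subset was taken randomly or from the start of each dataset, so seeded random sampling is an assumption here, and the dataset sizes are hypothetical placeholders rather than the real corpus sizes. `cap_indices` is an illustrative helper, not a function from this repository.

```python
import random

MAX_SAMPLES_PER_DATASET = 3000  # cap stated in the text above


def cap_indices(num_examples: int, cap: int = MAX_SAMPLES_PER_DATASET, seed: int = 0) -> list[int]:
    """Return sorted indices of at most `cap` examples, sampled without replacement."""
    if num_examples <= cap:
        return list(range(num_examples))  # small datasets are used in full
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return sorted(rng.sample(range(num_examples), cap))


# Hypothetical dataset sizes, for illustration only (not the real sizes).
example_sizes = {
    "seanghay/khmer_grkpp_speech": 12000,
    "seanghay/km-speech-corpus": 2500,
    "KrorngAI/fleurs_openslr42_mpwt": 40000,
}

selected = {name: cap_indices(n) for name, n in example_sizes.items()}
for name, idx in selected.items():
    print(name, len(idx))
```

Capping each source separately prevents the largest corpus from dominating the comparison, at the cost of discarding data that a full training run would use.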
Vocabulary
The model uses a character-level Khmer vocabulary containing:
- Khmer consonants
- Khmer vowels
- Khmer diacritics
- Khmer numerals
- Khmer punctuation marks
- Special CTC tokens
Special Tokens
[BLANK]
[PAD]
[MASK]
[UNK]
Vocabulary Size
108 tokens
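A character-level vocabulary of this kind can be built roughly as follows. This is a sketch only: the ordering of the special tokens, the helper names, and the two example transcripts are assumptions, and the resulting toy vocabulary is not the 108-token vocabulary shipped with the checkpoint.

```python
SPECIAL_TOKENS = ["[BLANK]", "[PAD]", "[MASK]", "[UNK]"]  # order is an assumption


def build_vocab(transcripts):
    """Assign ids to the special tokens first, then to every distinct character."""
    chars = sorted({ch for text in transcripts for ch in text})
    return {token: idx for idx, token in enumerate(SPECIAL_TOKENS + chars)}


def encode(text, vocab):
    """Map text to ids one Unicode code point at a time; unseen characters become [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(ch, unk_id) for ch in text]


def decode(ids, vocab):
    """Map ids back to text, dropping special tokens."""
    inverse = {idx: token for token, idx in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in SPECIAL_TOKENS)


# Toy transcripts for illustration only.
transcripts = ["សួស្តី", "អរគុណ"]
vocab = build_vocab(transcripts)
ids = encode(transcripts[0], vocab)
print(ids, decode(ids, vocab))
```

Working at the code-point level means consonants, dependent vowels, and diacritics each receive their own id, which matches the categories listed above and keeps the CTC output layer small.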
Repository Contents
.
├── whisper-base_best.pt
├── README.md
└── benchmark metadata
Experimental Goal
This repository is not intended to be a production-ready ASR system.
Instead, it provides:
- A Khmer ASR encoder benchmark
- Pretrained CTC evaluation checkpoints
- A comparison between Whisper, WavLM, and Wav2Vec2 representations
- A foundation for future Khmer ASR research and fine-tuning
Key Findings
- Whisper Base achieved the lowest validation CTC loss.
- WavLM Base performed competitively and ranked second.
- Wav2Vec2 Base showed weaker transferability on the evaluated Khmer datasets.
- Frozen encoder evaluation provides an efficient way to compare speech backbones before expensive full-model training.
Future Work
Planned improvements include:
- Full encoder fine-tuning
- Larger Khmer speech datasets
- Language model integration
- Beam search decoding
- Character Error Rate (CER) evaluation
- Word Error Rate (WER) evaluation
- Deployment-ready Khmer ASR models
Citation
If you use this repository in your research, please cite:
@misc{uk2026khmerasrbenchmark,
title={Khmer ASR Encoder Benchmark and Pretrained Checkpoints},
author={Uk, Panhapich},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/Panhapich/pre-trained_whisper_wavLM}
}
Author
Panhapich Uk
Independent research project focused on:
- Khmer Automatic Speech Recognition (ASR)
- Speech representation learning
- Low-resource language technologies
- Self-supervised speech models
License
This project is released under the MIT License.