Khmer ASR Encoder Benchmark and Pretrained Checkpoints

This repository contains pretrained checkpoints and benchmarking results used to identify the most suitable speech encoder backbone for a Khmer Automatic Speech Recognition (ASR) system.

The goal of this project was to compare how well different pretrained speech representations transfer to Khmer ASR before investing in large-scale fine-tuning.

Overview

Three widely used pretrained speech encoders were evaluated:

Whisper Base (openai/whisper-base)
WavLM Base
Wav2Vec2 Base

To ensure a fair comparison, each encoder was:

Initialized from publicly available pretrained weights.
Frozen during training.
Connected to a lightweight CTC classification head.
Trained and evaluated using the same Khmer speech datasets.

The resulting validation CTC loss (lower is better) was used as the primary evaluation metric.

Training Methodology

Phase 1: Frozen Encoder Evaluation

For each backbone:

The encoder parameters were frozen.
A randomly initialized CTC head was attached.
Only the CTC head was trained.
Validation CTC loss was used to assess encoder quality.

This approach evaluates how useful the pretrained speech representations are for Khmer ASR without any encoder fine-tuning.
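The frozen-encoder setup described above can be sketched in PyTorch as follows. This is a minimal illustration, not the training script used for the benchmark: `StandInEncoder` is a hypothetical placeholder for a pretrained backbone, and the 80-dimensional input features, blank index 0, optimizer, and learning rate are all assumptions. Only the hidden size (512) and vocabulary size (108) come from this README.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class StandInEncoder(nn.Module):
    """Hypothetical placeholder for a pretrained speech encoder."""

    def __init__(self, feat_dim=80, hidden_size=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_size)

    def forward(self, feats):
        # (batch, frames, feat_dim) -> (batch, frames, hidden_size)
        return torch.tanh(self.proj(feats))


class FrozenEncoderCTC(nn.Module):
    """Frozen encoder followed by a trainable linear CTC head."""

    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # freeze the backbone
        self.head = nn.Linear(hidden_size, vocab_size)  # randomly initialized

    def forward(self, feats):
        with torch.no_grad():  # no gradients flow into the encoder
            hidden = self.encoder(feats)
        return self.head(hidden).log_softmax(dim=-1)


model = FrozenEncoderCTC(StandInEncoder(), hidden_size=512, vocab_size=108)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Only the head's parameters are handed to the optimizer (settings assumed).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is assumed

# Random toy batch: 2 utterances, 50 frames each, 10 target characters each.
feats = torch.randn(2, 50, 80)
targets = torch.randint(1, 108, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

log_probs = model(feats)  # (batch, frames, vocab)
# nn.CTCLoss expects log-probabilities shaped (frames, batch, vocab).
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoder never receives gradients, a comparison made this way reflects the quality of the pretrained representations plus whatever a single linear layer can learn on top of them.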

Results

Encoder	Validation CTC Loss
Whisper Base	0.5663
WavLM Base	0.6031
Wav2Vec2 Base	0.7836

Ranking

🥇 Whisper Base — 0.5663

🥈 WavLM Base — 0.6031

🥉 Wav2Vec2 Base — 0.7836

In these experiments, Whisper Base reached the lowest validation CTC loss of the three frozen encoders, which suggests it provides the most transferable speech representations for Khmer ASR. It was therefore selected as the backbone for subsequent training stages.
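To put the gaps between encoders in perspective, the short sketch below ranks the reported losses and computes each encoder's relative gap to the best one. The numbers are copied from the results table; the relative-gap measure is only an illustrative choice, not a metric used in the benchmark.

```python
# Validation CTC losses as reported in the results table.
results = {
    "Whisper Base": 0.5663,
    "WavLM Base": 0.6031,
    "Wav2Vec2 Base": 0.7836,
}

# Sort ascending, since a lower CTC loss is better.
ranking = sorted(results.items(), key=lambda item: item[1])
best_name, best_loss = ranking[0]

# Relative gap to the best encoder, as a fraction of the best loss.
relative_gap = {name: (loss - best_loss) / best_loss for name, loss in ranking}

for name, loss in ranking:
    print(f"{name}: loss={loss:.4f}, gap vs best={relative_gap[name]:.1%}")
```

By this measure, WavLM Base is roughly 6.5% above Whisper Base, while Wav2Vec2 Base is roughly 38% above it.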

Selected Backbone

{
  "best_backbone_key": "whisper-base",
  "best_backbone_id": "openai/whisper-base",
  "hidden_size": 512,
  "checkpoint_path": "./whisper-base_best.pt"
}
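A downstream script could read this metadata to size the CTC head. The sketch below parses the JSON shown above from an inline string; the helper name `load_backbone_config` and the validation rules are hypothetical, and the internal layout of the `.pt` checkpoint is not documented here, so loading it is left out.

```python
import json

# The metadata shown above, embedded inline for illustration.
metadata_json = """
{
  "best_backbone_key": "whisper-base",
  "best_backbone_id": "openai/whisper-base",
  "hidden_size": 512,
  "checkpoint_path": "./whisper-base_best.pt"
}
"""

REQUIRED_KEYS = {
    "best_backbone_key",
    "best_backbone_id",
    "hidden_size",
    "checkpoint_path",
}


def load_backbone_config(raw):
    """Parse the metadata and check the fields a training script would need."""
    cfg = json.loads(raw)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"missing metadata keys: {sorted(missing)}")
    if not isinstance(cfg["hidden_size"], int) or cfg["hidden_size"] <= 0:
        raise ValueError("hidden_size must be a positive integer")
    return cfg


config = load_backbone_config(metadata_json)
head_input_dim = config["hidden_size"]  # input width of a linear CTC head
```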

Best Checkpoint

whisper-base_best.pt

Training Datasets

The benchmark combined the following Khmer speech datasets.

1. Khmer GRKPP Speech

Dataset:

seanghay/khmer_grkpp_speech

2. KM Speech Corpus

Dataset:

seanghay/km-speech-corpus

3. FLEURS + OpenSLR42 + MPWT

Dataset:

KrorngAI/fleurs_openslr42_mpwt

For balanced experimentation, up to 3,000 samples per dataset were used during backbone evaluation.
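The per-dataset cap can be illustrated with a small helper. This README does not say whether the 3,000 samples were drawn at random or taken from the start of each dataset, so the seeded random sampling below is an assumption, and `cap_samples` and the toy datasets are hypothetical.

```python
import random


def cap_samples(samples, limit=3000, seed=0):
    """Return at most `limit` items, sampled reproducibly without replacement."""
    samples = list(samples)
    if len(samples) <= limit:
        return samples  # small datasets are kept whole
    return random.Random(seed).sample(samples, limit)


# Toy stand-ins for real datasets: integers in place of audio/text pairs.
toy_datasets = {"dataset_a": range(5000), "dataset_b": range(1200)}

capped = {name: cap_samples(items) for name, items in toy_datasets.items()}
combined = [(name, item) for name, items in capped.items() for item in items]
```

Capping each source before merging keeps a single large dataset from dominating the comparison.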

Vocabulary

The model uses a character-level Khmer vocabulary containing:

Khmer consonants
Khmer vowels
Khmer diacritics
Khmer numerals
Khmer punctuation marks
Special CTC tokens

Special Tokens

[BLANK]
[PAD]
[MASK]
[UNK]

Vocabulary Size

108 tokens
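A character-level vocabulary of this kind can be sketched as below. How the actual 108-token vocabulary was assembled is not described here, so deriving characters from transcripts, the ordering of the special tokens, and the helper names are all assumptions; the two sample transcripts are only toy inputs.

```python
# Special tokens listed above; placing them first in this order is an assumption.
SPECIAL_TOKENS = ["[BLANK]", "[PAD]", "[MASK]", "[UNK]"]


def build_vocab(transcripts):
    """Map special tokens, then every distinct character, to integer ids."""
    chars = sorted({ch for text in transcripts for ch in text})
    tokens = SPECIAL_TOKENS + chars
    return {token: idx for idx, token in enumerate(tokens)}


def encode(text, vocab):
    """Convert text to ids, one per Unicode code point, with [UNK] fallback."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(ch, unk_id) for ch in text]


sample_transcripts = ["សួស្តី", "អរគុណ"]  # toy Khmer transcripts
vocab = build_vocab(sample_transcripts)

ids = encode("សួស្តី", vocab)
unknown_ids = encode("x", vocab)  # a character outside the vocabulary
```

Working at the code-point level means Khmer consonants, dependent vowels, and diacritics each receive their own token rather than being grouped into syllables.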

Repository Contents

.
├── whisper-base_best.pt
├── README.md
└── benchmark metadata

Experimental Goal

This repository is not intended to be a production-ready ASR system.

Instead, it provides:

A Khmer ASR encoder benchmark
Pretrained CTC evaluation checkpoints
A comparison between Whisper, WavLM, and Wav2Vec2 representations
A foundation for future Khmer ASR research and fine-tuning

Key Findings

Whisper Base achieved the lowest validation CTC loss.
WavLM Base performed competitively and ranked second.
Wav2Vec2 Base showed weaker transferability on the evaluated Khmer datasets.
Frozen encoder evaluation provides an efficient way to compare speech backbones before expensive full-model training.

Future Work

Planned improvements include:

Full encoder fine-tuning
Larger Khmer speech datasets
Language model integration
Beam search decoding
Character Error Rate (CER) evaluation
Word Error Rate (WER) evaluation
Deployment-ready Khmer ASR models
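As a pointer for the planned CER and WER evaluation, both metrics can be computed from a Levenshtein edit distance over characters or words. The sketch below is a plain-Python illustration, not code from this repository, and all function names are hypothetical. Written Khmer does not normally separate words with spaces, so WER presupposes a word segmentation step that is not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character Error Rate over Unicode code points."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference_words, hypothesis_words):
    """Word Error Rate over pre-segmented word lists."""
    return error_rate(reference_words, hypothesis_words)
```

Unlike validation CTC loss, these metrics are computed on decoded text, so they measure what a user of the recognizer would actually see.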

Citation

If you use this repository in your research, please cite:

@misc{uk2026khmerasrbenchmark,
  title={Khmer ASR Encoder Benchmark and Pretrained Checkpoints},
  author={Uk, Panhapich},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Panhapich/pre-trained_whisper_wavLM}
}

Author

Panhapich Uk

Independent research project focused on:

Khmer Automatic Speech Recognition (ASR)
Speech representation learning
Low-resource language technologies
Self-supervised speech models

License

This project is released under the MIT License.
