🇰🇭 Khmer Speech Recognition — Wav2Vec2 XLS-R (50 h)

facebook/wav2vec2-xls-r-300m fine-tuned with a CTC objective on the first 50 hours of a Khmer speech corpus.

Model details

Property	Value
Base model	`facebook/wav2vec2-xls-r-300m`
Language	Khmer / ភាសាខ្មែរ (`km`)
Task	Automatic Speech Recognition (ASR)
Training data	First 50 hours of Khmer audio
Input sample rate	16 kHz, mono
Architecture	Wav2Vec2 + CTC head
Framework	🤗 Transformers

How to use

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# Load model
processor = Wav2Vec2Processor.from_pretrained("Vatho/wav2vec2-khmer-xls-r-50h")
model     = Wav2Vec2ForCTC.from_pretrained("Vatho/wav2vec2-khmer-xls-r-50h")
model.eval()

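# Load audio (must be 16 kHz mono WAV)
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:  # multi-channel file: average the channels down to mono
    audio = audio.mean(axis=1)

# Transcribe (passing the file's actual rate makes the processor raise if it is not 16 kHz)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")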
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
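
The argmax followed by batch_decode above amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. A minimal pure-Python sketch of that collapse step is shown here. The blank id of 0 and the toy vocabulary are assumptions for illustration only; the real ids come from the processor's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame argmax sequence: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive ids
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Hypothetical toy vocabulary (not the model's real vocabulary)
toy_vocab = {1: "ក", 2: "ខ", 3: "គ"}

# Per-frame ids as they might come out of argmax; 0 is the assumed blank
frames = [0, 1, 1, 0, 1, 2, 2, 0, 0, 3]
ids = ctc_greedy_collapse(frames)
text = "".join(toy_vocab[i] for i in ids)
print(ids, text)
```

The blank between the two runs of id 1 is what lets the same character appear twice in a row in the output; without it the two runs would merge into one.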

Text normalisation

Only characters in the Khmer Unicode block (U+1780–U+17FF) are kept. All punctuation, Latin characters, and extra whitespace are stripped before scoring.
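
The rule above could be sketched roughly as follows. This is not the exact script used for this model: the function name normalise_khmer is hypothetical, and it assumes that "extra whitespace" means runs of whitespace collapse to a single space with leading and trailing space trimmed. Note that a pure block filter keeps Khmer punctuation and digits that live inside U+1780–U+17FF (for example ។); whether those were also removed is not stated here.

```python
import re

# Anything that is neither in the Khmer block nor whitespace
_NON_KHMER = re.compile(r"[^\u1780-\u17FF\s]")
# Runs of one or more whitespace characters
_WHITESPACE = re.compile(r"\s+")


def normalise_khmer(text: str) -> str:
    """Keep Khmer-block characters only and tidy whitespace (hypothetical helper)."""
    khmer_only = _NON_KHMER.sub("", text)
    return _WHITESPACE.sub(" ", khmer_only).strip()


print(normalise_khmer("សួស្តី, Hello!  ពិភពលោក"))
```

Applying the same function to both the reference and the predicted transcript before scoring keeps the comparison consistent.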

Limitations

Trained on only 50 h of data, so accuracy may degrade on out-of-domain or noisy speech.
No language model rescoring is applied at decode time.
Best results on clean, 16 kHz recordings.

Training details

Setting	Value
Optimizer	AdamW
Base learning rate	3e-4
Batch size	16
Max steps	20,000
Warmup steps	2,000
CTC loss	✓
Early stopping	✓
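
To make the warmup and step settings concrete, here is a small sketch of how the learning rate would evolve under a linear warmup followed by linear decay to zero. The decay shape is an assumption: the table gives only the base rate, warmup steps, and max steps, and linear decay is simply a common default. The function name lr_at is hypothetical.

```python
BASE_LR = 3e-4        # from the table above
WARMUP_STEPS = 2_000  # from the table above
MAX_STEPS = 20_000    # from the table above


def lr_at(step: int) -> float:
    """Learning rate at a given step, assuming linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to BASE_LR over the warmup period
        return BASE_LR * step / WARMUP_STEPS
    # Assumed shape: decay linearly from BASE_LR to 0 at MAX_STEPS
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / (MAX_STEPS - WARMUP_STEPS)


for step in (0, 1_000, 2_000, 11_000, 20_000):
    print(step, lr_at(step))
```

With early stopping enabled, training may end before step 20,000, in which case the tail of this schedule is never reached.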

Citation

If you use this model, please cite the base model:

@article{babu2021xls,
  title     = {XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale},
  author    = {Babu, Arun and Wang, Changhan and Tjandra, Andros and others},
  journal   = {arXiv preprint arXiv:2111.09296},
  year      = {2021}
}
