This model is facebook/wav2vec2-xls-r-300m fine-tuned on the first 50 hours of a Khmer speech corpus for automatic speech recognition, using a CTC head.
| Property | Value |
|---|---|
| Base model | facebook/wav2vec2-xls-r-300m |
| Language | Khmer / ភាសាខ្មែរ (km) |
| Task | Automatic Speech Recognition (ASR) |
| Training data | First 50 hours of Khmer audio |
| Input sample rate | 16 kHz, mono |
| Architecture | Wav2Vec2 + CTC head |
| Framework | 🤗 Transformers |
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# Load the processor (feature extractor + tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained("Vatho/wav2vec2-khmer-xls-r-50h")
model = Wav2Vec2ForCTC.from_pretrained("Vatho/wav2vec2-khmer-xls-r-50h")
model.eval()

# Load audio (must be 16 kHz mono WAV)
audio, sr = sf.read("your_audio.wav", dtype="float32")
assert sr == 16000 and audio.ndim == 1, "expected 16 kHz mono audio"

# Transcribe: take the most likely token per frame, then let the
# processor collapse repeats and drop CTC blanks
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
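The `argmax` plus `batch_decode` steps above amount to greedy CTC decoding: pick the best token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse on plain integer ids. The function name `ctc_greedy_collapse` and the blank id of `0` are assumptions for illustration; the real blank id comes from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC).

    blank_id=0 is a placeholder assumption; check the tokenizer's pad
    token id for the actual value.
    """
    # groupby merges runs of identical consecutive ids; blanks are then
    # dropped, so a blank between two equal ids keeps both of them.
    return [tok for tok, _ in groupby(frame_ids) if tok != blank_id]


# Frames: blank, 5, 5, blank, 5, 7, 7, blank
print(ctc_greedy_collapse([0, 5, 5, 0, 5, 7, 7, 0]))
```

Note that the blank between the two runs of `5` is what allows a repeated label to survive the merge step.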
Only characters in the Khmer Unicode block (U+1780–U+17FF) are kept. All punctuation, Latin characters, and extra whitespace are stripped before scoring.
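A minimal sketch of this normalization, assuming a simple regex filter. The function name `normalize_khmer` is hypothetical, and two details are assumptions rather than the card's stated behavior: single spaces between words are preserved, and Khmer-block punctuation and digits (which also fall inside U+1780–U+17FF) are left in place.

```python
import re

# Anything that is neither in the Khmer block nor whitespace is removed.
_NON_KHMER = re.compile(r"[^\u1780-\u17FF\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_khmer(text: str) -> str:
    """Keep Khmer-block characters and collapse runs of whitespace."""
    text = _NON_KHMER.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


# Latin letters and ASCII punctuation are dropped, extra spaces collapsed.
sample = "Hello, \u1797\u17b6\u179f\u17b6  \u1781\u17d2\u1798\u17c2\u179a!"
print(normalize_khmer(sample))
```

Zero-width spaces (U+200B), which often appear in Khmer text, are outside the Khmer block and are not matched by `\s`, so this sketch removes them as well.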
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Base learning rate | 3e-4 |
| Batch size | 16 |
| Max steps | 20,000 |
| Warmup steps | 2,000 |
| CTC loss | β |
| Early stopping | β |
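To make the warmup setting concrete, the sketch below computes the learning rate at a given step from the values in the table. The table does not state how the rate changes after warmup, so the linear decay to zero used here is an assumption, and `lr_at` is a hypothetical helper name.

```python
BASE_LR = 3e-4        # from the table
WARMUP_STEPS = 2_000  # from the table
MAX_STEPS = 20_000    # from the table


def lr_at(step: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay (assumed)."""
    if step < WARMUP_STEPS:
        # Ramp from 0 up to BASE_LR over the warmup steps.
        return BASE_LR * step / WARMUP_STEPS
    # Assumed shape: decay linearly from BASE_LR to 0 at MAX_STEPS.
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / (MAX_STEPS - WARMUP_STEPS)


for step in (0, 1_000, 2_000, 11_000, 20_000):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 3e-4 at step 2,000 and is halfway back down at step 11,000.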
If you use this model, please cite the base model:
@article{babu2021xls,
title = {XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale},
author = {Babu, Arun and Wang, Changhan and Tjandra, Andros and others},
journal = {arXiv preprint arXiv:2111.09296},
year = {2021}
}