metadata
license: cc-by-nc-4.0
language:
  - km
  - khm
tags:
  - text-to-speech
  - khmer
  - mms
  - vits
  - transformers
pipeline_tag: text-to-audio
base_model: facebook/mms-tts-khm

Khmer TTS

This repository contains a Khmer text-to-speech model fine-tuned from facebook/mms-tts-khm.

The model is packaged in Hugging Face Transformers format and can be loaded with VitsModel and AutoTokenizer.

Files

  • model.safetensors - fine-tuned VITS model weights.
  • config.json, vocab.json, tokenizer files - model and tokenizer configuration.
  • examples/inference.py - minimal local inference script.
  • eval/benchmark/ - generated benchmark samples, review sheet, manifest, and timing summary.
  • training/ - training configuration and local wrapper used for this experiment.

Raw training audio is not included in this release directory.

Usage

pip install -r requirements.txt
python examples/inference.py --text "សួស្តីអ្នកទាំងអស់គ្នា" --output khmer_tts.wav

Or load the model directly:

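import torch
from scipy.io.wavfile import write
from transformers import AutoTokenizer, VitsModel

# Load the tokenizer and the fine-tuned VITS weights for this repository.
repo_id = "khmerttsopensource/khmer-tts"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = VitsModel.from_pretrained(repo_id)

# Tokenize the Khmer input text into PyTorch tensors.
text = "សួស្តីអ្នកទាំងអស់គ្នា"
inputs = tokenizer(text, return_tensors="pt")

# Synthesize without tracking gradients; the result is a 1-D float waveform.
with torch.no_grad():
    waveform = model(**inputs).waveform.squeeze().cpu().numpy()

# Save the audio at the model's configured sampling rate.
write("khmer_tts.wav", rate=model.config.sampling_rate, data=waveform)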
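
When given a floating-point array, scipy.io.wavfile.write stores the audio as a floating-point WAV, which some players and audio tools may not accept. The sketch below shows one way to save 16-bit PCM instead, using only the Python standard library. The helper name write_pcm16_wav is hypothetical and is not part of this repository; it assumes mono samples scaled to the range [-1.0, 1.0].

```python
import array
import sys
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; not shipped with this release.
    """
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        # Clip first so values slightly outside [-1, 1] do not overflow.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Example call with the variables from the snippet above (assumed in scope):
# write_pcm16_wav("khmer_tts_pcm16.wav", waveform, model.config.sampling_rate)
```

The loop runs in pure Python, so it is slow for long clips; a vectorized NumPy conversion would be faster if that matters.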

Evaluation

The included benchmark generated 50 samples.

| Metric | Value |
| --- | --- |
| Success count | 50 |
| Failure count | 0 |
| Failure rate | 0.0 |
| Mean generation time | 0.434978 seconds |
| Mean audio duration | 3.27936 seconds |
| Mean RTF | 0.136449 |
| Min RTF | 0.026531 |
| Max RTF | 0.289309 |

See eval/benchmark/review_sheet.csv for manual review fields and eval/benchmark/generated/ for generated WAV samples.
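
This README does not define RTF. The sketch below assumes the usual real-time factor: generation time divided by the duration of the generated audio, so values below 1.0 mean faster than real time. The function names and the timing values are hypothetical and are not taken from the benchmark files.

```python
from statistics import mean


def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def summarize_rtf(timings):
    """Summarize per-sample RTF for (generation_seconds, audio_seconds) pairs."""
    rtfs = [real_time_factor(gen, audio) for gen, audio in timings]
    return {"mean_rtf": mean(rtfs), "min_rtf": min(rtfs), "max_rtf": max(rtfs)}


# Hypothetical timings for three samples, not values from this benchmark.
timings = [(0.30, 3.0), (0.50, 2.5), (0.45, 4.5)]
summary = summarize_rtf(timings)
print(summary)
```

Averaging per-sample RTF generally gives a different number than dividing the mean generation time by the mean audio duration. The two reported means give roughly 0.133, slightly below the reported mean RTF of 0.136449, which is what a per-sample average would produce; that reading is an inference, not something the benchmark summary states.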

Training Summary

  • Base model: facebook/mms-tts-khm
  • Epochs: 2
  • Batch size: 2
  • Sample rate: 16000
  • Training seed: 987

Limitations

This is an experimental single-speaker Khmer TTS model. Review pronunciation, naturalness, and text fidelity before production use. The benchmark samples are generated examples, not a full safety or quality evaluation.

License

This release uses cc-by-nc-4.0, matching the non-commercial license of the base MMS Khmer TTS model. Confirm that any downstream use complies with the base model license and the rights for the fine-tuning data.

Citation

If you use this model, cite the MMS work:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
  journal={arXiv},
  year={2023}
}