---
license: cc-by-nc-4.0
language:
- km
- khm
tags:
- text-to-speech
- khmer
- mms
- vits
- transformers
pipeline_tag: text-to-audio
base_model: facebook/mms-tts-khm
---

# Khmer TTS

This repository contains a Khmer text-to-speech model fine-tuned from `facebook/mms-tts-khm`. The model is packaged in Hugging Face Transformers format and can be loaded with `VitsModel` and `AutoTokenizer`.

## Files

- `model.safetensors` - fine-tuned VITS model weights.
- `config.json`, `vocab.json`, tokenizer files - model and tokenizer configuration.
- `examples/inference.py` - minimal local inference script.
- `eval/benchmark/` - generated benchmark samples, review sheet, manifest, and timing summary.
- `training/` - training configuration and local wrapper used for this experiment.

Raw training audio is not included in this release directory.

## Usage

```bash
pip install -r requirements.txt
python examples/inference.py --text "សួស្តីអ្នកទាំងអស់គ្នា" --output khmer_tts.wav
```

Or load the model directly:

```python
import torch
from scipy.io.wavfile import write
from transformers import AutoTokenizer, VitsModel

repo_id = "khmerttsopensource/khmer-tts"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = VitsModel.from_pretrained(repo_id)

text = "សួស្តីអ្នកទាំងអស់គ្នា"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform.squeeze().cpu().numpy()

write("khmer_tts.wav", rate=model.config.sampling_rate, data=waveform)
```

## Evaluation

The included benchmark generated 50 samples.

| Metric | Value |
| --- | ---: |
| Success count | 50 |
| Failure count | 0 |
| Failure rate | 0.0 |
| Mean generation time | 0.434978 seconds |
| Mean audio duration | 3.27936 seconds |
| Mean RTF | 0.136449 |
| Min RTF | 0.026531 |
| Max RTF | 0.289309 |

See `eval/benchmark/review_sheet.csv` for manual review fields and `eval/benchmark/generated/` for generated WAV samples.

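The RTF rows above can be read with the following minimal sketch, which assumes the common definition of real-time factor as generation time divided by the duration of the generated audio. The benchmark script's exact definition is not stated here, and the `timings` values and the `real_time_factor` helper are hypothetical illustrations, not benchmark data or code from this repository.

```python
from statistics import mean


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Hypothetical (generation time, audio duration) pairs in seconds.
# These are illustrative values only, not taken from eval/benchmark/.
timings = [(0.30, 3.0), (0.50, 2.0), (0.40, 4.0)]

# Averaging per-sample RTF values...
per_sample_rtf = [real_time_factor(g, a) for g, a in timings]
mean_rtf = mean(per_sample_rtf)

# ...is not the same as dividing mean generation time by mean audio duration.
ratio_of_means = mean(g for g, _ in timings) / mean(a for _, a in timings)

print(f"mean of per-sample RTF: {mean_rtf:.4f}")   # 0.1500
print(f"ratio of means:         {ratio_of_means:.4f}")  # 0.1333
```

Under this assumed definition, an RTF below 1 means audio is produced faster than it plays back. Dividing the table's mean generation time by its mean audio duration gives roughly 0.1326 rather than the reported 0.136449, which would be consistent with the mean RTF being averaged over per-sample ratios.
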
## Training Summary

- Base model: `facebook/mms-tts-khm`
- Epochs: `2`
- Batch size: `2`
- Sample rate: `16000`
- Training seed: `987`

## Limitations

This is an experimental single-speaker Khmer TTS model. Review pronunciation, naturalness, and text fidelity before production use. The benchmark samples are generated examples, not a full safety or quality evaluation.

## License

This release uses `cc-by-nc-4.0`, matching the non-commercial license of the base MMS Khmer TTS model. Confirm that any downstream use complies with the base model license and the rights for the fine-tuning data.

## Citation

If you use this model, cite the MMS work:

```bibtex
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
  journal={arXiv},
  year={2023}
}
```