---
license: cc-by-nc-4.0
language:
- km
- khm
tags:
- text-to-speech
- khmer
- mms
- vits
- transformers
pipeline_tag: text-to-audio
base_model: facebook/mms-tts-khm
---

# Khmer TTS

This repository contains a Khmer text-to-speech model fine-tuned from `facebook/mms-tts-khm`.

The model is packaged in Hugging Face Transformers format and can be loaded with `VitsModel` and `AutoTokenizer`.

## Files

- `model.safetensors` - fine-tuned VITS model weights.
- `config.json`, `vocab.json`, tokenizer files - model and tokenizer configuration.
- `examples/inference.py` - minimal local inference script.
- `eval/benchmark/` - generated benchmark samples, review sheet, manifest, and timing summary.
- `training/` - training configuration and local wrapper used for this experiment.

Raw training audio is not included in this release directory.

## Usage

```bash
pip install -r requirements.txt
python examples/inference.py --text "សួស្តីអ្នកទាំងអស់គ្នា" --output khmer_tts.wav
```

Or load the model directly:

```python
import torch
from scipy.io.wavfile import write
from transformers import AutoTokenizer, VitsModel

repo_id = "khmerttsopensource/khmer-tts"

# Load the tokenizer and the fine-tuned VITS weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = VitsModel.from_pretrained(repo_id)

text = "សួស្តីអ្នកទាំងអស់គ្នា"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    waveform = model(**inputs).waveform.squeeze().cpu().numpy()

# Save at the model's configured sampling rate.
write("khmer_tts.wav", rate=model.config.sampling_rate, data=waveform)
```
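
`scipy.io.wavfile.write` stores a float32 array as a 32-bit floating-point WAV, which some players and audio tools do not accept. As a minimal sketch, assuming the waveform values lie roughly in [-1.0, 1.0], the helper below writes 16-bit PCM using only the standard library. The name `write_pcm16_wav` is hypothetical and is not part of this repository.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples as a 16-bit PCM WAV file.

    `samples` is any iterable of floats; values outside [-1.0, 1.0]
    are clipped (an assumption about the expected input range).
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With the snippet above, this could be called as `write_pcm16_wav("khmer_tts.wav", waveform.tolist(), model.config.sampling_rate)` in place of the `write(...)` call.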

## Evaluation

The included benchmark generated 50 samples.

| Metric | Value |
| --- | ---: |
| Success count | 50 |
| Failure count | 0 |
| Failure rate | 0.0 |
| Mean generation time | 0.434978 seconds |
| Mean audio duration | 3.27936 seconds |
| Mean RTF | 0.136449 |
| Min RTF | 0.026531 |
| Max RTF | 0.289309 |

See `eval/benchmark/review_sheet.csv` for manual review fields and `eval/benchmark/generated/` for generated WAV samples.
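
The table does not define RTF. A common definition of real-time factor is generation time divided by audio duration, so values below 1 mean synthesis runs faster than playback. The sketch below assumes that definition and uses made-up timings, not the benchmark data; `real_time_factor` is a hypothetical helper. It also shows that the mean of per-sample RTFs need not equal mean generation time divided by mean audio duration, which is consistent with the table (0.434978 / 3.27936 ≈ 0.1326, against a reported mean RTF of 0.136449).

```python
from statistics import mean


def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical (generation_seconds, audio_seconds) pairs, not benchmark data.
timings = [(0.30, 3.0), (0.50, 2.5), (0.45, 4.5)]

per_sample_rtf = [real_time_factor(gen, audio) for gen, audio in timings]

# Mean of per-sample ratios: every sample counts equally.
mean_rtf = mean(per_sample_rtf)

# Ratio of means: longer clips weigh more, so the result can differ.
ratio_of_means = mean(gen for gen, _ in timings) / mean(audio for _, audio in timings)

print(f"mean RTF: {mean_rtf:.4f}, ratio of means: {ratio_of_means:.4f}")
```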

## Training Summary

- Base model: `facebook/mms-tts-khm`
- Epochs: `2`
- Batch size: `2`
- Sample rate: `16000`
- Training seed: `987`

## Limitations

This is an experimental single-speaker Khmer TTS model. Review pronunciation, naturalness, and text fidelity before production use. The benchmark samples are generated examples, not a full safety or quality evaluation.

## License

This release uses `cc-by-nc-4.0`, matching the non-commercial license of the base MMS Khmer TTS model. Confirm that any downstream use complies with the base model license and the rights for the fine-tuning data.

## Citation

If you use this model, cite the MMS work:

```bibtex
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
  journal={arXiv},
  year={2023}
}
```