	---
	license: cc-by-nc-4.0
	language:
	- km
	- khm
	tags:
	- text-to-speech
	- khmer
	- mms
	- vits
	- transformers
	pipeline_tag: text-to-audio
	base_model: facebook/mms-tts-khm
	---

	# Khmer TTS

	This repository contains a Khmer text-to-speech model fine-tuned from `facebook/mms-tts-khm`.

	The model is packaged in Hugging Face Transformers format and can be loaded with `VitsModel` and `AutoTokenizer`.

	## Files

	- `model.safetensors` - fine-tuned VITS model weights.
	- `config.json`, `vocab.json`, tokenizer files - model and tokenizer configuration.
	- `examples/inference.py` - minimal local inference script.
	- `eval/benchmark/` - generated benchmark samples, review sheet, manifest, and timing summary.
	- `training/` - training configuration and local wrapper used for this experiment.

	Raw training audio is not included in this release directory.
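
After cloning or downloading the repository, a quick check that the listed files are present can save a confusing load error later. The following is a minimal sketch, not part of the release: the helper name `missing_release_files` is hypothetical, and only files named explicitly in the list above are checked.

```python
from pathlib import Path

# File names taken from the list above. Tokenizer files other than
# vocab.json are not enumerated in this README, so they are not checked.
EXPECTED_FILES = (
    "model.safetensors",
    "config.json",
    "vocab.json",
    "examples/inference.py",
)


def missing_release_files(root):
    """Return the expected release files that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_release_files(".")` from the repository root should return an empty list if the checkout is complete.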

	## Usage

	```bash
	pip install -r requirements.txt
	python examples/inference.py --text "សួស្តីអ្នកទាំងអស់គ្នា" --output khmer_tts.wav
	```

	Or load the model directly:

	```python
	import torch
	from scipy.io.wavfile import write
	from transformers import AutoTokenizer, VitsModel

	repo_id = "khmerttsopensource/khmer-tts"
	tokenizer = AutoTokenizer.from_pretrained(repo_id)
	model = VitsModel.from_pretrained(repo_id)

	text = "សួស្តីអ្នកទាំងអស់គ្នា"
	inputs = tokenizer(text, return_tensors="pt")

	with torch.no_grad():
	    waveform = model(**inputs).waveform.squeeze().cpu().numpy()

	write("khmer_tts.wav", rate=model.config.sampling_rate, data=waveform)
	```
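
The snippet above passes a floating-point array to `scipy.io.wavfile.write`, which stores it as a floating-point WAV file. Some players and audio tools expect 16-bit PCM instead. The sketch below shows one way to write 16-bit PCM using only the Python standard library. The helper name `write_pcm16_wav` is hypothetical, and it assumes mono samples in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to the signed 16-bit range.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono output
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With the variables from the loading example, this could be called as `write_pcm16_wav("khmer_tts.wav", waveform.tolist(), model.config.sampling_rate)`.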

	## Evaluation

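	The included benchmark generated 50 samples, all successfully. RTF is the real-time factor: generation time divided by the duration of the generated audio, so values below 1 indicate faster-than-real-time synthesis.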

	| Metric | Value |
	| --- | ---: |
	| Success count | 50 |
	| Failure count | 0 |
	| Failure rate | 0.0 |
	| Mean generation time | 0.434978 seconds |
	| Mean audio duration | 3.27936 seconds |
	| Mean RTF | 0.136449 |
	| Min RTF | 0.026531 |
	| Max RTF | 0.289309 |

	See `eval/benchmark/review_sheet.csv` for manual review fields and `eval/benchmark/generated/` for generated WAV samples.
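
The reported mean RTF (0.136449) is slightly higher than the ratio of the two means in the table (0.434978 / 3.27936 ≈ 0.1326). That gap is what you would expect if RTF is computed per sample and then averaged, although this README does not state how the benchmark script computes it. The sketch below illustrates the per-sample calculation under that assumption. The function names and the timing values are hypothetical and are not taken from the benchmark files.

```python
from statistics import mean


def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def summarize_rtf(timings):
    """Summarize per-sample (generation_seconds, audio_seconds) pairs."""
    rtfs = [real_time_factor(gen, dur) for gen, dur in timings]
    return {"mean_rtf": mean(rtfs), "min_rtf": min(rtfs), "max_rtf": max(rtfs)}


# Hypothetical timings for illustration only.
example = [(0.2, 2.0), (0.6, 3.0), (0.3, 6.0)]
summary = summarize_rtf(example)

# Ratio of the means, which generally differs from the mean of the ratios.
ratio_of_means = mean(g for g, _ in example) / mean(d for _, d in example)
```

Short clips with a relatively high per-sample RTF pull the mean of ratios above the ratio of means, which is one plausible reason the two figures in the table differ.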

	## Training Summary

	- Base model: `facebook/mms-tts-khm`
	- Epochs: `2`
	- Batch size: `2`
	- Sample rate: `16000` Hz
	- Training seed: `987`

	## Limitations

	This is an experimental single-speaker Khmer TTS model. Review pronunciation, naturalness, and text fidelity before production use. The benchmark samples are generated examples, not a full safety or quality evaluation.

	## License

	This release uses `cc-by-nc-4.0`, matching the non-commercial license of the base MMS Khmer TTS model. Confirm that any downstream use complies with the base model license and the rights for the fine-tuning data.

	## Citation

	If you use this model, cite the MMS work:

	```bibtex
	@article{pratap2023mms,
	  title={Scaling Speech Technology to 1,000+ Languages},
	  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
	  journal={arXiv},
	  year={2023}
	}
	```