---
license: cc-by-nc-4.0
language:
  - km
  - khm
tags:
  - text-to-speech
  - khmer
  - mms
  - vits
  - transformers
pipeline_tag: text-to-audio
base_model: facebook/mms-tts-khm
---

# Khmer TTS

This repository contains a Khmer text-to-speech model fine-tuned from `facebook/mms-tts-khm`.

The model is packaged in Hugging Face Transformers format and can be loaded with `VitsModel` and `AutoTokenizer`.

## Files

- `model.safetensors` - fine-tuned VITS model weights.
- `config.json`, `vocab.json`, tokenizer files - model and tokenizer configuration.
- `examples/inference.py` - minimal local inference script.
- `eval/benchmark/` - generated benchmark samples, review sheet, manifest, and timing summary.
- `training/` - training configuration and local wrapper used for this experiment.

Raw training audio is not included in this release directory.

## Usage

```bash
pip install -r requirements.txt
python examples/inference.py --text "សួស្តីអ្នកទាំងអស់គ្នា" --output khmer_tts.wav
```

Or load the model directly:

```python
import torch
from scipy.io.wavfile import write
from transformers import AutoTokenizer, VitsModel

# Load the fine-tuned tokenizer and model from the Hub.
repo_id = "khmerttsopensource/khmer-tts"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = VitsModel.from_pretrained(repo_id)

text = "សួស្តីអ្នកទាំងអស់គ្នា"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    # The output waveform has a batch dimension; squeeze it to a 1-D array.
    waveform = model(**inputs).waveform.squeeze().cpu().numpy()

# Save at the sampling rate stored in the model config.
write("khmer_tts.wav", rate=model.config.sampling_rate, data=waveform)
```
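
The snippet above passes floating-point samples straight to the WAV writer. Some players and downstream tools expect 16-bit PCM instead, so a conversion step may be useful. The following is a minimal sketch using only the standard library, assuming the waveform is mono with values roughly in the range [-1.0, 1.0]; `write_pcm16` is a hypothetical helper name and is not part of this repository.

```python
import sys
import wave
from array import array


def write_pcm16(path, samples, sampling_rate):
    """Write mono float samples as a 16-bit PCM WAV file.

    Hypothetical helper: assumes samples are floats nominally in [-1.0, 1.0].
    Values outside that range are clipped. Returns the number of frames written.
    """
    pcm = array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

With the variables from the snippet above, a call could look like `write_pcm16("khmer_tts_pcm16.wav", waveform.tolist(), model.config.sampling_rate)`.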

## Evaluation

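The included benchmark synthesized 50 samples with no failures. RTF (real-time factor) is generation time relative to audio duration, so values below 1 indicate faster-than-real-time synthesis.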

| Metric | Value |
| --- | ---: |
| Success count | 50 |
| Failure count | 0 |
| Failure rate | 0.0 |
| Mean generation time | 0.434978 seconds |
| Mean audio duration | 3.27936 seconds |
| Mean RTF | 0.136449 |
| Min RTF | 0.026531 |
| Max RTF | 0.289309 |

See `eval/benchmark/review_sheet.csv` for manual review fields and `eval/benchmark/generated/` for generated WAV samples.
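
For readers who want to reproduce the timing figures, the sketch below shows one way the RTF statistics could be computed from per-sample timings. It assumes the conventional definition (generation time divided by audio duration) and per-sample averaging; the benchmark's exact formula is not stated in this card, and `real_time_factor` and `summarize_rtf` are hypothetical names. The reported mean RTF (0.136449) differs slightly from the ratio of the two means (0.434978 / 3.27936 ≈ 0.1326), which would be consistent with averaging per-sample ratios.

```python
from statistics import mean


def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


def summarize_rtf(timings):
    """Summarize RTF over (generation_seconds, audio_seconds) pairs.

    Hypothetical helper: averages the per-sample ratios rather than
    dividing total generation time by total audio duration.
    """
    rtfs = [real_time_factor(gen, audio) for gen, audio in timings]
    return {
        "mean_rtf": mean(rtfs),
        "min_rtf": min(rtfs),
        "max_rtf": max(rtfs),
    }
```

The input pairs would come from whatever timing records the benchmark stores; their file name and column layout are not assumed here.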

## Training Summary

- Base model: `facebook/mms-tts-khm`
- Epochs: `2`
- Batch size: `2`
- Sample rate: `16000`
- Training seed: `987`

## Limitations

This is an experimental single-speaker Khmer TTS model. Review pronunciation, naturalness, and text fidelity before production use. The benchmark samples are generated examples, not a full safety or quality evaluation.

## License

This release uses `cc-by-nc-4.0`, matching the non-commercial license of the base MMS Khmer TTS model. Confirm that any downstream use complies with the base model license and the rights for the fine-tuning data.

## Citation

If you use this model, cite the MMS work:

```bibtex
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and Tjandra, Andros and Shi, Bowen and Tomasello, Paden and Babu, Arun and Kundu, Sayani and Elkahky, Ali and Ni, Zhaoheng and Vyas, Apoorv and Fazel-Zarandi, Maryam and Adi, Yossi and Zhang, Xiaohui and Hsu, Wei-Ning and Conneau, Alexis and Auli, Michael},
  journal={arXiv},
  year={2023}
}
```