---
license: cc-by-nc-sa-4.0
datasets:
- speechcolab/gigaspeech
language:
- th
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
tags:
- flow-matching
- f5-tts
- thai
- finetuning
---
|
|
<p align="center">
  <img src="assets/ThonburianTTSLogo.png" width="400"/><br>
  <img src="assets/looloo-logo.png" width="150" />
</p>
|
|
|
|
|
|
|
|
[Model Checkpoints](https://huggingface.co/biodatlab/ThonburianTTS) | [Gradio Demo](https://github.com/biodatlab/thonburian-tts/blob/main/gradio_app.py) | [ThonburianTTS Paper](https://ieeexplore.ieee.org/document/11320472) | [Colab Notebook](https://colab.research.google.com/drive/1vIwNMjsyILluNT0l7I8KduS7S2Bhj9ra?usp=sharing) | [GitHub](https://github.com/biodatlab/thonburian-tts)
|
|
|
|
|
## **Thonburian TTS** |
|
|
|
|
|
**Thonburian TTS** is a **Thai text-to-speech (TTS)** engine built on top of [F5-TTS](https://github.com/SWivid/F5-TTS).
It generates **natural and expressive Thai speech** using **flow matching**, a diffusion-style generative modeling technique, and can **mimic reference voices** from short audio samples. The system supports:
|
|
|
|
|
- **Thai language generation** (`language="th"`) |
|
|
- **Reference-based voice cloning** using short audio clips |
|
|
- **High-quality synthesis** with controllable speed and silence trimming
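Flow matching trains a network to predict a velocity field that transports noise toward data (here, mel-spectrogram frames later decoded by the vocoder); synthesis integrates that field as an ODE. The following toy Euler sampler uses a stand-in velocity field purely to illustrate the sampling loop — in F5-TTS the velocity comes from a trained transformer, not a closed-form expression:

```python
import numpy as np

def euler_sample(velocity_fn, x0, steps=32):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Stand-in velocity field that pulls samples toward a fixed "data" point.
# (Illustrative only; the real model predicts this direction per step.)
target = np.full(4, 2.0)
v = lambda x, t: target - x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
sample = euler_sample(v, noise)  # ends much closer to `target` than `noise`
```

More integration steps trade speed for fidelity; flow-matching models like F5-TTS typically need far fewer steps than classic diffusion samplers.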
|
|
|
|
|
|
|
|
## **Model Checkpoints** |
|
|
|
|
|
| Model Component | Description                             | URL                                                                      |
| --------------- | --------------------------------------- | ------------------------------------------------------------------------ |
| **F5-TTS Thai** | Flow matching-based Thai TTS models     | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaF5)  |
| **F5-TTS IPA**  | Flow matching-based Thai-IPA TTS models | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaIPA) |
|
|
|
|
|
|
|
|
## **Quick Usage** |
|
|
|
|
|
### **Installation** |
|
|
|
|
|
Install dependencies: |
|
|
|
|
|
```bash
pip install torch cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```
|
|
|
|
|
### **Clone the Repository**

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```
|
|
|
|
|
#### **Loading Thai-Script-Based Models**

```py
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
import torch

# Configure the F5-TTS model
model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://biodatlab/ThonburianTTS/megaF5/mega_f5_last.safetensors",
    vocab_file="hf://biodatlab/ThonburianTTS/megaF5/mega_vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Basic audio settings
audio_config = AudioConfig(
    silence_threshold=-45,  # trim output audio below -45 dB
    cfg_strength=2.5,       # classifier-free guidance strength
    speed=1.0,              # speech rate multiplier
)

pipeline = FlowTTSPipeline(model_config, audio_config)
```
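`silence_threshold=-45` controls silence trimming of the generated audio at a decibel cutoff. A self-contained numpy sketch of frame-level dB-threshold trimming conveys the idea (illustrative only; the pipeline's internal implementation may differ):

```python
import numpy as np

def trim_silence(audio, threshold_db=-45.0, frame=512):
    """Drop leading/trailing frames whose RMS level falls below threshold_db."""
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))  # RMS level in dB
    keep = np.where(db > threshold_db)[0]
    if keep.size == 0:
        return audio[:0]  # everything is silence
    return audio[keep[0] * frame : (keep[-1] + 1) * frame]

# 0.5 s of silence, a 1 s 440 Hz tone, then 0.5 s of silence at 16 kHz
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
padded = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
trimmed = trim_silence(padded)  # roughly the tone, padding removed
```

Lowering the threshold (e.g. `-60`) keeps quieter tails; raising it trims more aggressively.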
|
|
|
|
|
|
|
|
#### **Loading IPA-Based Models**

```py
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
import torch

# Configure the F5-TTS model
model_config = ModelConfig(
    model_type="F5",
    checkpoint="hf://biodatlab/ThonburianTTS/megaIPA/model_last_prune.safetensors",
    vocab_file="hf://biodatlab/ThonburianTTS/megaIPA/mega_vocab_ipa.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Basic audio settings
audio_config = AudioConfig(
    silence_threshold=-45,  # trim output audio below -45 dB
    cfg_strength=2.5,       # classifier-free guidance strength
    speed=1.0,              # speech rate multiplier
)

pipeline = FlowTTSPipeline(model_config, audio_config)
```
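The `cfg_strength=2.5` setting follows the usual classifier-free guidance recipe: the model is evaluated with and without conditioning, and the conditional direction is amplified. A minimal numpy sketch of the combination rule (illustrative only; the actual velocities come from the trained F5-TTS network):

```python
import numpy as np

def apply_cfg(v_cond, v_uncond, cfg_strength):
    """Classifier-free guidance: extrapolate toward the conditional direction."""
    return v_uncond + cfg_strength * (v_cond - v_uncond)

# Toy velocities standing in for conditional / unconditional model outputs
v_cond = np.array([1.0, 2.0])
v_uncond = np.zeros(2)
guided = apply_cfg(v_cond, v_uncond, 2.5)  # -> array([2.5, 5.0])
```

A strength of 1.0 reduces to the plain conditional output; values above 1.0 strengthen adherence to the text at some cost to naturalness if pushed too far.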
|
|
|
|
|
## **Example Outputs** |
|
|
|
|
|
<table>
  <tr>
    <td align="center">
      <a href="https://youtu.be/rvmNgh0-jws">
        <img src="https://img.youtube.com/vi/rvmNgh0-jws/0.jpg" width="320"><br>
        Sample 1 – Single-Speaker Thai Normal Text
      </a>
    </td>
    <td align="center">
      <a href="https://youtu.be/jVz3EpRTn1U">
        <img src="https://img.youtube.com/vi/jVz3EpRTn1U/0.jpg" width="320"><br>
        Sample 2 – Single-Speaker Thai Code-Mixed Text
      </a>
    </td>
    <td align="center">
      <a href="https://youtu.be/sbaOdMhz3Z4">
        <img src="https://img.youtube.com/vi/sbaOdMhz3Z4/0.jpg" width="320"><br>
        Sample 3 – Multi-Speaker Conversational Speech
      </a>
    </td>
  </tr>
</table>
|
|
|
|
|
--- |
|
|
|
|
|
## **Developers** |
|
|
|
|
|
- [Looloo Technology](https://loolootech.com/) |
|
|
- [Biomedical and Data Lab, Mahidol University](https://biodatlab.github.io/) |
|
|
|
|
|
<p align="center">
  <img width="150px" src="assets/looloo-logo.png" />
</p>
|
|
|
|
|
|
|
|
## **Citation** |
|
|
|
|
|
If you use **ThonburianTTS** in your research, please cite: |
|
|
|
|
|
```bibtex
@INPROCEEDINGS{11320472,
  author={Aung, Thura and Sriwirote, Panyut and Thavornmongkol, Thanachot and Pipatsrisawat, Knot and Achakulvisut, Titipat and Aung, Zaw Htet},
  booktitle={2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  title={ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech},
  year={2025},
  pages={1-6},
  keywords={Adaptation models;Codes;Accuracy;Error analysis;Phonetics;Robustness;Natural language processing;Text to speech;Noise measurement;Research and development;Thai text-to-speech;Flow matching;F5-TTS},
  doi={10.1109/iSAI-NLP66160.2025.11320472}
}
```
|
|
|
|
|
```
Thura Aung, Panyut Sriwirote, Thanachot Thavornmongkol, Knot Pipatsrisawat, Titipat Achakulvisut, Zaw Htet Aung, "ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech", 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Phuket, Thailand, 2025, pp. 1-6, doi: 10.1109/iSAI-NLP66160.2025.11320472.
```
|
|
|
|
|
## **License** |
|
|
|
|
|
The **models** are released under the [Creative Commons Attribution Non-Commercial ShareAlike 4.0 License (CC BY-NC-SA 4.0)](LICENSE-CC-BY-NC-SA). |
|
|
|
|
|
## Acknowledgement |
|
|
We would like to acknowledge the NSTDA Supercomputer Center (ThaiSC), project \#pv824003, for providing computing resources for this work.
|
|
|