| --- |
| license: cc |
| datasets: |
| - speechcolab/gigaspeech |
| language: |
| - th |
| base_model: |
| - SWivid/F5-TTS |
| pipeline_tag: text-to-speech |
| tags: |
| - flow-matching |
| - f5-tts |
| - thai |
| - finetuning |
| --- |
| <p align="center"> |
| <img src="assets/ThonburianTTSLogo.png" width="400"/><br> |
| <img src="assets/looloo-logo.png" width="150" /> |
| </p> |
|
|
|
|
| [🔊 Model Checkpoints](https://huggingface.co/biodatlab/ThonburianTTS) | [🤗 Gradio Demo](https://github.com/biodatlab/thonburian-tts/blob/main/gradio_app.py) | [📄 ThonburianTTS Paper](https://ieeexplore.ieee.org/document/11320472) | [Colab Notebook](https://colab.research.google.com/drive/1vIwNMjsyILluNT0l7I8KduS7S2Bhj9ra?usp=sharing) | [GitHub](https://github.com/biodatlab/thonburian-tts) |
|
|
| ## **Thonburian TTS** |
|
|
| **Thonburian TTS** is a **Thai Text-to-Speech (TTS)** engine built on top of the [F5-TTS](https://github.com/SWivid/F5-TTS). |
| It generates **natural and expressive Thai speech** by leveraging **Flow-Matching diffusion techniques** and can **mimic reference voices** from short audio samples. The system supports: |
|
|
| - **Thai language generation** (`language="th"`) |
| - **Reference-based voice cloning** using short audio clips |
| - High-quality synthesis with controllable speed and silence trimming |
|
|
|
|
| ## **Model Checkpoints** |
|
|
| | Model Component | Description | URL | |
| | ---------------------- | ---------------------------------- | ---------------------------------------------------------------------------- | |
| | **F5-TTS Thai** | Flow Matching-based Thai TTS models | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaF5) | |
| | **F5-TTS IPA** | Flow Matching-based Thai-IPA TTS models | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaIPA) | |
|
|
|
|
| ## **Quick Usage** |
|
|
| ### **Installation** |
|
|
| Install dependencies: |
|
|
| ```bash |
| pip install torch cached-path librosa transformers f5-tts |
| sudo apt install ffmpeg |
| ``` |
|
|
| ### **Clone GitHub** |
|
|
| ``` |
| git clone https://github.com/biodatlab/thonburian-tts.git |
| cd thonburian-tts |
| ``` |
|
|
| #### **Loading Thai Script based Models** |
| ```py |
| from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig |
| import torch |
| |
| # Configure F5-TTS model |
| model_config = ModelConfig( |
| language="th", |
| model_type="F5", |
| checkpoint="hf://biodatlab/ThonburianTTS/megaF5/mega_f5_last.safetensors", |
| vocab_file="hf://biodatlab/ThonburianTTS/megaF5/mega_vocab.txt", |
| vocoder="vocos", |
| device="cuda" if torch.cuda.is_available() else "cpu" |
| ) |
| |
| # Basic audio settings |
| audio_config = AudioConfig( |
| silence_threshold=-45, |
| cfg_strength=2.5, |
| speed=1.0 |
| ) |
| |
| pipeline = FlowTTSPipeline(model_config, audio_config) |
| ``` |
|
|
|
|
| #### **Loading IPA based Models** |
| ```py |
| from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig |
| import torch |
| |
| # Configure F5-TTS model |
| model_config = ModelConfig( |
| model_type="F5", |
| checkpoint="hf://biodatlab/ThonburianTTS/megaIPA/model_last_prune.safetensors", |
| vocab_file="hf://biodatlab/ThonburianTTS/megaIPA/mega_vocab_ipa.txt", |
| vocoder="vocos", |
| device="cuda" if torch.cuda.is_available() else "cpu" |
| ) |
| |
| # Basic audio settings |
| audio_config = AudioConfig( |
| silence_threshold=-45, |
| cfg_strength=2.5, |
| speed=1.0 |
| ) |
| |
| pipeline = FlowTTSPipeline(model_config, audio_config) |
| ``` |
|
|
| ## **Example Outputs** |
|
|
| <table> |
| <tr> |
| <td align="center"> |
| <a href="https://youtu.be/rvmNgh0-jws"> |
| <img src="https://img.youtube.com/vi/rvmNgh0-jws/0.jpg" width="320"><br> |
| 🎵 Sample 1 – Single-speaker Thai Normal Text |
| </a> |
| </td> |
| <td align="center"> |
| <a href="https://youtu.be/jVz3EpRTn1U"> |
| <img src="https://img.youtube.com/vi/jVz3EpRTn1U/0.jpg" width="320"><br> |
| 🎵 Sample 2 – Single-Speaker Thai Code-mixed Text |
| </a> |
| </td> |
| <td align="center"> |
| <a href="https://youtu.be/sbaOdMhz3Z4"> |
| <img src="https://img.youtube.com/vi/sbaOdMhz3Z4/0.jpg" width="320"><br> |
| 🎵 Sample 3 – Multi-Speaker Conversational Speech |
| </a> |
| </td> |
| </tr> |
| </table> |
| |
| --- |
|
|
| ## **Developers** |
|
|
| - [Looloo Technology](https://loolootech.com/) |
| - [Biomedical and Data Lab, Mahidol University](https://biodatlab.github.io/) |
|
|
| <p align="center"> |
| <img width="150px" src="assets/looloo-logo.png" /> |
| </p> |
|
|
|
|
| ## **Citation** |
|
|
| If you use **ThonburianTTS** in your research, please cite: |
|
|
| ``` |
| @INPROCEEDINGS{11320472, |
| author={Aung, Thura and Sriwirote, Panyut and Thavornmongkol, Thanachot and Pipatsrisawat, Knot and Achakulvisut, Titipat and Aung, Zaw Htet}, |
| booktitle={2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)}, |
| title={ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech}, |
| year={2025}, |
| volume={}, |
| number={}, |
| pages={1-6}, |
| keywords={Adaptation models;Codes;Accuracy;Error analysis;Phonetics;Robustness;Natural language processing;Text to speech;Noise measurement;Research and development;Thai text-to-speech;Flow matching;F5-TTS}, |
| doi={10.1109/iSAI-NLP66160.2025.11320472}} |
| ``` |
|
|
| ``` |
| Thura Aung, Panyut Sriwirote, Thanachot Thavornmongkol, Knot Pipatsrisawat, Titipat Achakulvisut, Zaw Htet Aung, "ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech", 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Phuket, Thailand, 2025, pp. 1-6, doi: 10.1109/iSAI-NLP66160.2025.11320472. |
| ``` |
|
|
| ## **License** |
|
|
| The **models** are released under the [Creative Commons Attribution Non-Commercial ShareAlike 4.0 License (CC BY-NC-SA 4.0)](LICENSE-CC-BY-NC-SA). |
|
|
| ## Acknowledgement |
| We would like to acknowledge NSTDA Supercomputer Center (ThaiSC) project \#pv824003 for providing computing resources for this work. |
|
|