	---
	license: cc-by-nc-sa-4.0
	datasets:
	- speechcolab/gigaspeech
	language:
	- th
	base_model:
	- SWivid/F5-TTS
	pipeline_tag: text-to-speech
	tags:
	- flow-matching
	- f5-tts
	- thai
	- finetuning
	---
	<p align="center">
	<img src="assets/ThonburianTTSLogo.png" width="400"/><br>
	<img src="assets/looloo-logo.png" width="150" />
	</p>


	[🔊 Model Checkpoints](https://huggingface.co/biodatlab/ThonburianTTS) | [🤗 Gradio Demo](https://github.com/biodatlab/thonburian-tts/blob/main/gradio_app.py) | [📄 ThonburianTTS Paper](https://ieeexplore.ieee.org/document/11320472) | [Colab Notebook](https://colab.research.google.com/drive/1vIwNMjsyILluNT0l7I8KduS7S2Bhj9ra?usp=sharing) | [GitHub](https://github.com/biodatlab/thonburian-tts)

	## Thonburian TTS

	Thonburian TTS is a Thai text-to-speech (TTS) engine built on top of [F5-TTS](https://github.com/SWivid/F5-TTS).
	It uses flow matching to generate natural, expressive Thai speech and can clone a reference voice from a short audio sample. The system supports:

	- Thai language generation (`language="th"`)
	- Reference-based voice cloning using short audio clips
	- High-quality synthesis with controllable speed and silence trimming
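
To make the silence-trimming feature above more concrete, here is a minimal sketch of threshold-based trimming in NumPy. It is an illustration only, not the code used by Thonburian TTS: the function name `trim_silence`, the frame length, and the reading of the threshold as a dBFS level on frame RMS are all assumptions made for this example.

```python
import numpy as np


def trim_silence(wave: np.ndarray, threshold_db: float = -45.0,
                 frame_len: int = 256) -> np.ndarray:
    """Drop leading and trailing frames whose RMS level is below threshold_db.

    Hypothetical helper for illustration; the real pipeline may trim differently.
    """
    n_frames = len(wave) // frame_len
    if n_frames == 0:
        return wave
    # Split the signal into non-overlapping frames and measure each frame's level.
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))
    loud = np.flatnonzero(level_db > threshold_db)
    if loud.size == 0:
        return wave[:0]  # everything is below the threshold
    start = loud[0] * frame_len
    end = (loud[-1] + 1) * frame_len
    return wave[start:end]


# Toy signal at 16 kHz: 0.5 s of silence, 1 s of a 220 Hz tone, 0.5 s of silence.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])

trimmed = trim_silence(signal)
print(len(signal), len(trimmed))
```

Because trimming works on whole frames, the result can keep up to one frame of padding on each side of the tone.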


	## Model Checkpoints

	| Model Component | Description | URL |
	| --------------- | ----------- | --- |
	| F5-TTS Thai | Flow Matching-based Thai TTS models | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaF5) |
	| F5-TTS IPA | Flow Matching-based Thai-IPA TTS models | [Link](https://huggingface.co/biodatlab/ThonburianTTS/tree/main/megaIPA) |
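
Both checkpoints are flow-matching models. For intuition, sampling from such a model amounts to integrating an ODE `dx/dt = v(x, t)` from noise at `t = 0` to data at `t = 1`. The sketch below uses a hand-written toy velocity field that moves every point along a straight line toward a fixed target; in the real models the velocity is predicted by a neural network conditioned on text and reference audio. All names here (`sample_with_euler`, `toy_velocity`, `target`) are hypothetical.

```python
import numpy as np


def sample_with_euler(velocity, x0: np.ndarray, n_steps: int = 32) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


# Toy "data": a single fixed target point.
target = np.array([1.0, -2.0, 0.5])


def toy_velocity(x: np.ndarray, t: float) -> np.ndarray:
    # On the straight path x_t = (1 - t) * x0 + t * target, the velocity
    # that reaches the target at t = 1 is (target - x_t) / (1 - t).
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal(3)          # starting point drawn from N(0, I)
sample = sample_with_euler(toy_velocity, noise)
print(np.allclose(sample, target))
```

The velocity is constant along each straight path in this toy case, so Euler integration reaches the target up to floating-point error; a learned velocity field generally needs enough steps to keep the integration error small.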


	## Quick Usage

	### Installation

	Install dependencies:

	```bash
	pip install torch cached-path librosa transformers f5-tts
	sudo apt install ffmpeg
	```

	### Clone the GitHub repository

	```bash
	git clone https://github.com/biodatlab/thonburian-tts.git
	cd thonburian-tts
	```

	#### Loading Thai-script-based models

	```py
	from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
	import torch

	# Configure the F5-TTS model with the Thai-script checkpoint and vocabulary
	model_config = ModelConfig(
	    language="th",
	    model_type="F5",
	    checkpoint="hf://biodatlab/ThonburianTTS/megaF5/mega_f5_last.safetensors",
	    vocab_file="hf://biodatlab/ThonburianTTS/megaF5/mega_vocab.txt",
	    vocoder="vocos",
	    # Use the GPU when one is available, otherwise fall back to the CPU
	    device="cuda" if torch.cuda.is_available() else "cpu",
	)

	# Basic audio settings
	audio_config = AudioConfig(
	    silence_threshold=-45,
	    cfg_strength=2.5,
	    speed=1.0,
	)

	# Build the inference pipeline from both configurations
	pipeline = FlowTTSPipeline(model_config, audio_config)
	```
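
The `cfg_strength=2.5` setting above most likely controls classifier-free guidance. Assuming Thonburian TTS follows the upstream F5-TTS formulation, the guided prediction pushes the text-conditioned output away from the unconditioned one by a factor of `cfg_strength`. The sketch below shows only that arithmetic on toy arrays; `apply_cfg` is a hypothetical helper, not part of `flowtts`.

```python
import numpy as np


def apply_cfg(cond_pred: np.ndarray, uncond_pred: np.ndarray,
              cfg_strength: float) -> np.ndarray:
    """Classifier-free guidance: extrapolate away from the unconditioned prediction."""
    return cond_pred + (cond_pred - uncond_pred) * cfg_strength


# Toy predictions standing in for model outputs with and without conditioning.
cond = np.array([1.0, 2.0, 3.0])
uncond = np.array([0.5, 2.0, 4.0])

guided = apply_cfg(cond, uncond, cfg_strength=2.5)
print(guided)
```

With `cfg_strength=0.0` the function returns the conditioned prediction unchanged; larger values weight the conditioning more heavily.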


	#### Loading IPA-based models

	```py
	from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
	import torch

	# Configure the F5-TTS model with the Thai-IPA checkpoint and vocabulary
	model_config = ModelConfig(
	    model_type="F5",
	    checkpoint="hf://biodatlab/ThonburianTTS/megaIPA/model_last_prune.safetensors",
	    vocab_file="hf://biodatlab/ThonburianTTS/megaIPA/mega_vocab_ipa.txt",
	    vocoder="vocos",
	    # Use the GPU when one is available, otherwise fall back to the CPU
	    device="cuda" if torch.cuda.is_available() else "cpu",
	)

	# Basic audio settings
	audio_config = AudioConfig(
	    silence_threshold=-45,
	    cfg_strength=2.5,
	    speed=1.0,
	)

	# Build the inference pipeline from both configurations
	pipeline = FlowTTSPipeline(model_config, audio_config)
	```
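
The `checkpoint` and `vocab_file` values above use `hf://` paths. A reasonable reading is that the first two path segments name the Hugging Face repository and the rest is the file path inside it. The sketch below splits a path on that assumption; `split_hf_uri` and `HFLocation` are hypothetical names, and how `flowtts` resolves these paths internally is not shown here.

```python
from typing import NamedTuple


class HFLocation(NamedTuple):
    repo_id: str   # e.g. "org/repo"
    filename: str  # path of the file inside the repository


def split_hf_uri(uri: str) -> HFLocation:
    """Split 'hf://org/repo/path/to/file' into a repository ID and a file path."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// path: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected hf://org/repo/file, got {uri!r}")
    org, repo, filename = parts
    return HFLocation(repo_id=f"{org}/{repo}", filename=filename)


loc = split_hf_uri("hf://biodatlab/ThonburianTTS/megaF5/mega_f5_last.safetensors")
print(loc.repo_id)
print(loc.filename)
```

The two fields match the `repo_id` and `filename` arguments of `huggingface_hub.hf_hub_download`, which is one way to fetch a checkpoint manually.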

	## Example Outputs

	<table>
	<tr>
	<td align="center">
	<a href="https://youtu.be/rvmNgh0-jws">
	<img src="https://img.youtube.com/vi/rvmNgh0-jws/0.jpg" width="320"><br>
	🎵 Sample 1 – Single-Speaker Thai Normal Text
	</a>
	</td>
	<td align="center">
	<a href="https://youtu.be/jVz3EpRTn1U">
	<img src="https://img.youtube.com/vi/jVz3EpRTn1U/0.jpg" width="320"><br>
	🎵 Sample 2 – Single-Speaker Thai Code-mixed Text
	</a>
	</td>
	<td align="center">
	<a href="https://youtu.be/sbaOdMhz3Z4">
	<img src="https://img.youtube.com/vi/sbaOdMhz3Z4/0.jpg" width="320"><br>
	🎵 Sample 3 – Multi-Speaker Conversational Speech
	</a>
	</td>
	</tr>
	</table>

	---

	## Developers

	- [Looloo Technology](https://loolootech.com/)
	- [Biomedical and Data Lab, Mahidol University](https://biodatlab.github.io/)

	<p align="center">
	<img width="150px" src="assets/looloo-logo.png" />
	</p>


	## Citation

	If you use ThonburianTTS in your research, please cite:

	```bibtex
	@INPROCEEDINGS{11320472,
	  author={Aung, Thura and Sriwirote, Panyut and Thavornmongkol, Thanachot and Pipatsrisawat, Knot and Achakulvisut, Titipat and Aung, Zaw Htet},
	  booktitle={2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
	  title={ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech},
	  year={2025},
	  volume={},
	  number={},
	  pages={1-6},
	  keywords={Adaptation models;Codes;Accuracy;Error analysis;Phonetics;Robustness;Natural language processing;Text to speech;Noise measurement;Research and development;Thai text-to-speech;Flow matching;F5-TTS},
	  doi={10.1109/iSAI-NLP66160.2025.11320472}}
	```

	```
	Thura Aung, Panyut Sriwirote, Thanachot Thavornmongkol, Knot Pipatsrisawat, Titipat Achakulvisut, Zaw Htet Aung, "ThonburianTTS: Enhancing Neural Flow Matching Models for Authentic Thai Text-to-Speech", 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Phuket, Thailand, 2025, pp. 1-6, doi: 10.1109/iSAI-NLP66160.2025.11320472.
	```

	## License

	The models are released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)](LICENSE-CC-BY-NC-SA).

	## Acknowledgement

	We thank the NSTDA Supercomputer Center (ThaiSC), project #pv824003, for providing the computing resources for this work.