MeloTTS_Vietnamese / README.md

Update README.md

6f84009 verified 8 months ago

4.21 kB

	---
	license: mit
	datasets:
	- doof-ferb/infore1_25hours
	---
	<div align="center">
	<div> </div>
	<img src="logo.png" width="300"/> <br>
	<a href="https://trendshift.io/repositories/8133" target="_blank"><img src="https://trendshift.io/api/badge/repositories/8133" alt="myshell-ai%2FMeloTTS \| Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
	</div>

	## Introduction
	MeloTTS Vietnamese is a version of MeloTTS optimized for the Vietnamese language. This version inherits the high-quality characteristics of the original model but has been specially adjusted to work well with the Vietnamese language.

	## Technical Features
	- Uses [underthesea](https://github.com/undertheseanlp/underthesea) for Vietnamese text segmentation
	- Integrates [PhoBert](https://github.com/VinAIResearch/PhoBERT) (vinai/phobert-base-v2) to extract Vietnamese language features
	- Fully supports Vietnamese language characteristics:
	- 45 symbols (phonemes)
	- 8 tones (7 tonal marks and 1 unmarked tone)
	- All defined in `melo/text/symbols.py`
	- Text-to-phoneme conversion source:
	- Based on [Text2PhonemeSequence](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) library
	- An improved version with higher performance has been developed at [Text2PhonemeFast](https://github.com/manhcuong02/Text2PhonemeFast)

	## Fine-tuning from Base Model
	This model was fine-tuned from the base MeloTTS model by:
	- Replacing phonemes not found in English and Vietnamese with Vietnamese phonemes
	- Specifically replacing Korean phonemes with corresponding Vietnamese phonemes
	- Adjusting parameters to match Vietnamese phonetic characteristics

	## Training Data
	- The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
	- Note on data quality: This dataset has several limitations including poor voice quality, lack of punctuation, and inaccurate phonetic transcriptions. However, when trained on internal data, the results were much better.

	## Downloading the Model
	The pre-trained model can be downloaded from Hugging Face:
	- [MeloTTS Vietnamese on Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese)

	## Usage Guide

	### Data Preparation
	The data preparation process is detailed in `docs/training.md`. Basically, you need:
	- Audio files (recommended to use 44100Hz format)
	- Metadata file with the format:
	```
	path/to/audio_001.wav \|<speaker_name>\|<language_code>\|<text_001>
	path/to/audio_002.wav \|<speaker_name>\|<language_code>\|<text_002>
	```

	### Data Preprocessing
	To process data, use the command:
	```bash
	python melo/preprocess_text.py --metadata /path/to/text_training.list --config_path /path/to/config.json --device cuda:0 --val-per-spk 10 --max-val-total 500
	```
	or use the script `melo/preprocess_text.sh` with appropriate parameters.

	### Using the Model
	Refer to the notebook `test_infer.ipynb` to learn how to use the model:
	```python
	# colab_infer.py
	from melo.api import TTS

	# Speed is adjustable
	speed = 1.0

	# CPU is sufficient for real-time inference.
	# You can set it manually to 'cpu' or 'cuda' or 'cuda:0' or 'mps'
	device = "cuda:0" # Will automatically use GPU if available

	# English
	model = TTS(
	language="VI",
	device=device,
	config_path="/path/to/config.json",
	ckpt_path="/path/to/G_model.pth",
	)
	speaker_ids = model.hps.data.spk2id

	# Convert text to speech
	text = "Nhập văn bản tại đây"
	speaker_ids = model.hps.data.spk2id
	output_path = "output.wav"
	model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=1.0, quiet=True)
	```

	## Audio Examples
	Listen to sample outputs from the model:

	### Sample Audio
	<audio controls>
	<source src="samples/sample.wav" type="audio/wav">
	Your browser does not support the audio element.
	</audio>

	## License
	This project follows the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.

	## Acknowledgements

	This implementation is based on [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.