---
license: mit
datasets:
- doof-ferb/infore1_25hours
---
<div align="center">
<div> </div>
<img src="logo.png" width="300"/> <br>
<a href="https://trendshift.io/repositories/8133" target="_blank"><img src="https://trendshift.io/api/badge/repositories/8133" alt="myshell-ai%2FMeloTTS | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</div>
## Introduction
MeloTTS Vietnamese is a version of MeloTTS optimized for the Vietnamese language. This version inherits the high-quality characteristics of the original model but has been specially adjusted to work well with the Vietnamese language.
## Technical Features
- Uses [underthesea](https://github.com/undertheseanlp/underthesea) for Vietnamese text segmentation
- Integrates [PhoBERT](https://github.com/VinAIResearch/PhoBERT) (`vinai/phobert-base-v2`) to extract Vietnamese language features
- Fully supports Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in `melo/text/symbols.py`
- Text-to-phoneme conversion source:
  - Based on the [Text2PhonemeSequence](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) library
  - An improved version with higher performance has been developed at [Text2PhonemeFast](https://github.com/manhcuong02/Text2PhonemeFast)
## Fine-tuning from Base Model
This model was fine-tuned from the base MeloTTS model with the following changes:
- Replacing phonemes that are used by neither English nor Vietnamese with Vietnamese phonemes
  - Specifically, Korean phonemes were replaced with the corresponding Vietnamese phonemes
- Adjusting parameters to match Vietnamese phonetic characteristics
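One way to picture the phoneme replacement is as a swap inside a fixed-size symbol table. The sketch below is illustrative only and assumes the goal is to keep each slot index unchanged. The `replace_symbols` helper and the toy symbol lists are hypothetical and are not taken from `melo/text/symbols.py`.

```python
def replace_symbols(table, to_remove, to_add):
    """Return a copy of `table` in which symbols from `to_remove` are
    overwritten, slot by slot, with symbols from `to_add`."""
    if len(to_add) > len(to_remove):
        raise ValueError("not enough free slots for the new symbols")
    new_table = list(table)
    replacements = iter(to_add)
    for index, symbol in enumerate(new_table):
        if symbol in to_remove:
            # Keep the slot index; change only the symbol stored in it.
            # If `to_add` runs out, the old symbol stays in place.
            new_table[index] = next(replacements, symbol)
    return new_table


# Toy data (hypothetical): a tiny table with three Korean jamo slots.
base_table = ["_", "a", "b", "k", "ㄱ", "ㅏ", "ㅂ"]
korean_only = ["ㄱ", "ㅏ", "ㅂ"]
vietnamese_only = ["ɗ", "ɯ"]  # placeholder Vietnamese phonemes

new_table = replace_symbols(base_table, korean_only, vietnamese_only)
print(new_table)
```

In this sketch the table length and the positions of the shared symbols do not change, so any index-based lookup built on the old table keeps pointing at the same slots.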
## Training Data
- The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
- Note on data quality: this dataset has several limitations, including poor voice quality, missing punctuation, and inaccurate phonetic transcriptions. Results were much better when the model was trained on internal data.
## Downloading the Model
The pre-trained model can be downloaded from Hugging Face:
- [MeloTTS Vietnamese on Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese)
## Usage Guide
### Data Preparation
The data preparation process is described in detail in `docs/training.md`. In short, you need:
- Audio files (a 44100 Hz sample rate is recommended)
- A metadata file in the following format:
```
path/to/audio_001.wav |<speaker_name>|<language_code>|<text_001>
path/to/audio_002.wav |<speaker_name>|<language_code>|<text_002>
```
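A metadata file in this format can be generated with a short script. The following is a minimal sketch using only the Python standard library. The helper names (`metadata_line`, `sample_rate`, `build_metadata`) are hypothetical, the language code `"VI"` is assumed from the inference example, and the sketch writes no spaces around the `|` separators, which is also an assumption.

```python
import wave

EXPECTED_RATE = 44100  # sample rate recommended for training audio


def metadata_line(wav_path, speaker, language, text):
    """Format one entry as path|speaker|language|text."""
    fields = [str(wav_path), speaker, language, text.strip()]
    for field in fields:
        if "|" in field or "\n" in field:
            raise ValueError(f"field contains a separator or newline: {field!r}")
    return "|".join(fields)


def sample_rate(wav_path):
    """Read the sample rate from the header of a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getframerate()


def build_metadata(entries, speaker, language="VI"):
    """Build metadata lines from (wav_path, transcript) pairs."""
    lines = []
    for wav_path, text in entries:
        rate = sample_rate(wav_path)
        if rate != EXPECTED_RATE:
            print(f"warning: {wav_path} is {rate} Hz, expected {EXPECTED_RATE} Hz")
        lines.append(metadata_line(wav_path, speaker, language, text))
    return lines
```

Writing `"\n".join(build_metadata(entries, "speaker_name"))` to a text file then gives one entry per line.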
### Data Preprocessing
To preprocess the data, run:
```bash
python melo/preprocess_text.py --metadata /path/to/text_training.list --config_path /path/to/config.json --device cuda:0 --val-per-spk 10 --max-val-total 500
```
Alternatively, use the script `melo/preprocess_text.sh` with appropriate parameters.
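The `--val-per-spk` and `--max-val-total` flags suggest a per-speaker validation split with a global cap. The sketch below shows one way such a split could work. It is not copied from `melo/preprocess_text.py`; the function name, the seed handling and the trimming order are assumptions.

```python
import random
from collections import defaultdict


def split_train_val(lines, val_per_spk=10, max_val_total=500, seed=0):
    """Split metadata lines (path|speaker|language|text) into train and val."""
    by_speaker = defaultdict(list)
    for line in lines:
        speaker = line.split("|")[1]
        by_speaker[speaker].append(line)

    rng = random.Random(seed)
    train, val = [], []
    for speaker_lines in by_speaker.values():
        rng.shuffle(speaker_lines)
        # Hold out up to `val_per_spk` utterances for each speaker.
        val.extend(speaker_lines[:val_per_spk])
        train.extend(speaker_lines[val_per_spk:])

    # Enforce the global cap by moving surplus validation items to train.
    if len(val) > max_val_total:
        train.extend(val[max_val_total:])
        val = val[:max_val_total]
    return train, val


# Toy metadata: two hypothetical speakers with 20 utterances each.
lines = [
    f"wavs/{spk}_{i}.wav|{spk}|VI|text {i}"
    for spk in ("spk_a", "spk_b")
    for i in range(20)
]
train, val = split_train_val(lines, val_per_spk=3, max_val_total=5)
print(len(train), len(val))
```

With three held-out utterances per speaker and a cap of five, one of the six candidates is returned to the training set.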
### Using the Model
Refer to the notebook `test_infer.ipynb` to learn how to use the model:
```python
from melo.api import TTS

# Speech speed (1.0 is the default rate)
speed = 1.0

# CPU is sufficient for real-time inference.
# Set the device manually to "cpu", "cuda", "cuda:0" or "mps".
device = "cuda:0"  # change to "cpu" if no GPU is available

# Load the Vietnamese model
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)

# Mapping from speaker names to speaker IDs
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"
output_path = "output.wav"
# Replace "speaker_name" with a key that exists in speaker_ids
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
```
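The `"speaker_name"` key in the example above is a placeholder, so a lookup with a clear error message can save some debugging. The helper below is a hypothetical sketch, not part of the MeloTTS API. It assumes the `spk2id` mapping supports `keys()` and `[]` like a dictionary, and it is demonstrated here with a plain `dict` as a stand-in.

```python
def resolve_speaker(spk2id, name=None):
    """Return the speaker ID for `name`, or the only speaker if `name` is None."""
    names = list(spk2id.keys())
    if name is None:
        if len(names) != 1:
            raise ValueError(f"several speakers available, pick one of: {names}")
        name = names[0]
    if name not in names:
        raise KeyError(f"unknown speaker {name!r}; available: {names}")
    return spk2id[name]


# Stand-in for model.hps.data.spk2id (the name and ID are made up).
demo_spk2id = {"speaker_name": 0}
speaker_id = resolve_speaker(demo_spk2id)
print(speaker_id)

# With a loaded model this could be used as:
# model.tts_to_file(text, resolve_speaker(model.hps.data.spk2id), output_path, speed=speed)
```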
## Audio Examples
Listen to sample outputs from the model:
### Sample Audio
<audio controls>
<source src="samples/sample.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
## License
This project follows the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.
## Acknowledgements
This implementation is based on [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.