nmcuong committed
Commit ac7e949 · verified · 1 parent: d86c8ba

Upload README.md with huggingface_hub

Files changed (1):
  README.md (+123 −44)
README.md CHANGED
@@ -1,13 +1,13 @@
- ---
- license: mit
- datasets:
- - doof-ferb/infore1_25hours
- language:
- - vi
- base_model:
- - myshell-ai/MeloTTS-English
- pipeline_tag: text-to-speech
- ---
  <div align="center">
  <div>&nbsp;</div>
  <img src="logo.png" width="300"/> <br>
@@ -15,66 +15,87 @@ pipeline_tag: text-to-speech
  </div>

  ## Introduction
- MeloTTS Vietnamese is a version of MeloTTS optimized for the Vietnamese language. This version inherits the high-quality characteristics of the original model but has been specially adjusted to work well with the Vietnamese language.

  ## Technical Features
  - Uses [underthesea](https://github.com/undertheseanlp/underthesea) for Vietnamese text segmentation
- - Integrates [PhoBert](https://github.com/VinAIResearch/PhoBERT) (vinai/phobert-base-v2) to extract Vietnamese language features
- - Fully supports Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in `melo/text/symbols.py`
- - Text-to-phoneme conversion source:
- - Based on [Text2PhonemeSequence](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) library
- - An improved version with higher performance has been developed at [Text2PhonemeFast](https://github.com/manhcuong02/Text2PhonemeFast)

  ## Fine-tuning from Base Model
  This model was fine-tuned from the base [MeloTTS](https://github.com/myshell-ai/MeloTTS) model by:
- - Replacing phonemes not found in English and Vietnamese with Vietnamese phonemes
- - Specifically replacing Korean phonemes with corresponding Vietnamese phonemes
- - Adjusting parameters to match Vietnamese phonetic characteristics
- - Github: [MeloTTS Vietnamese](https://github.com/manhcuong02/MeloTTS_Vietnamese)

  ## Training Data
  - The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
- - Note on data quality: This dataset has several limitations including poor voice quality, lack of punctuation, and inaccurate phonetic transcriptions. However, when trained on internal data, the results were much better.

  ## Downloading the Model
  The pre-trained model can be downloaded from Hugging Face:
  - [MeloTTS Vietnamese on Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese)

  ## Usage Guide

- ### Data Preparation
- The data preparation process is detailed in `docs/training.md`. Basically, you need:
- - Audio files (recommended to use 44100Hz format)
- - Metadata file with the format:
- ```
- path/to/audio_001.wav |<speaker_name>|<language_code>|<text_001>
- path/to/audio_002.wav |<speaker_name>|<language_code>|<text_002>
- ```

- ### Data Preprocessing
- To process data, use the command:
  ```bash
- python melo/preprocess_text.py --metadata /path/to/text_training.list --config_path /path/to/config.json --device cuda:0 --val-per-spk 10 --max-val-total 500
  ```
- or use the script `melo/preprocess_text.sh` with appropriate parameters.

- ### Using the Model
- Refer to the notebook `test_infer.ipynb` to learn how to use the model:
  ```python
- # colab_infer.py
  from melo.api import TTS

  # Speed is adjustable
  speed = 1.0

- # CPU is sufficient for real-time inference.
- # You can set it manually to 'cpu' or 'cuda' or 'cuda:0' or 'mps'
  device = "cuda:0" # Will automatically use GPU if available

- # English
  model = TTS(
  language="VI",
  device=device,
@@ -85,20 +106,78 @@ speaker_ids = model.hps.data.spk2id

  # Convert text to speech
  text = "Nhập văn bản tại đây"
- speaker_ids = model.hps.data.spk2id
  output_path = "output.wav"
- model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=1.0, quiet=True)
  ```

  ## Audio Examples
  Listen to sample outputs from the model:

- ### Sample Audio
  <audio controls src="https://huggingface.co/nmcuong/MeloTTS_Vietnamese/resolve/main/samples/sample.wav"></audio>

  ## License
- This project follows the MIT License, like the original MeloTTS project, allowing use for both commercial and non-commercial purposes.

  ## Acknowledgements

- This implementation is based on [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2) and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their awesome work.
 
+ ---
+ license: mit
+ datasets:
+ - doof-ferb/infore1_25hours
+ language:
+ - vi
+ base_model:
+ - myshell-ai/MeloTTS-English
+ pipeline_tag: text-to-speech
+ ---
  <div align="center">
  <div>&nbsp;</div>
  <img src="logo.png" width="300"/> <br>
  </div>

  ## Introduction
+
+ ### About MeloTTS
+
+ [MeloTTS](https://github.com/myshell-ai/MeloTTS) is a high-quality, open-source text-to-speech system developed by MyShell AI. It is built on top of the VITS/VITS2 architecture and uses BERT-based linguistic features to produce natural-sounding speech. MeloTTS supports multiple languages and is designed to be fast enough for real-time CPU inference.
+
+ **Strengths of the original MeloTTS:**
+ - High naturalness and expressiveness in synthesized speech
+ - Fast inference: runs in real time even on CPU
+ - Lightweight and easy to deploy
+ - Supports multiple languages (English, Chinese, Japanese, Korean, Spanish, French)
+ - Permissive MIT license, suitable for both commercial and non-commercial use
+
+ **Limitations of the original MeloTTS:**
+ - Not natively optimized for Vietnamese phonology (tones, phonemes)
+ - The default English/multilingual phonemizer does not handle Vietnamese tones and diacritics correctly
+ - No built-in support for Vietnamese-specific linguistic preprocessing
+
+ ### MeloTTS Vietnamese
+
+ **MeloTTS Vietnamese** is a version of MeloTTS specifically optimized for the Vietnamese language. It inherits the high quality and fast inference of the original model while adding targeted improvements for the phonological properties of Vietnamese, including its 6 tones, complex vowel system, and syllable structure.
+
+ This model is designed to produce natural, accurate Vietnamese speech and can be easily fine-tuned on custom Vietnamese datasets.

  ## Technical Features
  - Uses [underthesea](https://github.com/undertheseanlp/underthesea) for Vietnamese text segmentation
+ - Integrates [PhoBERT](https://github.com/VinAIResearch/PhoBERT) (vinai/phobert-base-v2) to extract Vietnamese linguistic features
+ - Full support for Vietnamese language characteristics:
  - 45 symbols (phonemes)
  - 8 tones (7 tonal marks and 1 unmarked tone)
  - All defined in `melo/text/symbols.py`
+ - Text-to-phoneme conversion:
+ - Based on the [Text2PhonemeSequence](https://github.com/thelinhbkhn2014/Text2PhonemeSequence) library
+ - An improved, higher-performance version is available at [Text2PhonemeFast](https://github.com/manhcuong02/Text2PhonemeFast)
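
The orthographic side of the tone inventory above can be illustrated with a small, self-contained sketch. This is not code from the repository (the model's actual symbol and tone tables live in `melo/text/symbols.py`); it only shows how the five written tone diacritics, plus the unmarked level tone, can be detected via Unicode decomposition:

```python
import unicodedata

# Vietnamese tone diacritics as Unicode combining marks (NFD form).
# Note: circumflex, breve, and horn (as in â, ă, ơ) are vowel-quality
# marks, not tone marks, so they are deliberately absent here.
TONE_MARKS = {
    "\u0301": "sắc (acute)",
    "\u0300": "huyền (grave)",
    "\u0309": "hỏi (hook above)",
    "\u0303": "ngã (tilde)",
    "\u0323": "nặng (dot below)",
}

def tone_of(syllable: str) -> str:
    """Return the orthographic tone of a Vietnamese syllable."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang (unmarked)"

print(tone_of("má"), tone_of("mạ"), tone_of("ma"))
```

The model's symbol set encodes 8 tone symbols rather than the 6 orthographic tones shown here; see `melo/text/symbols.py` for the authoritative inventory.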

  ## Fine-tuning from Base Model
  This model was fine-tuned from the base [MeloTTS](https://github.com/myshell-ai/MeloTTS) model by:
+ - Replacing phonemes not found in English/Vietnamese with Vietnamese-specific phonemes
+ - Specifically replacing Korean phonemes with their corresponding Vietnamese equivalents
+ - Adjusting model parameters to match Vietnamese phonetic characteristics
+ - GitHub: [MeloTTS Vietnamese](https://github.com/manhcuong02/MeloTTS_Vietnamese)

  ## Training Data
  - The model was trained on the Infore dataset, consisting of approximately 25 hours of speech
+ - **Note on data quality:** This dataset has several limitations, including suboptimal voice quality, missing punctuation, and imprecise phonetic transcriptions. When trained on internal, higher-quality data, results were significantly better.

  ## Downloading the Model
  The pre-trained model can be downloaded from Hugging Face:
  - [MeloTTS Vietnamese on Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese)

+ ---
+
  ## Usage Guide

+ ### Part 1: Inference
+
+ #### 1. Clone the Repository and Install Dependencies
+
  ```bash
+ git clone https://github.com/manhcuong02/MeloTTS_Vietnamese.git
+ cd MeloTTS_Vietnamese
+ pip install -r requirements.txt
  ```

+ #### 2. Download the Pre-trained Model
+
+ Download the model checkpoint and config from [Hugging Face](https://huggingface.co/nmcuong/MeloTTS_Vietnamese) and place them in your desired directory.
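
One way to fetch the files is with the official `huggingface_hub` client, which can download the whole repository snapshot. This is a sketch, not part of the repo's documented workflow: it assumes `pip install huggingface_hub`, and the local directory name is an arbitrary choice.

```python
from huggingface_hub import snapshot_download

# Download every file in the model repo (checkpoint, config, samples)
# into a local folder; the folder name here is arbitrary.
local_dir = snapshot_download(
    repo_id="nmcuong/MeloTTS_Vietnamese",
    local_dir="checkpoints/melotts_vietnamese",
)
print(local_dir)
```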
+
+ #### 3. Run Inference
+
+ Refer to the notebook `test_infer.ipynb` for a full example. Basic usage:
+
  ```python
  from melo.api import TTS

  # Speed is adjustable
  speed = 1.0

+ # You can set device to 'cpu', 'cuda', 'cuda:0', or 'mps'
  device = "cuda:0" # Will automatically use GPU if available

+ # Load the Vietnamese TTS model
  model = TTS(
  language="VI",
  device=device,
@@ -85,20 +106,78 @@ speaker_ids = model.hps.data.spk2id

  # Convert text to speech
  text = "Nhập văn bản tại đây"
  output_path = "output.wav"
+ model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
+ ```
+
+ ---
+
+ ### Part 2: Training & Fine-tuning
+
+ #### 1. Data Preparation
+
+ The full data preparation process is detailed in `docs/training.md`. At minimum, you need:
+ - Audio files (recommended sample rate: 44100 Hz)
+ - A metadata file in the following format:
+ ```
+ path/to/audio_001.wav |<speaker_name>|<language_code>|<text_001>
+ path/to/audio_002.wav |<speaker_name>|<language_code>|<text_002>
+ ```
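
The metadata lines above can be generated and checked with a few lines of plain Python. This is an illustrative sketch only; the field names are placeholders, and `docs/training.md` remains the authoritative reference for the format.

```python
def make_entry(path: str, speaker: str, lang: str, text: str) -> str:
    # One metadata line: "<audio path> |<speaker_name>|<language_code>|<text>"
    return f"{path} |{speaker}|{lang}|{text}"

def parse_entry(line: str) -> dict:
    # Split on the first three '|' separators; any further '|'
    # characters stay inside the text field.
    path, speaker, lang, text = line.rstrip("\n").split("|", 3)
    return {"path": path.strip(), "speaker": speaker, "language": lang, "text": text}

line = make_entry("wavs/audio_001.wav", "speaker_name", "VI", "Xin chào Việt Nam")
print(parse_entry(line))
```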
+
+ #### 2. Data Preprocessing
+
+ Run the preprocessing script to prepare the training data:
+
+ ```bash
+ python melo/preprocess_text.py \
+     --metadata /path/to/text_training.list \
+     --config_path /path/to/config.json \
+     --device cuda:0 \
+     --val-per-spk 10 \
+     --max-val-total 500
  ```
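
The `--val-per-spk` and `--max-val-total` flags bound the validation split: up to 10 validation utterances are taken per speaker, capped at 500 overall. A rough sketch of that selection logic, for illustration only (the actual implementation lives in `melo/preprocess_text.py`):

```python
from collections import defaultdict

def split_train_val(entries, val_per_spk=10, max_val_total=500):
    """entries: iterable of (speaker, metadata_line) pairs."""
    taken = defaultdict(int)  # validation utterances taken per speaker
    train, val = [], []
    for speaker, line in entries:
        if taken[speaker] < val_per_spk and len(val) < max_val_total:
            val.append(line)
            taken[speaker] += 1
        else:
            train.append(line)
    return train, val

# Example: 2 speakers with 30 utterances each -> 10 validation lines per speaker
entries = [(f"spk{i % 2}", f"utt_{i:03d}") for i in range(60)]
train, val = split_train_val(entries)
print(len(train), len(val))  # 40 20
```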

+ Alternatively, use the shell script `melo/preprocess_text.sh` with appropriate parameters.
+
+ #### 3. Start Training
+
+ Follow the training instructions in `docs/training.md`.
+
+ ---
+
+ ## Code & Fine-tuning
+
+ The Vietnamese adaptation, code implementation, and fine-tuning of this model were developed by **Nguyễn Mạnh Cường**.
+
+ - GitHub: [manhcuong02](https://github.com/manhcuong02)
+ - Repository: [MeloTTS Vietnamese](https://github.com/manhcuong02/MeloTTS_Vietnamese)
+
+ ---
+
  ## Audio Examples
+
  Listen to sample outputs from the model:

+ ### Sample 1
+ > *"Buổi sáng ở thành phố bắt đầu bằng tiếng xe cộ nhộn nhịp và ánh nắng nhẹ xuyên qua những tòa nhà cao tầng."*
+ > ("Morning in the city begins with the bustle of traffic and soft sunlight slanting through the high-rise buildings.")
+
  <audio controls src="https://huggingface.co/nmcuong/MeloTTS_Vietnamese/resolve/main/samples/sample.wav"></audio>

+ ### Sample 2
+ > *"Người đi làm vội vã, học sinh ríu rít trò chuyện, còn quán cà phê góc phố thì thoang thoảng mùi thơm dễ chịu."*
+ > ("Commuters hurry past, students chatter away, and the corner café gives off a pleasant aroma.")
+
+ <audio controls src="https://huggingface.co/nmcuong/MeloTTS_Vietnamese/resolve/main/samples/sample-2.wav"></audio>
+
+ ### Sample 3
+ > *"Cuối cùng, hãy thử thì thầm một câu thật nhẹ nhàng, rồi bất ngờ chuyển sang giọng nói to, rõ và đầy năng lượng."*
+ > ("Finally, try whispering a sentence very softly, then suddenly switch to a loud, clear, and energetic voice.")
+
+ <audio controls src="https://huggingface.co/nmcuong/MeloTTS_Vietnamese/resolve/main/samples/sample-3.wav"></audio>
+
+ ---
+
  ## License
+ This project is licensed under the [MIT License](LICENSE), consistent with the original MeloTTS project. It may be used for both commercial and non-commercial purposes.

  ## Acknowledgements

+ This implementation is based on [TTS](https://github.com/coqui-ai/TTS), [VITS](https://github.com/jaywalnut310/vits), [VITS2](https://github.com/daniilrobnikov/vits2), and [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2). We appreciate their outstanding work.