Update README.md
README.md CHANGED
@@ -1,19 +1,13 @@

 # Korean FastSpeech 2 - Pytorch Implementation

-
-# Introduction

-
-In other words, it takes a model that previously predicted from audio-text alone and adds pitch, energy, and duration.
-In Fastspeech2, duration is extracted with MFA (Montreal Forced Aligner). The alignment between phonemes and speech is built from the durations extracted this way.
-
-
-* This repository: https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch
-
-
-
-# Install Dependencies
 python=3.9,
 [pytorch](https://pytorch.org/)=1.13, [ffmpeg](https://ffmpeg.org/) [g2pk](https://github.com/Kyubyong/g2pK)
 ```
@@ -23,129 +17,13 @@ pip install g2pk
 pip install -r requirements.txt
 ```

-#
-
-
-
-
-
-1. Create wav-lab pairs
-
-You need the wav files and, for each wav file, a lab file containing the transcript of its utterance.
-
-
-This function reads the wav file names and the text from the metadata and creates transcript files (.lab) that differ from the wav files only in their extension.
-
-
-
-
-When this step finishes, wav-lab pairs should exist in the form shown above.
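The wav-lab pairing described in step 1 can be sketched as follows. This is a minimal illustration, not the repository's own function: `write_lab_files`, its parameters, and the default column positions are hypothetical, and the sketch assumes a `|`-separated metadata file whose first field is the wav path relative to the dataset folder.

```python
from pathlib import Path


def write_lab_files(metadata_path, wav_dir, wav_col=0, text_col=1, sep="|"):
    """Write one .lab transcript next to each wav listed in the metadata.

    Hypothetical helper: the column positions are assumptions and must be
    adjusted to the layout of your own metadata file.
    """
    wav_dir = Path(wav_dir)
    written = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            parts = line.split(sep)
            wav_path = wav_dir / parts[wav_col]
            # Same folder and base name as the wav; only the extension changes.
            lab_path = wav_path.with_suffix(".lab")
            # The sketch does not check that the wav exists; it only makes
            # sure the target folder is there before writing.
            lab_path.parent.mkdir(parents=True, exist_ok=True)
            lab_path.write_text(parts[text_col].strip() + "\n", encoding="utf-8")
            written.append(lab_path)
    return written
```

Calling `write_lab_files("transcript.v.1.4.txt", "path/to/dataset")` would then leave a `.lab` file beside every listed wav, which is the layout a forced aligner expects for a corpus folder.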
-
-
-2. Create the lexicon file
-
-Create a lexicon file that records the phonemes for every utterance in your dataset.
-
-Run the make_p_dict and make_lexicon functions in the [processing_utils.ipynb](https://github.com/JH-lee95/Fastspeech2-Korean/blob/master/processing_utils.ipynb) notebook, in that order.
-
-
-
-When this step finishes, a p_lexicon.txt file in the form shown above is created.
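To make the idea of a phoneme lexicon concrete, here is a minimal sketch. It is not the notebook's `make_p_dict` / `make_lexicon` code: the function names below are hypothetical, and it assumes that decomposing each Hangul syllable into its jamo with Unicode NFD normalization is an acceptable phone set. It does not strip punctuation or apply any pronunciation rules.

```python
import unicodedata


def to_jamo(word):
    """Split precomposed Hangul syllables into conjoining jamo (Unicode NFD)."""
    return list(unicodedata.normalize("NFD", word))


def build_lexicon(utterances):
    """Map every distinct whitespace-separated word to its space-joined jamo."""
    lexicon = {}
    for utterance in utterances:
        for word in utterance.split():
            lexicon.setdefault(word, " ".join(to_jamo(word)))
    return lexicon


def write_lexicon(lexicon, path):
    """Write one 'word<TAB>phone phone ...' entry per line, sorted by word."""
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(lexicon):
            f.write(f"{word}\t{lexicon[word]}\n")
```

Each output line holds a word followed by its whitespace-separated phones, which is the general shape of a pronunciation dictionary for a forced aligner. Whether the resulting symbols match the repository's own p_lexicon.txt is not something this sketch guarantees.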
-
-
-3. Install MFA
-
-* For detailed MFA installation instructions, see https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html.
-
-
-4. Run MFA
-
-MFA provides a pre-trained Korean acoustic model and g2p model. However, those models produce English phonemes, so to produce Korean phonemes you have to train a model yourself.
-
-Once MFA is installed, run the following command.
-
-```
-mfa train <dataset location> <p_lexicon location> <out directory>
-```
-
-If MFA runs successfully, TextGrid files of the following form are created.
-
-
-
-
-**(3) Data preprocessing**
-
-1. hparams.py
-
-- dataset : dataset folder name
-- data_path : parent folder of the dataset
-- meta_name : metadata file name, e.g. transcript.v.1.4.txt
-- textgrid_path : location of the compressed TextGrid archive (compress the TextGrid files beforehand)
-- textgrid_name : file name of the compressed TextGrid archive
-
-2. preprocess.py
-
-
-
-Change this part to match the name of your dataset.
-
-
-3. data/kss.py
-
-- line 19 : basename,text = parts[?],parts[?] # the positions of the wav name and the text in a metadata line after splitting it on "|"
-- line 37 : basename,text = parts[?],parts[?]
-
-
-Once all of the changes above are done, run the following command.
-
-```
-python preprocess.py
-```
-
-# Train
-Before training the model, [download](https://drive.google.com/file/d/1GxaLlTrEhq0aXFvd_X1f4b-ev7-FH8RB/view?usp=sharing) the VocGAN (neural vocoder) pre-trained on the kss dataset and place it under ``vocoder/pretrained_models/``.
-
-Next, run the following command to train the model.
-```
-python train.py
-```
-Trained models are saved in ``ckpt/`` and tensorboard logs in ``log/``. Audio generated during evaluation while training is saved in the ``eval/`` folder.
-
-# Synthesis
-The command for generating speech from the trained parameters is as follows.
-```
-python synthesis.py --step 500000
-```
-The synthesized audio can be found in the ```results/``` directory.
-
-# Pretrained model
-[Download](https://drive.google.com/file/d/1qkFuNLqPIm-A5mZZDPGK1mnp0_Lh00PN/view?usp=sharing) the pretrained model (checkpoint).
-Then place it at the path recorded in the ```checkpoint_path``` variable in ```hparams.py``` to use the pretrained model.
-
-
-# Fine-Tuning
-When fine-tuning from the pretrained model, at least 30 minutes of data is recommended. In an experiment with about 10 minutes of data, the voice and pronunciation were largely reproduced, but the noise was severe.
-
-Fine-tuning requires adjusting the learning rate. It needs a suitably low value, which has to be found empirically. (I used the learning rate from the final step.)
-
-```
-python train.py --restore_step 350000
-```
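To illustrate what "the learning rate from the final step" could mean in numbers, the sketch below evaluates a Transformer-style warmup schedule with inverse-square-root decay at a given step. That the repository uses this schedule is an assumption, as are the function name `noam_lr` and the default values `d_model=256` and `warmup_steps=4000`; check the training code and hparams.py before relying on it.

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """Learning rate of a warmup / inverse-square-root-decay schedule.

    Assumed form: lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    The defaults are illustrative assumptions, not values read from the repo.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    init_lr = d_model ** -0.5
    # Linear warmup until warmup_steps, then decay proportional to step^-0.5.
    return init_lr * min(step ** -0.5, step * warmup_steps ** -1.5)
```

Under these assumed values, `noam_lr(350000)` is on the order of 1e-4, far below the peak reached at the end of warmup, which is consistent with the advice to fine-tune with a suitably low learning rate.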
-
-
-
-# Tensorboard
-```
-tensorboard --logdir log/hp.dataset/
-```
-Tensorboard logs are saved in the ```log/hp.dataset/``` directory, so you can run tensorboard with the command above to monitor training progress.
-
-

 # References
 - [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren, *et al*.
-- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263), Y. Ren, *et al*.
-- [ming024's FastSpeech2 implementation](https://github.com/ming024/FastSpeech2)
-- [rishikksh20's VocGAN implementation](https://github.com/rishikksh20/VocGAN)
 - [HGU-DLLAB](https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch)
-- [
+---
+license: mit
+language:
+- ko
+---

 # Korean FastSpeech 2 - Pytorch Implementation


+# Dependencies
 python=3.9,
 [pytorch](https://pytorch.org/)=1.13, [ffmpeg](https://ffmpeg.org/) [g2pk](https://github.com/Kyubyong/g2pK)
 ```
@@ -23,129 +17,13 @@ pip install g2pk
 pip install -r requirements.txt
 ```

+# Usage
+Data preprocess
+Train VocGAN model
+Train Fastspeech2 model

 # References
 - [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren, *et al*.
 - [HGU-DLLAB](https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch)
+- [rishikksh20's VocGAN implementation](https://github.com/rishikksh20/VocGAN)
+