StyleTTS 2 (code, datasets, models, paper)
- .gitattributes +10 -0
- StyleTTS 2. Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.pdf +3 -0
- code/StyleTTS-VC.zip +3 -0
- code/StyleTTS.zip +3 -0
- code/StyleTTS2.zip +3 -0
- code/stylish-tts.zip +3 -0
- datasets/multilingual-phonemes-10k-alpha/.gitattributes +56 -0
- datasets/multilingual-phonemes-10k-alpha/LICENSE +0 -0
- datasets/multilingual-phonemes-10k-alpha/README.md +102 -0
- datasets/multilingual-phonemes-10k-alpha/ca.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/de.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/el.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/en-xl.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/en.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/es.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/fa.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/fi.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/fr.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/it.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/languages.txt +15 -0
- datasets/multilingual-phonemes-10k-alpha/pl.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/pt.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/ru.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/source.txt +1 -0
- datasets/multilingual-phonemes-10k-alpha/sv.json +0 -0
- datasets/multilingual-phonemes-10k-alpha/uk.json +3 -0
- datasets/multilingual-phonemes-10k-alpha/zh.json +3 -0
- models/ar/StyleTTS2-LibriTTS-arabic/.gitattributes +36 -0
- models/ar/StyleTTS2-LibriTTS-arabic/README.md +142 -0
- models/ar/StyleTTS2-LibriTTS-arabic/config.yml +114 -0
- models/ar/StyleTTS2-LibriTTS-arabic/model.pth +3 -0
- models/ar/StyleTTS2-LibriTTS-arabic/source.txt +1 -0
- models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav +3 -0
- models/en/StyleTTS2-LibriTTS/.gitattributes +35 -0
- models/en/StyleTTS2-LibriTTS/Models/config.yml +21 -0
- models/en/StyleTTS2-LibriTTS/Models/epochs_2nd_00020.pth +3 -0
- models/en/StyleTTS2-LibriTTS/README.md +100 -0
- models/en/StyleTTS2-LibriTTS/source.txt +1 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
datasets/multilingual-phonemes-10k-alpha/ca.json filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
datasets/multilingual-phonemes-10k-alpha/el.json filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
datasets/multilingual-phonemes-10k-alpha/en-xl.json filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
datasets/multilingual-phonemes-10k-alpha/fa.json filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
datasets/multilingual-phonemes-10k-alpha/pt.json filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
datasets/multilingual-phonemes-10k-alpha/ru.json filter=lfs diff=lfs merge=lfs -text
|
| 42 |
+
datasets/multilingual-phonemes-10k-alpha/uk.json filter=lfs diff=lfs merge=lfs -text
|
| 43 |
+
datasets/multilingual-phonemes-10k-alpha/zh.json filter=lfs diff=lfs merge=lfs -text
|
| 44 |
+
models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav filter=lfs diff=lfs merge=lfs -text
|
| 45 |
+
StyleTTS[[:space:]]2.[[:space:]]Towards[[:space:]]Human-Level[[:space:]]Text-to-Speech[[:space:]]through[[:space:]]Style[[:space:]]Diffusion[[:space:]]and[[:space:]]Adversarial[[:space:]]Training[[:space:]]with[[:space:]]Large[[:space:]]Speech[[:space:]]Language[[:space:]]Models.pdf filter=lfs diff=lfs merge=lfs -text
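The last rule above writes every space in the PDF filename as `[[:space:]]`, because whitespace separates the pattern from its attributes in a `.gitattributes` line. A minimal sketch of that escaping follows; the helper name `lfs_attribute_line` is hypothetical, and only spaces are handled, which is all this filename needs.

```python
def lfs_attribute_line(path: str) -> str:
    """Build a .gitattributes rule that routes `path` through Git LFS.

    Hypothetical helper for illustration: each space becomes the
    [[:space:]] character class, as in the rule shown above.
    """
    pattern = path.replace(" ", "[[:space:]]")
    return f"{pattern} filter=lfs diff=lfs merge=lfs -text"


print(lfs_attribute_line("StyleTTS 2. Towards Human-Level.pdf"))
```

Paths without spaces, such as `code/StyleTTS2.zip`, pass through unchanged.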
|
StyleTTS 2. Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.pdf
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f528cab389ea8af17cfcc95ab5847975bb79b00dd46424d6b2d44a1e44017c55
|
| 3 |
+
size 4082571
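Each LFS-tracked file in this commit appears as a three-line pointer (`version`, `oid`, `size`) rather than the file contents. The sketch below parses such a pointer and checks a downloaded file against it. The function names are hypothetical, and it assumes the `oid` is a SHA-256 digest, as in every pointer shown here.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Return True if the file has the size and SHA-256 digest the pointer records."""
    if path.stat().st_size != pointer["size"]:
        return False
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in 1 MiB chunks so multi-gigabyte checkpoints fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f528cab389ea8af17cfcc95ab5847975bb79b00dd46424d6b2d44a1e44017c55
size 4082571
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Comparing the size first is a cheap early exit before hashing a large file.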
|
code/StyleTTS-VC.zip
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d26c819eea0d52ba571d8f7dc69ccb8acb3db568ad3983ef0270929406975bf8
|
| 3 |
+
size 215127843
|
code/StyleTTS.zip
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9bb6571ecb71baf369e4a04f9b92c7d56b82af8d71c0f2ec69786de0064ddb1a
|
| 3 |
+
size 235881010
|
code/StyleTTS2.zip
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:cdb682c2d4bb88dbd0556de55b3893d5cb96c72bdd4323841ec2296e999c9897
|
| 3 |
+
size 280792923
|
code/stylish-tts.zip
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f941ff08b1be23b6531b127bd41149478facfbf728f0e574434e23fd3c8e7bdc
|
| 3 |
+
size 2739847
|
datasets/multilingual-phonemes-10k-alpha/.gitattributes
ADDED
|
@@ -0,0 +1,56 @@
|
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.lz4 filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
# Audio files - uncompressed
|
| 38 |
+
*.pcm filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
*.sam filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
*.raw filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
# Audio files - compressed
|
| 42 |
+
*.aac filter=lfs diff=lfs merge=lfs -text
|
| 43 |
+
*.flac filter=lfs diff=lfs merge=lfs -text
|
| 44 |
+
*.mp3 filter=lfs diff=lfs merge=lfs -text
|
| 45 |
+
*.ogg filter=lfs diff=lfs merge=lfs -text
|
| 46 |
+
*.wav filter=lfs diff=lfs merge=lfs -text
|
| 47 |
+
# Image files - uncompressed
|
| 48 |
+
*.bmp filter=lfs diff=lfs merge=lfs -text
|
| 49 |
+
*.gif filter=lfs diff=lfs merge=lfs -text
|
| 50 |
+
*.png filter=lfs diff=lfs merge=lfs -text
|
| 51 |
+
*.tiff filter=lfs diff=lfs merge=lfs -text
|
| 52 |
+
# Image files - compressed
|
| 53 |
+
*.jpg filter=lfs diff=lfs merge=lfs -text
|
| 54 |
+
*.jpeg filter=lfs diff=lfs merge=lfs -text
|
| 55 |
+
*.webp filter=lfs diff=lfs merge=lfs -text
|
| 56 |
+
*.json filter=lfs diff=lfs merge=lfs -text
|
datasets/multilingual-phonemes-10k-alpha/LICENSE
ADDED
|
File without changes
|
datasets/multilingual-phonemes-10k-alpha/README.md
ADDED
|
@@ -0,0 +1,102 @@
|
| 1 |
+
---
|
| 2 |
+
license: cc-by-sa-3.0
|
| 3 |
+
license_name: cc-by-sa
|
| 4 |
+
configs:
|
| 5 |
+
- config_name: en
|
| 6 |
+
data_files: en.json
|
| 7 |
+
default: true
|
| 8 |
+
- config_name: en-xl
|
| 9 |
+
data_files: en-xl.json
|
| 10 |
+
- config_name: ca
|
| 11 |
+
data_files: ca.json
|
| 12 |
+
- config_name: de
|
| 13 |
+
data_files: de.json
|
| 14 |
+
- config_name: es
|
| 15 |
+
data_files: es.json
|
| 16 |
+
- config_name: el
|
| 17 |
+
data_files: el.json
|
| 18 |
+
- config_name: fa
|
| 19 |
+
data_files: fa.json
|
| 20 |
+
- config_name: fi
|
| 21 |
+
data_files: fi.json
|
| 22 |
+
- config_name: fr
|
| 23 |
+
data_files: fr.json
|
| 24 |
+
- config_name: it
|
| 25 |
+
data_files: it.json
|
| 26 |
+
- config_name: pl
|
| 27 |
+
data_files: pl.json
|
| 28 |
+
- config_name: pt
|
| 29 |
+
data_files: pt.json
|
| 30 |
+
- config_name: ru
|
| 31 |
+
data_files: ru.json
|
| 32 |
+
- config_name: sv
|
| 33 |
+
data_files: sv.json
|
| 34 |
+
- config_name: uk
|
| 35 |
+
data_files: uk.json
|
| 36 |
+
- config_name: zh
|
| 37 |
+
data_files: zh.json
|
| 38 |
+
language:
|
| 39 |
+
- en
|
| 40 |
+
- ca
|
| 41 |
+
- de
|
| 42 |
+
- es
|
| 43 |
+
- el
|
| 44 |
+
- fa
|
| 45 |
+
- fi
|
| 46 |
+
- fr
|
| 47 |
+
- it
|
| 48 |
+
- pl
|
| 49 |
+
- pt
|
| 50 |
+
- ru
|
| 51 |
+
- sv
|
| 52 |
+
- uk
|
| 53 |
+
- zh
|
| 54 |
+
tags:
|
| 55 |
+
- synthetic
|
| 56 |
+
---
|
| 57 |
+
|
| 58 |
+
# Multilingual Phonemes 10K Alpha
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
This dataset contains approximately 10,000 pairs of text and phonemes from each supported language. We support 15 languages in this dataset, so we have a total of ~150K pairs. This does not include the English-XL dataset, which includes another 100K unique rows.
|
| 62 |
+
|
| 63 |
+
## Languages
|
| 64 |
+
|
| 65 |
+
The dataset covers 15 languages, for a total of around 150,000 text-phoneme pairs. This excludes the English-XL split, which adds 100K phonemized pairs that do not appear in any other split.
|
| 66 |
+
|
| 67 |
+
* English (en)
|
| 68 |
+
* English-XL (en-xl): ~100K phonemized pairs, English-only
|
| 69 |
+
* Catalan (ca)
|
| 70 |
+
* German (de)
|
| 71 |
+
* Spanish (es)
|
| 72 |
+
* Greek (el)
|
| 73 |
+
* Persian (fa): Requested by [@Respair](https://huggingface.co/Respair)
|
| 74 |
+
* Finnish (fi)
|
| 75 |
+
* French (fr)
|
| 76 |
+
* Italian (it)
|
| 77 |
+
* Polish (pl)
|
| 78 |
+
* Portuguese (pt)
|
| 79 |
+
* Russian (ru)
|
| 80 |
+
* Swedish (sv)
|
| 81 |
+
* Ukrainian (uk)
|
| 82 |
+
* Chinese (zh): Thank you to [@eugenepentland](https://huggingface.co/eugenepentland) for assistance in processing this text, as East-Asian languages are the most compute-intensive!
|
| 83 |
+
|
| 84 |
+
## License + Credits
|
| 85 |
+
|
| 86 |
+
Source data comes from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and is licensed under CC-BY-SA 3.0. This dataset is licensed under CC-BY-SA 3.0.
|
| 87 |
+
|
| 88 |
+
## Processing
|
| 89 |
+
|
| 90 |
+
We preprocessed the dataset as follows:
|
| 91 |
+
|
| 92 |
+
1. Download data from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) by language, selecting only the first Parquet file and naming it with the language code
|
| 93 |
+
2. Process using [Data Preprocessing Scripts (StyleTTS 2 Community members only)](https://huggingface.co/styletts2-community/data-preprocessing-scripts) and modify the code to work with the language
|
| 94 |
+
3. Script: Clean the text
|
| 95 |
+
4. Script: Remove ultra-short phrases
|
| 96 |
+
5. Script: Phonemize
|
| 97 |
+
6. Script: Save JSON
|
| 98 |
+
7. Upload dataset
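The scripted steps above (clean, filter, phonemize, save) can be sketched as follows. The actual preprocessing scripts are not public, so everything here is an assumption: the `{"text", "phonemes"}` record layout, the whitespace-only cleaning rule, the three-word threshold for "ultra-short", and the `fake_phonemize` stand-in, which would be replaced by a real phonemizer backend. A word-count threshold also only makes sense for space-delimited languages.

```python
import json
import re
from typing import Callable, Iterable


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (assumed cleaning rule)."""
    return re.sub(r"\s+", " ", text).strip()


def build_pairs(
    lines: Iterable[str],
    phonemize: Callable[[str], str],
    min_words: int = 3,  # assumed threshold for "ultra-short" phrases
) -> list[dict]:
    """Clean each line, drop very short ones, and pair the rest with phonemes."""
    pairs = []
    for raw in lines:
        text = clean_text(raw)
        if len(text.split()) < min_words:
            continue
        pairs.append({"text": text, "phonemes": phonemize(text)})
    return pairs


def fake_phonemize(text: str) -> str:
    """Stand-in for a real phonemizer backend; it only lowercases the text."""
    return text.lower()


pairs = build_pairs(["  Hello   there, world! ", "Too short"], fake_phonemize)
# ensure_ascii=False keeps IPA symbols and non-Latin scripts readable in the file.
serialized = json.dumps(pairs, ensure_ascii=False, indent=2)
print(serialized)
```

Passing the phonemizer in as a callable keeps the per-language changes mentioned in step 2 confined to one function.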
|
| 99 |
+
|
| 100 |
+
## Note
|
| 101 |
+
|
| 102 |
+
East-Asian languages are experimental. We do not distinguish between Traditional and Simplified Chinese. The dataset consists mainly of Simplified Chinese in the `zh` split. We recommend converting characters to Simplified Chinese during inference, using a library such as `hanziconv` or `chinese-converter`.
|
datasets/multilingual-phonemes-10k-alpha/ca.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:993ac7b508efb60c009c97861941676699549a82828898d214e40b0d43c459ab
|
| 3 |
+
size 10888235
|
datasets/multilingual-phonemes-10k-alpha/de.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/el.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a8525fbdbfce41ad9bd29b390c49854edcdd942a35e70631bae52e3210cb49a5
|
| 3 |
+
size 12178762
|
datasets/multilingual-phonemes-10k-alpha/en-xl.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:56856fb3ef4e8eb9d6335c8199d51bb40eee4affbb8d344cc76e317fc72b8d45
|
| 3 |
+
size 84063713
|
datasets/multilingual-phonemes-10k-alpha/en.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/es.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/fa.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f33e05a4a5e3ee0104e0599c4830d4ff505debc5d77eaab5b989e9d8d1eda23e
|
| 3 |
+
size 18121401
|
datasets/multilingual-phonemes-10k-alpha/fi.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/fr.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/it.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/languages.txt
ADDED
|
@@ -0,0 +1,15 @@
|
| 1 |
+
English
|
| 2 |
+
Catalan
|
| 3 |
+
German
|
| 4 |
+
Spanish
|
| 5 |
+
Greek
|
| 6 |
+
Persian
|
| 7 |
+
Finnish
|
| 8 |
+
French
|
| 9 |
+
Italian
|
| 10 |
+
Polish
|
| 11 |
+
Portuguese
|
| 12 |
+
Russian
|
| 13 |
+
Swedish
|
| 14 |
+
Ukrainian
|
| 15 |
+
Chinese
|
datasets/multilingual-phonemes-10k-alpha/pl.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/pt.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:458b4ec76cb331fbcdd1da6f83ccf0277fed2b0866a6c0ac25ccc47f1f4c9fac
|
| 3 |
+
size 11076509
|
datasets/multilingual-phonemes-10k-alpha/ru.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:29cea02458a6e82bcfe35a30b12ebf29031200c26b2c4c621b76d975e3cfd27e
|
| 3 |
+
size 15753792
|
datasets/multilingual-phonemes-10k-alpha/source.txt
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha
|
datasets/multilingual-phonemes-10k-alpha/sv.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
datasets/multilingual-phonemes-10k-alpha/uk.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4965c576559b6542a8b4127b1b13e81913b736f4eb6852b494dfbf7010866287
|
| 3 |
+
size 13127182
|
datasets/multilingual-phonemes-10k-alpha/zh.json
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:d6f6acdd4b4e05158361e2ddc1621172355a8b9b330f06a3c0873052d3ca44fc
|
| 3 |
+
size 20594478
|
models/ar/StyleTTS2-LibriTTS-arabic/.gitattributes
ADDED
|
@@ -0,0 +1,36 @@
|
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
synthesized_audio.wav filter=lfs diff=lfs merge=lfs -text
|
models/ar/StyleTTS2-LibriTTS-arabic/README.md
ADDED
|
@@ -0,0 +1,142 @@
|
| 1 |
+
---
|
| 2 |
+
language: ar
|
| 3 |
+
tags:
|
| 4 |
+
- text-to-speech
|
| 5 |
+
- tts
|
| 6 |
+
- arabic
|
| 7 |
+
- styletts2
|
| 8 |
+
- pl-bert
|
| 9 |
+
license: mit
|
| 10 |
+
hardware: H100
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# Model Card for Arabic StyleTTS2
|
| 14 |
+
|
| 15 |
+
This is an Arabic text-to-speech model based on the StyleTTS2 architecture and adapted for Arabic speech synthesis. It produces good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize performance for Arabic. All training objectives from the original StyleTTS2 were kept except the WavLM objectives, which were removed because they were primarily designed for English speech.
|
| 16 |
+
|
| 17 |
+
## Example
|
| 18 |
+
|
| 19 |
+
Here is an example output from the model:
|
| 20 |
+
|
| 21 |
+
#### Sample 1
|
| 22 |
+
<audio controls>
|
| 23 |
+
<source src="https://huggingface.co/fadi77/StyleTTS2-LibriTTS-arabic/resolve/main/synthesized_audio.wav" type="audio/wav">
|
| 24 |
+
Your browser does not support the audio element.
|
| 25 |
+
</audio>
|
| 26 |
+
|
| 27 |
+
## Efficiency and Performance
|
| 28 |
+
|
| 29 |
+
A key strength of this model lies in its efficiency and performance characteristics:
|
| 30 |
+
|
| 31 |
+
- **Compact Architecture**: Achieves impressive quality with <100M parameters
|
| 32 |
+
- **Limited Training Data**: Trained on only 22 hours of single-speaker audio
|
| 33 |
+
- **Transfer Learning**: Successfully fine-tuned from LibriTTS multi-speaker model to single-speaker Arabic
|
| 34 |
+
- **Resource Efficient**: Good quality achieved despite limited computational resources
|
| 35 |
+
|
| 36 |
+
Note: According to the StyleTTS2 authors, performance should improve further when training a single-speaker model from scratch rather than fine-tuning. This wasn't attempted in our case due to computational resource constraints, suggesting potential for even better results with more extensive training.
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
## Model Details
|
| 40 |
+
|
| 41 |
+
### Model Description
|
| 42 |
+
|
| 43 |
+
This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
|
| 44 |
+
|
| 45 |
+
- **Developed by:** Fadi (GitHub: Fadi987)
|
| 46 |
+
- **Model type:** Text-to-Speech (StyleTTS2 architecture)
|
| 47 |
+
- **Language(s):** Arabic
|
| 48 |
+
- **Finetuned from model:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)
|
| 49 |
+
|
| 50 |
+
### Model Sources
|
| 51 |
+
|
| 52 |
+
- **Repository:** [Fadi987/StyleTTS2](https://github.com/Fadi987/StyleTTS2)
|
| 53 |
+
- **Paper:** [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
|
| 54 |
+
- **PL-BERT Model:** [fadi77/pl-bert](https://huggingface.co/fadi77/pl-bert)
|
| 55 |
+
|
| 56 |
+
## Uses
|
| 57 |
+
|
| 58 |
+
### Direct Use
|
| 59 |
+
|
| 60 |
+
The model can be used for generating Arabic speech from text. To use the model:
|
| 61 |
+
|
| 62 |
+
1. Clone the StyleTTS2 repository:
|
| 63 |
+
```bash
|
| 64 |
+
git clone https://github.com/Fadi987/StyleTTS2
|
| 65 |
+
cd StyleTTS2
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
2. Install `espeak-ng` as the phonemization backend:
|
| 69 |
+
```bash
|
| 70 |
+
# For macOS
|
| 71 |
+
brew install espeak-ng
|
| 72 |
+
|
| 73 |
+
# For Ubuntu/Debian
|
| 74 |
+
sudo apt-get install espeak-ng
|
| 75 |
+
|
| 76 |
+
# For Windows
|
| 77 |
+
# Download and install espeak-ng from: https://github.com/espeak-ng/espeak-ng/releases
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
3. Install Python dependencies:
|
| 81 |
+
```bash
|
| 82 |
+
pip install -r requirements.txt
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
4. Download the `model.pth` and `config.yml` files from this repository
|
| 86 |
+
|
| 87 |
+
5. Run inference using:
|
| 88 |
+
```bash
|
| 89 |
+
python inference.py --config config.yml --model model.pth --text "الإِتْقَانُ يَحْتَاجُ إِلَى الْعَمَلِ وَالْمُثَابَرَة"
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
Make sure to use properly diacritized Arabic text for best results.
|
| 93 |
+
|
| 94 |
+
### Out-of-Scope Use
|
| 95 |
+
|
| 96 |
+
The model is specifically designed for Arabic text-to-speech synthesis and may not perform well for:
|
| 97 |
+
- Other languages
|
| 98 |
+
- Heavy dialect variations
|
| 99 |
+
- Non-diacritized Arabic text
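Because the model expects diacritized input, it can help to screen text before synthesis. The sketch below is a rough heuristic, not part of the model's code: it counts harakat (U+064B to U+0652) per Arabic letter, and the 0.5 threshold and both function names are assumptions chosen for illustration.

```python
# Arabic diacritics (harakat): fathatan through sukun, U+064B..U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def diacritic_ratio(text: str) -> float:
    """Return diacritic marks per Arabic letter (0.0 if there are no Arabic letters)."""
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    marks = sum(1 for ch in text if ch in HARAKAT)
    return marks / letters if letters else 0.0


def looks_diacritized(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: treat text as diacritized if marks per letter reach `threshold`."""
    return diacritic_ratio(text) >= threshold


# kaf + fatha, ta + fatha, ba + fatha ("kataba"), written with escapes
with_marks = "\u0643\u064E\u062A\u064E\u0628\u064E"
# the same three letters without any marks
without_marks = "\u0643\u062A\u0628"

print(looks_diacritized(with_marks), looks_diacritized(without_marks))
```

A ratio check like this only flags missing marks; it cannot tell whether the diacritics present are correct.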
|
| 100 |
+
|
| 101 |
+
## Training Details
|
| 102 |
+
|
| 103 |
+
### Training Data
|
| 104 |
+
|
| 105 |
+
- Training was performed on approximately 22 hours of Arabic audiobook data
|
| 106 |
+
- Dataset: [fadi77/arabic-audiobook-dataset-24khz](https://huggingface.co/datasets/fadi77/arabic-audiobook-dataset-24khz)
|
| 107 |
+
- The PL-BERT component was trained on fully diacritized Wikipedia Arabic text
|
| 108 |
+
|
| 109 |
+
### Training Hyperparameters
|
| 110 |
+
|
| 111 |
+
- **Number of epochs:** 20
|
| 112 |
+
- **Diffusion training:** Started from epoch 5
|
| 113 |
+
|
| 114 |
+
### Objectives
|
| 115 |
+
- **Training objectives:** All original StyleTTS2 objectives maintained, except WavLM adversarial training
|
| 116 |
+
- **Validation objectives:** Identical to original StyleTTS2 validation process
|
| 117 |
+
|
| 118 |
+
### Compute Infrastructure
|
| 119 |
+
- **Hardware Type:** NVIDIA H100 GPU
|
| 120 |
+
|
| 121 |
+
### Notable Modifications from Original StyleTTS2 in Architecture and Objectives
|
| 122 |
+
The architecture of the model follows that of StyleTTS2 with the following exceptions:
|
| 123 |
+
- Removed WavLM adversarial training component
|
| 124 |
+
- Custom PL-BERT trained for Arabic language
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
## Citation
|
| 128 |
+
|
| 129 |
+
**BibTeX:**
|
| 130 |
+
```bibtex
|
| 131 |
+
@article{styletts2,
|
| 132 |
+
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
|
| 133 |
+
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
|
| 134 |
+
journal={arXiv preprint arXiv:2306.07691},
|
| 135 |
+
year={2023}
|
| 136 |
+
}
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
## Model Card Contact
|
| 140 |
+
|
| 141 |
+
GitHub: [@Fadi987](https://github.com/Fadi987)
|
| 142 |
+
Hugging Face: [@fadi77](https://huggingface.co/fadi77)
|
models/ar/StyleTTS2-LibriTTS-arabic/config.yml
ADDED
|
@@ -0,0 +1,114 @@
|
| 1 |
+
log_dir: "/style_tts2/Models/FineTune.AudioBook"
|
| 2 |
+
log_interval: 10
|
| 3 |
+
device: "cuda"
|
| 4 |
+
epochs: 25 # number of finetuning epochs
|
| 5 |
+
batch_size: 6
|
| 6 |
+
max_len: 300 # maximum number of frames
|
| 7 |
+
pretrained_model_repo: "yl4579/StyleTTS2-LibriTTS"
|
| 8 |
+
pretrained_model_filename: "Models/LibriTTS/epochs_2nd_00020.pth"
|
| 9 |
+
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
|
| 10 |
+
load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters
|
| 11 |
+
|
| 12 |
+
F0_path: "/root/Utils/JDC/bst.t7"
|
| 13 |
+
ASR_config: "/root/Utils/ASR/config.yml"
|
| 14 |
+
ASR_path: "/root/Utils/ASR/epoch_00080.pth"
|
| 15 |
+
PLBERT_repo_id: "fadi77/pl-bert"
|
| 16 |
+
PLBERT_dirname: "models/mlm_only_with_diacritics"
|
| 17 |
+
|
| 18 |
+
data_params:
|
| 19 |
+
train_data: "Data/youtube_train_list.txt"
|
| 20 |
+
val_data: "Data/youtube_val_list.txt"
|
| 21 |
+
root_path: "Youtube/wavs"
|
| 22 |
+
OOD_data: "Data/youtube_train_list.txt"
|
| 23 |
+
min_length: 50 # sample until texts with this size are obtained for OOD texts
|
| 24 |
+
|
| 25 |
+
preprocess_params:
|
| 26 |
+
sr: 24000
|
| 27 |
+
spect_params:
|
| 28 |
+
n_fft: 2048
|
| 29 |
+
win_length: 1200
|
| 30 |
+
hop_length: 300
|
| 31 |
+
|
| 32 |
+
model_params:
|
| 33 |
+
multispeaker: false
|
| 34 |
+
|
| 35 |
+
dim_in: 64
|
| 36 |
+
hidden_dim: 512
|
| 37 |
+
max_conv_dim: 512
|
| 38 |
+
n_layer: 3
|
| 39 |
+
n_mels: 80
|
| 40 |
+
|
| 41 |
+
n_token: 178 # number of phoneme tokens
|
| 42 |
+
max_dur: 50 # maximum duration of a single phoneme
|
| 43 |
+
style_dim: 128 # style vector size
|
| 44 |
+
|
| 45 |
+
dropout: 0.2
|
| 46 |
+
|
| 47 |
+
# config for decoder
|
| 48 |
+
decoder:
|
| 49 |
+
type: 'hifigan' # either hifigan or istftnet
|
| 50 |
+
resblock_kernel_sizes: [3,7,11]
|
| 51 |
+
upsample_rates : [10,5,3,2]
|
| 52 |
+
upsample_initial_channel: 512
|
| 53 |
+
resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
|
| 54 |
+
upsample_kernel_sizes: [20,10,6,4]
|
| 55 |
+
|
| 56 |
+
# speech language model config
|
| 57 |
+
slm:
|
| 58 |
+
model: 'microsoft/wavlm-base-plus'
|
| 59 |
+
sr: 16000 # sampling rate of SLM
|
| 60 |
+
hidden: 768 # hidden size of SLM
|
| 61 |
+
nlayers: 13 # number of layers of SLM
|
| 62 |
+
initial_channel: 64 # initial channels of SLM discriminator head
|
| 63 |
+
|
| 64 |
+
# style diffusion model config
|
| 65 |
+
diffusion:
|
| 66 |
+
embedding_mask_proba: 0.1
|
| 67 |
+
# transformer config
|
| 68 |
+
transformer:
|
| 69 |
+
num_layers: 3
|
| 70 |
+
num_heads: 8
|
| 71 |
+
head_features: 64
|
| 72 |
+
multiplier: 2
|
| 73 |
+
|
| 74 |
+
# diffusion distribution config
|
| 75 |
+
dist:
|
| 76 |
+
sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
|
| 77 |
+
estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
|
| 78 |
+
mean: -3.0
|
| 79 |
+
std: 1.0
|
| 80 |
+
|
| 81 |
+
loss_params:
|
| 82 |
+
lambda_mel: 5. # mel reconstruction loss
|
| 83 |
+
lambda_gen: 1. # generator loss
|
| 84 |
+
lambda_slm: 1. # slm feature matching loss
|
| 85 |
+
|
| 86 |
+
lambda_mono: 1. # monotonic alignment loss (TMA)
|
| 87 |
+
lambda_s2s: 1. # sequence-to-sequence loss (TMA)
|
| 88 |
+
|
| 89 |
+
lambda_F0: 1. # F0 reconstruction loss
|
| 90 |
+
lambda_norm: 1. # norm reconstruction loss
|
| 91 |
+
lambda_dur: 1. # duration loss
|
| 92 |
+
lambda_ce: 20. # duration predictor probability output CE loss
|
| 93 |
+
lambda_sty: 1. # style reconstruction loss
|
| 94 |
+
lambda_diff: 1. # score matching loss
|
| 95 |
+
|
| 96 |
+
# Note: Current values for training are only adequate for second stage finetuning.
|
| 97 |
+
diffusion_training_epoch: 5
|
| 98 |
+
joint_training_epoch: 100
|
| 99 |
+
|
| 100 |
+
# Note: Current values for learning rates are very low. This is only adequate for second stage finetuning.
|
| 101 |
+
optimizer_params:
|
| 102 |
+
lr: 0.0001 # general learning rate
|
| 103 |
+
bert_lr: 0.00001 # learning rate for PLBERT
|
| 104 |
+
ft_lr: 0.0001 # learning rate for acoustic modules
|
| 105 |
+
|
| 106 |
+
slmadv_params:
|
| 107 |
+
min_len: 400 # minimum length of samples
|
| 108 |
+
max_len: 500 # maximum length of samples
|
| 109 |
+
batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
|
| 110 |
+
skip_update: 10 # update the discriminator once every this many generator updates
|
| 111 |
+
thresh: 5 # gradient norm above which the gradient is scaled
|
| 112 |
+
scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
|
| 113 |
+
sig: 1.5 # sigma for differentiable duration modeling
|
| 114 |
+
|
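The `dist` block in the config above (`mean`, `std`, `sigma_data`) resembles the noise-level parameterization used in EDM-style diffusion training (Karras et al.). The following is a minimal sketch under that assumption: the noise level is drawn from a log-normal distribution, and `sigma_data` sets the preconditioning scalings. The function names `sample_sigma` and `edm_scalings` are hypothetical and are not taken from the StyleTTS 2 code.

```python
import math
import random
from typing import Optional, Tuple


def sample_sigma(mean: float = -3.0, std: float = 1.0,
                 rng: Optional[random.Random] = None) -> float:
    """Draw a noise level with ln(sigma) ~ Normal(mean, std^2) (assumed log-normal)."""
    rng = rng or random.Random()
    return math.exp(rng.gauss(mean, std))


def edm_scalings(sigma: float, sigma_data: float = 0.2) -> Tuple[float, float, float]:
    """Return (c_skip, c_out, c_in) in the EDM preconditioning form (assumed here)."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    return c_skip, c_out, c_in


sigma = sample_sigma(rng=random.Random(0))
c_skip, c_out, c_in = edm_scalings(sigma)
```

With `mean: -3.0`, most sampled noise levels are small (the median is `exp(-3)`, roughly 0.05), so under this reading training concentrates on lightly noised style vectors.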
models/ar/StyleTTS2-LibriTTS-arabic/model.pth
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:59d2323412f0c55c774b5675b45e5c12659c0d9e0f9e7012eecc6b7dd845b132
|
| 3 |
+
size 2201968238
|
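The three lines above are a Git LFS pointer, not the checkpoint itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a minimal sketch, a downloaded `model.pth` could be checked against the pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream the file and compare its size and SHA-256 digest with the pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:59d2323412f0c55c774b5675b45e5c12659c0d9e0f9e7012eecc6b7dd845b132\n"
    "size 2201968238\n"
)
```

Calling `matches_pointer("model.pth", pointer)` on a local copy returns `True` only if both the byte count and the hash agree.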
models/ar/StyleTTS2-LibriTTS-arabic/source.txt
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
https://huggingface.co/fadi77/StyleTTS2-LibriTTS-arabic
|
models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f60e90523d734eff1b9f4b95cca49f22277df5cb4acd0bd347fde18f1c3b0469
|
| 3 |
+
size 1795058
|
models/en/StyleTTS2-LibriTTS/.gitattributes
ADDED
|
@@ -0,0 +1,35 @@
| 1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
| 11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
| 12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
| 13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
| 14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
| 15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
| 16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
| 17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
| 18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
| 19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
| 20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
| 21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
| 23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
| 24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
| 25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
| 27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
| 28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
| 29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
| 30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
| 31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
| 32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
| 33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
models/en/StyleTTS2-LibriTTS/Models/config.yml
ADDED
|
@@ -0,0 +1,21 @@
| 1 |
+
{ASR_config: Utils/ASR/config.yml, ASR_path: Utils/ASR/epoch_00080.pth, F0_path: Utils/JDC/bst.t7,
|
| 2 |
+
PLBERT_dir: Utils/PLBERT/, batch_size: 8, data_params: {OOD_data: Data/OOD_texts.txt,
|
| 3 |
+
min_length: 50, root_path: '', train_data: Data/train_list.txt, val_data: Data/val_list.txt},
|
| 4 |
+
device: cuda, epochs_1st: 40, epochs_2nd: 25, first_stage_path: first_stage.pth,
|
| 5 |
+
load_only_params: false, log_dir: Models/LibriTTS, log_interval: 10, loss_params: {
|
| 6 |
+
TMA_epoch: 4, diff_epoch: 0, joint_epoch: 0, lambda_F0: 1.0, lambda_ce: 20.0,
|
| 7 |
+
lambda_diff: 1.0, lambda_dur: 1.0, lambda_gen: 1.0, lambda_mel: 5.0, lambda_mono: 1.0,
|
| 8 |
+
lambda_norm: 1.0, lambda_s2s: 1.0, lambda_slm: 1.0, lambda_sty: 1.0}, max_len: 300,
|
| 9 |
+
model_params: {decoder: {resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3,
|
| 10 |
+
5]], resblock_kernel_sizes: [3, 7, 11], type: hifigan, upsample_initial_channel: 512,
|
| 11 |
+
upsample_kernel_sizes: [20, 10, 6, 4], upsample_rates: [10, 5, 3, 2]}, diffusion: {
|
| 12 |
+
dist: {estimate_sigma_data: true, mean: -3.0, sigma_data: 0.19926648961191362,
|
| 13 |
+
std: 1.0}, embedding_mask_proba: 0.1, transformer: {head_features: 64, multiplier: 2,
|
| 14 |
+
num_heads: 8, num_layers: 3}}, dim_in: 64, dropout: 0.2, hidden_dim: 512,
|
| 15 |
+
max_conv_dim: 512, max_dur: 50, multispeaker: true, n_layer: 3, n_mels: 80, n_token: 178,
|
| 16 |
+
slm: {hidden: 768, initial_channel: 64, model: microsoft/wavlm-base-plus, nlayers: 13,
|
| 17 |
+
sr: 16000}, style_dim: 128}, optimizer_params: {bert_lr: 1.0e-05, ft_lr: 1.0e-05,
|
| 18 |
+
lr: 0.0001}, preprocess_params: {spect_params: {hop_length: 300, n_fft: 2048,
|
| 19 |
+
win_length: 1200}, sr: 24000}, pretrained_model: Models/LibriTTS/epoch_2nd_00002.pth,
|
| 20 |
+
save_freq: 1, second_stage_load_pretrained: true, slmadv_params: {batch_percentage: 0.5,
|
| 21 |
+
iter: 20, max_len: 500, min_len: 400, scale: 0.01, sig: 1.5, thresh: 5}}
|
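A few values in the config above can be cross-checked with simple arithmetic. The sketch below copies `sr`, `hop_length`, `upsample_rates`, and `max_len` from the file. Reading `max_len` as a count of mel frames is an assumption, as is the expectation that the decoder's total upsampling factor should equal `hop_length`.

```python
import math

# Values copied from the config above.
sr = 24000                        # preprocess_params.sr
hop_length = 300                  # preprocess_params.spect_params.hop_length
upsample_rates = [10, 5, 3, 2]    # model_params.decoder.upsample_rates
max_len = 300                     # top-level max_len (assumed to be in mel frames)

# Mel frames produced per second of audio.
frames_per_second = sr / hop_length

# Total upsampling of the decoder: one mel frame becomes this many samples.
total_upsampling = math.prod(upsample_rates)

# Longest training clip in seconds, if max_len counts mel frames.
max_clip_seconds = max_len * hop_length / sr

print(frames_per_second, total_upsampling, max_clip_seconds)
```

Here the product of `upsample_rates` is 300, the same as `hop_length`, which is consistent with a vocoder that expands each mel frame back into one hop of waveform samples.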
models/en/StyleTTS2-LibriTTS/Models/epochs_2nd_00020.pth
ADDED
|
@@ -0,0 +1,3 @@
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1164ffe19a17449d2c722234cecaf2836b35a698fb8ffd42562d2663657dca0a
|
| 3 |
+
size 771390526
|
models/en/StyleTTS2-LibriTTS/README.md
ADDED
|
@@ -0,0 +1,100 @@
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
- multilingual
|
| 5 |
+
tags:
|
| 6 |
+
- text-to-speech
|
| 7 |
+
- speech-synthesis
|
| 8 |
+
- pytorch
|
| 9 |
+
- styletts2
|
| 10 |
+
- speaches
|
| 11 |
+
- neural-tts
|
| 12 |
+
- voice-cloning
|
| 13 |
+
pipeline_tag: text-to-speech
|
| 14 |
+
library_name: pytorch
|
| 15 |
+
license: mit
|
| 16 |
+
datasets:
|
| 17 |
+
- LibriTTS
|
| 18 |
+
metrics:
|
| 19 |
+
- naturalness
|
| 20 |
+
- similarity
|
| 21 |
+
widget:
|
| 22 |
+
- text: "Hello, this is a sample of StyleTTS2 speech synthesis."
|
| 23 |
+
example_title: "English Sample"
|
| 24 |
+
- text: "StyleTTS2 can synthesize high-quality speech with style control."
|
| 25 |
+
example_title: "Style Control Sample"
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
# StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
|
| 29 |
+
|
| 30 |
+
StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level text-to-speech synthesis. This model builds upon the original StyleTTS with significant improvements in naturalness and similarity.
|
| 31 |
+
|
| 32 |
+
## Model Description
|
| 33 |
+
|
| 34 |
+
- **Model Type**: Neural Text-to-Speech (TTS)
|
| 35 |
+
- **Language(s)**: English (primary), with support for 18+ languages
|
| 36 |
+
- **License**: MIT
|
| 37 |
+
- **Paper**: [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
|
| 38 |
+
- **Sample Rate**: 24,000 Hz
|
| 39 |
+
- **Architecture**: Style diffusion with adversarial training
|
| 40 |
+
|
| 41 |
+
## Features
|
| 42 |
+
|
| 43 |
+
- **High-Quality Synthesis**: Achieves human-level naturalness in speech synthesis
|
| 44 |
+
- **Style Control**: Advanced style transfer and voice cloning capabilities
|
| 45 |
+
- **Multi-Language Support**: Primary English model with support for 18+ additional languages
|
| 46 |
+
- **Voice Cloning**: Can clone voices from reference audio samples
|
| 47 |
+
- **Diffusion-Based**: Uses diffusion models for high-quality audio generation
|
| 48 |
+
|
| 49 |
+
## Usage
|
| 50 |
+
|
| 51 |
+
This model is designed for text-to-speech synthesis with the following capabilities:
|
| 52 |
+
|
| 53 |
+
1. **Multi-Voice Synthesis**: Generate speech using preset voice styles
|
| 54 |
+
2. **Voice Cloning**: Clone voices from reference audio samples
|
| 55 |
+
3. **Style Control**: Fine-tune synthesis parameters for different styles
|
| 56 |
+
4. **Multi-Language**: Support for various languages with English-accented pronunciation
|
| 57 |
+
|
| 58 |
+
### Parameters
|
| 59 |
+
|
| 60 |
+
- `alpha` (0.0-1.0): Style blending factor (default: 0.3)
|
| 61 |
+
- `beta` (0.0-1.0): Style mixing factor (default: 0.7)
|
| 62 |
+
- `diffusion_steps` (3-20): Number of diffusion steps for quality (default: 5)
|
| 63 |
+
- `embedding_scale` (1.0-10.0): Embedding scale factor (default: 1.0)
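The README does not define how `alpha` and `beta` are applied. One plausible reading, offered here as an assumption rather than as the model's documented behavior, is a linear interpolation between a style vector predicted from the text and one extracted from reference audio. The helper `blend_style` below is hypothetical.

```python
from typing import List


def blend_style(predicted: List[float], reference: List[float], weight: float) -> List[float]:
    """Linear interpolation: weight * predicted + (1 - weight) * reference.

    weight = 0.0 keeps the reference style unchanged; weight = 1.0 uses only
    the style predicted from the text (assumed semantics).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0.0, 1.0]")
    if len(predicted) != len(reference):
        raise ValueError("style vectors must have the same length")
    return [weight * p + (1.0 - weight) * r for p, r in zip(predicted, reference)]


# Toy vectors; real style embeddings would be much longer (style_dim is 128 in the config).
predicted_style = [1.0, 1.0]
reference_style = [0.0, 0.0]
blended = blend_style(predicted_style, reference_style, weight=0.3)  # alpha default
```

Under this reading, the defaults (`alpha` 0.3, `beta` 0.7) would keep one part of the style closer to the reference voice while letting the other follow the text more strongly.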
|
| 64 |
+
|
| 65 |
+
## Training Data
|
| 66 |
+
|
| 67 |
+
- **Primary Dataset**: LibriTTS
|
| 68 |
+
- **Languages**: English (primary) + 18 additional languages
|
| 69 |
+
- **Training Approach**: Style diffusion with adversarial training using large speech language models
|
| 70 |
+
|
| 71 |
+
## Performance
|
| 72 |
+
|
| 73 |
+
StyleTTS 2 achieves human-level performance in:
|
| 74 |
+
- **Naturalness**: Comparable to human speech in listening tests
|
| 75 |
+
- **Similarity**: High fidelity voice cloning and style transfer
|
| 76 |
+
- **Quality**: Superior audio quality compared to previous TTS models
|
| 77 |
+
|
| 78 |
+
## Limitations
|
| 79 |
+
|
| 80 |
+
- **Compute Requirements**: Requires significant computational resources for inference
|
| 81 |
+
- **English-First**: Optimized for English; other languages may be pronounced with an English accent
|
| 82 |
+
- **Context Dependency**: Performance varies with input text length and complexity
|
| 83 |
+
|
| 84 |
+
## Citation
|
| 85 |
+
|
| 86 |
+
```bibtex
|
| 87 |
+
@article{li2024styletts2,
|
| 88 |
+
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
|
| 89 |
+
author={Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
|
| 90 |
+
journal={arXiv preprint arXiv:2306.07691},
|
| 91 |
+
year={2024}
|
| 92 |
+
}
|
| 93 |
+
```
|
| 94 |
+
|
| 95 |
+
## Links
|
| 96 |
+
|
| 97 |
+
- Paper: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
|
| 98 |
+
- Samples: [https://styletts2.github.io/](https://styletts2.github.io/)
|
| 99 |
+
- Code: [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
|
| 100 |
+
- License: MIT License
|
models/en/StyleTTS2-LibriTTS/source.txt
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
https://huggingface.co/jakezp/StyleTTS2-LibriTTS
|