niobures committed on
Commit
4bfeae6
·
verified ·
1 Parent(s): b60c00f

StyleTTS 2 (code, datasets, models, paper)

Files changed (38)
  1. .gitattributes +10 -0
  2. StyleTTS 2. Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.pdf +3 -0
  3. code/StyleTTS-VC.zip +3 -0
  4. code/StyleTTS.zip +3 -0
  5. code/StyleTTS2.zip +3 -0
  6. code/stylish-tts.zip +3 -0
  7. datasets/multilingual-phonemes-10k-alpha/.gitattributes +56 -0
  8. datasets/multilingual-phonemes-10k-alpha/LICENSE +0 -0
  9. datasets/multilingual-phonemes-10k-alpha/README.md +102 -0
  10. datasets/multilingual-phonemes-10k-alpha/ca.json +3 -0
  11. datasets/multilingual-phonemes-10k-alpha/de.json +0 -0
  12. datasets/multilingual-phonemes-10k-alpha/el.json +3 -0
  13. datasets/multilingual-phonemes-10k-alpha/en-xl.json +3 -0
  14. datasets/multilingual-phonemes-10k-alpha/en.json +0 -0
  15. datasets/multilingual-phonemes-10k-alpha/es.json +0 -0
  16. datasets/multilingual-phonemes-10k-alpha/fa.json +3 -0
  17. datasets/multilingual-phonemes-10k-alpha/fi.json +0 -0
  18. datasets/multilingual-phonemes-10k-alpha/fr.json +0 -0
  19. datasets/multilingual-phonemes-10k-alpha/it.json +0 -0
  20. datasets/multilingual-phonemes-10k-alpha/languages.txt +15 -0
  21. datasets/multilingual-phonemes-10k-alpha/pl.json +0 -0
  22. datasets/multilingual-phonemes-10k-alpha/pt.json +3 -0
  23. datasets/multilingual-phonemes-10k-alpha/ru.json +3 -0
  24. datasets/multilingual-phonemes-10k-alpha/source.txt +1 -0
  25. datasets/multilingual-phonemes-10k-alpha/sv.json +0 -0
  26. datasets/multilingual-phonemes-10k-alpha/uk.json +3 -0
  27. datasets/multilingual-phonemes-10k-alpha/zh.json +3 -0
  28. models/ar/StyleTTS2-LibriTTS-arabic/.gitattributes +36 -0
  29. models/ar/StyleTTS2-LibriTTS-arabic/README.md +142 -0
  30. models/ar/StyleTTS2-LibriTTS-arabic/config.yml +114 -0
  31. models/ar/StyleTTS2-LibriTTS-arabic/model.pth +3 -0
  32. models/ar/StyleTTS2-LibriTTS-arabic/source.txt +1 -0
  33. models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav +3 -0
  34. models/en/StyleTTS2-LibriTTS/.gitattributes +35 -0
  35. models/en/StyleTTS2-LibriTTS/Models/config.yml +21 -0
  36. models/en/StyleTTS2-LibriTTS/Models/epochs_2nd_00020.pth +3 -0
  37. models/en/StyleTTS2-LibriTTS/README.md +100 -0
  38. models/en/StyleTTS2-LibriTTS/source.txt +1 -0
.gitattributes CHANGED
@@ -33,3 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ datasets/multilingual-phonemes-10k-alpha/ca.json filter=lfs diff=lfs merge=lfs -text
37
+ datasets/multilingual-phonemes-10k-alpha/el.json filter=lfs diff=lfs merge=lfs -text
38
+ datasets/multilingual-phonemes-10k-alpha/en-xl.json filter=lfs diff=lfs merge=lfs -text
39
+ datasets/multilingual-phonemes-10k-alpha/fa.json filter=lfs diff=lfs merge=lfs -text
40
+ datasets/multilingual-phonemes-10k-alpha/pt.json filter=lfs diff=lfs merge=lfs -text
41
+ datasets/multilingual-phonemes-10k-alpha/ru.json filter=lfs diff=lfs merge=lfs -text
42
+ datasets/multilingual-phonemes-10k-alpha/uk.json filter=lfs diff=lfs merge=lfs -text
43
+ datasets/multilingual-phonemes-10k-alpha/zh.json filter=lfs diff=lfs merge=lfs -text
44
+ models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav filter=lfs diff=lfs merge=lfs -text
45
+ StyleTTS[[:space:]]2.[[:space:]]Towards[[:space:]]Human-Level[[:space:]]Text-to-Speech[[:space:]]through[[:space:]]Style[[:space:]]Diffusion[[:space:]]and[[:space:]]Adversarial[[:space:]]Training[[:space:]]with[[:space:]]Large[[:space:]]Speech[[:space:]]Language[[:space:]]Models.pdf filter=lfs diff=lfs merge=lfs -text
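Reviewer note: the last pattern above escapes every space in the PDF filename as `[[:space:]]`, because whitespace separates the pattern from its attributes in `.gitattributes`. The following is a minimal sketch of that escaping, assuming only spaces need handling; the helper names `gitattributes_pattern` and `lfs_attribute_line` are hypothetical and not part of any Git tooling.

```python
def gitattributes_pattern(path: str) -> str:
    """Escape spaces so the whole path stays a single .gitattributes pattern."""
    # Assumption: only literal spaces are handled; other special characters
    # (for example a leading '#') would need their own treatment.
    return path.replace(" ", "[[:space:]]")


def lfs_attribute_line(path: str) -> str:
    """Build an LFS tracking line in the same shape as the entries in this diff."""
    return f"{gitattributes_pattern(path)} filter=lfs diff=lfs merge=lfs -text"


if __name__ == "__main__":
    print(lfs_attribute_line("StyleTTS 2. Towards Human-Level.pdf"))
```

Running the sketch on a shortened filename produces a line in the same form as entry 45 above.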
StyleTTS 2. Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f528cab389ea8af17cfcc95ab5847975bb79b00dd46424d6b2d44a1e44017c55
3
+ size 4082571
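Reviewer note: every large file in this commit is stored as a three-line Git LFS pointer (`version`, `oid`, `size`) like the one above. As a rough sketch, a downloaded file could be checked against its pointer as follows; the function names are illustrative, and the code assumes the `oid` uses `sha256`, as all pointers in this diff do.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Return True if the bytes have the size and SHA-256 digest named by the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )
```

For multi-gigabyte files such as `model.pth`, hashing in chunks from disk would be preferable to loading the whole file into memory; the sketch keeps everything in memory for brevity.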
code/StyleTTS-VC.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d26c819eea0d52ba571d8f7dc69ccb8acb3db568ad3983ef0270929406975bf8
3
+ size 215127843
code/StyleTTS.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9bb6571ecb71baf369e4a04f9b92c7d56b82af8d71c0f2ec69786de0064ddb1a
3
+ size 235881010
code/StyleTTS2.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cdb682c2d4bb88dbd0556de55b3893d5cb96c72bdd4323841ec2296e999c9897
3
+ size 280792923
code/stylish-tts.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f941ff08b1be23b6531b127bd41149478facfbf728f0e574434e23fd3c8e7bdc
3
+ size 2739847
datasets/multilingual-phonemes-10k-alpha/.gitattributes ADDED
@@ -0,0 +1,56 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.lz4 filter=lfs diff=lfs merge=lfs -text
12
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
13
+ *.model filter=lfs diff=lfs merge=lfs -text
14
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
15
+ *.npy filter=lfs diff=lfs merge=lfs -text
16
+ *.npz filter=lfs diff=lfs merge=lfs -text
17
+ *.onnx filter=lfs diff=lfs merge=lfs -text
18
+ *.ot filter=lfs diff=lfs merge=lfs -text
19
+ *.parquet filter=lfs diff=lfs merge=lfs -text
20
+ *.pb filter=lfs diff=lfs merge=lfs -text
21
+ *.pickle filter=lfs diff=lfs merge=lfs -text
22
+ *.pkl filter=lfs diff=lfs merge=lfs -text
23
+ *.pt filter=lfs diff=lfs merge=lfs -text
24
+ *.pth filter=lfs diff=lfs merge=lfs -text
25
+ *.rar filter=lfs diff=lfs merge=lfs -text
26
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
27
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
29
+ *.tar filter=lfs diff=lfs merge=lfs -text
30
+ *.tflite filter=lfs diff=lfs merge=lfs -text
31
+ *.tgz filter=lfs diff=lfs merge=lfs -text
32
+ *.wasm filter=lfs diff=lfs merge=lfs -text
33
+ *.xz filter=lfs diff=lfs merge=lfs -text
34
+ *.zip filter=lfs diff=lfs merge=lfs -text
35
+ *.zst filter=lfs diff=lfs merge=lfs -text
36
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
37
+ # Audio files - uncompressed
38
+ *.pcm filter=lfs diff=lfs merge=lfs -text
39
+ *.sam filter=lfs diff=lfs merge=lfs -text
40
+ *.raw filter=lfs diff=lfs merge=lfs -text
41
+ # Audio files - compressed
42
+ *.aac filter=lfs diff=lfs merge=lfs -text
43
+ *.flac filter=lfs diff=lfs merge=lfs -text
44
+ *.mp3 filter=lfs diff=lfs merge=lfs -text
45
+ *.ogg filter=lfs diff=lfs merge=lfs -text
46
+ *.wav filter=lfs diff=lfs merge=lfs -text
47
+ # Image files - uncompressed
48
+ *.bmp filter=lfs diff=lfs merge=lfs -text
49
+ *.gif filter=lfs diff=lfs merge=lfs -text
50
+ *.png filter=lfs diff=lfs merge=lfs -text
51
+ *.tiff filter=lfs diff=lfs merge=lfs -text
52
+ # Image files - compressed
53
+ *.jpg filter=lfs diff=lfs merge=lfs -text
54
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
55
+ *.webp filter=lfs diff=lfs merge=lfs -text
56
+ *.json filter=lfs diff=lfs merge=lfs -text
datasets/multilingual-phonemes-10k-alpha/LICENSE ADDED
File without changes
datasets/multilingual-phonemes-10k-alpha/README.md ADDED
@@ -0,0 +1,102 @@
1
+ ---
2
+ license: cc-by-sa-3.0
3
+ license_name: cc-by-sa
4
+ configs:
5
+ - config_name: en
6
+ data_files: en.json
7
+ default: true
8
+ - config_name: en-xl
9
+ data_files: en-xl.json
10
+ - config_name: ca
11
+ data_files: ca.json
12
+ - config_name: de
13
+ data_files: de.json
14
+ - config_name: es
15
+ data_files: es.json
16
+ - config_name: el
17
+ data_files: el.json
18
+ - config_name: fa
19
+ data_files: fa.json
20
+ - config_name: fi
21
+ data_files: fi.json
22
+ - config_name: fr
23
+ data_files: fr.json
24
+ - config_name: it
25
+ data_files: it.json
26
+ - config_name: pl
27
+ data_files: pl.json
28
+ - config_name: pt
29
+ data_files: pt.json
30
+ - config_name: ru
31
+ data_files: ru.json
32
+ - config_name: sv
33
+ data_files: sv.json
34
+ - config_name: uk
35
+ data_files: uk.json
36
+ - config_name: zh
37
+ data_files: zh.json
38
+ language:
39
+ - en
40
+ - ca
41
+ - de
42
+ - es
43
+ - el
44
+ - fa
45
+ - fi
46
+ - fr
47
+ - it
48
+ - pl
49
+ - pt
50
+ - ru
51
+ - sv
52
+ - uk
53
+ - zh
54
+ tags:
55
+ - synthetic
56
+ ---
57
+
58
+ # Multilingual Phonemes 10K Alpha
59
+
60
+
61
+ This dataset contains approximately 10,000 pairs of text and phonemes from each supported language. We support 15 languages in this dataset, so we have a total of ~150K pairs. This does not include the English-XL dataset, which includes another 100K unique rows.
62
+
63
+ ## Languages
64
+
65
+ The 15 supported languages, plus the English-only English-XL split (~100K additional phonemized pairs that appear in no other split), are:
66
+
67
+ * English (en)
68
+ * English-XL (en-xl): ~100K phonemized pairs, English-only
69
+ * Catalan (ca)
70
+ * German (de)
71
+ * Spanish (es)
72
+ * Greek (el)
73
+ * Persian (fa): Requested by [@Respair](https://huggingface.co/Respair)
74
+ * Finnish (fi)
75
+ * French (fr)
76
+ * Italian (it)
77
+ * Polish (pl)
78
+ * Portuguese (pt)
79
+ * Russian (ru)
80
+ * Swedish (sv)
81
+ * Ukrainian (uk)
82
+ * Chinese (zh): Thank you to [@eugenepentland](https://huggingface.co/eugenepentland) for assistance in processing this text, as East-Asian languages are the most compute-intensive!
83
+
84
+ ## License + Credits
85
+
86
+ Source data comes from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and is licensed under CC-BY-SA 3.0. This dataset is licensed under CC-BY-SA 3.0.
87
+
88
+ ## Processing
89
+
90
+ We utilized the following process to preprocess the dataset:
91
+
92
+ 1. Download data from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) by language, selecting only the first Parquet file and naming it with the language code
93
+ 2. Process using [Data Preprocessing Scripts (StyleTTS 2 Community members only)](https://huggingface.co/styletts2-community/data-preprocessing-scripts) and modify the code to work with the language
94
+ 3. Script: Clean the text
95
+ 4. Script: Remove ultra-short phrases
96
+ 5. Script: Phonemize
97
+ 6. Script: Save JSON
98
+ 7. Upload dataset
99
+
100
+ ## Note
101
+
102
+ East-Asian languages are experimental. We do not distinguish between Traditional and Simplified Chinese. The dataset consists mainly of Simplified Chinese in the `zh` split. We recommend converting characters to Simplified Chinese during inference, using a library such as `hanziconv` or `chinese-converter`.
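Reviewer note: the "Processing" steps in this README (clean, drop ultra-short phrases, phonemize, save JSON) can be sketched roughly as below. The actual preprocessing scripts are restricted to community members, so everything here is an assumption: the `text` and `phonemes` keys, the `min_chars` threshold, and the `phonemize` callable (a stand-in for whichever phonemizer backend was really used) are all hypothetical.

```python
import json
import re
from typing import Callable, Iterable


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and trim the ends (a stand-in for the real cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def build_pairs(
    lines: Iterable[str],
    phonemize: Callable[[str], str],
    min_chars: int = 10,  # hypothetical cutoff for "ultra-short" phrases
) -> list:
    """Clean each line, skip ultra-short ones, and pair the text with its phonemes."""
    pairs = []
    for raw in lines:
        text = clean_text(raw)
        if len(text) < min_chars:
            continue
        pairs.append({"text": text, "phonemes": phonemize(text)})
    return pairs


def save_json(pairs: list, path: str) -> None:
    """Write the pairs as UTF-8 JSON, keeping non-ASCII phoneme symbols readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, ensure_ascii=False)
```

Passing the phonemizer in as a callable keeps the sketch independent of any particular backend and makes it easy to try with a dummy function.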
datasets/multilingual-phonemes-10k-alpha/ca.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:993ac7b508efb60c009c97861941676699549a82828898d214e40b0d43c459ab
3
+ size 10888235
datasets/multilingual-phonemes-10k-alpha/de.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/el.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8525fbdbfce41ad9bd29b390c49854edcdd942a35e70631bae52e3210cb49a5
3
+ size 12178762
datasets/multilingual-phonemes-10k-alpha/en-xl.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:56856fb3ef4e8eb9d6335c8199d51bb40eee4affbb8d344cc76e317fc72b8d45
3
+ size 84063713
datasets/multilingual-phonemes-10k-alpha/en.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/es.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/fa.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f33e05a4a5e3ee0104e0599c4830d4ff505debc5d77eaab5b989e9d8d1eda23e
3
+ size 18121401
datasets/multilingual-phonemes-10k-alpha/fi.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/fr.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/it.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/languages.txt ADDED
@@ -0,0 +1,15 @@
1
+ English
2
+ Catalan
3
+ German
4
+ Spanish
5
+ Greek
6
+ Persian
7
+ Finnish
8
+ French
9
+ Italian
10
+ Polish
11
+ Portuguese
12
+ Russian
13
+ Swedish
14
+ Ukrainian
15
+ Chinese
datasets/multilingual-phonemes-10k-alpha/pl.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/pt.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:458b4ec76cb331fbcdd1da6f83ccf0277fed2b0866a6c0ac25ccc47f1f4c9fac
3
+ size 11076509
datasets/multilingual-phonemes-10k-alpha/ru.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29cea02458a6e82bcfe35a30b12ebf29031200c26b2c4c621b76d975e3cfd27e
3
+ size 15753792
datasets/multilingual-phonemes-10k-alpha/source.txt ADDED
@@ -0,0 +1 @@
1
+ https://huggingface.co/datasets/styletts2-community/multilingual-phonemes-10k-alpha
datasets/multilingual-phonemes-10k-alpha/sv.json ADDED
The diff for this file is too large to render. See raw diff
 
datasets/multilingual-phonemes-10k-alpha/uk.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4965c576559b6542a8b4127b1b13e81913b736f4eb6852b494dfbf7010866287
3
+ size 13127182
datasets/multilingual-phonemes-10k-alpha/zh.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d6f6acdd4b4e05158361e2ddc1621172355a8b9b330f06a3c0873052d3ca44fc
3
+ size 20594478
models/ar/StyleTTS2-LibriTTS-arabic/.gitattributes ADDED
@@ -0,0 +1,36 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ synthesized_audio.wav filter=lfs diff=lfs merge=lfs -text
models/ar/StyleTTS2-LibriTTS-arabic/README.md ADDED
@@ -0,0 +1,142 @@
1
+ ---
2
+ language: ar
3
+ tags:
4
+ - text-to-speech
5
+ - tts
6
+ - arabic
7
+ - styletts2
8
+ - pl-bert
9
+ license: mit
10
+ hardware: H100
11
+ ---
12
+
13
+ # Model Card for Arabic StyleTTS2
14
+
15
+ This is an Arabic text-to-speech model based on the StyleTTS2 architecture and adapted for Arabic synthesis. It produces good-quality Arabic speech, though not yet state-of-the-art, and further experimentation is needed to optimize it for Arabic. All training objectives from the original StyleTTS2 were kept except the WavLM objectives, which were removed because they were primarily designed for English speech.
16
+
17
+ ## Example
18
+
19
+ Here is an example output from the model:
20
+
21
+ #### Sample 1
22
+ <audio controls>
23
+ <source src="https://huggingface.co/fadi77/StyleTTS2-LibriTTS-arabic/resolve/main/synthesized_audio.wav" type="audio/wav">
24
+ Your browser does not support the audio element.
25
+ </audio>
26
+
27
+ ## Efficiency and Performance
28
+
29
+ A key strength of this model lies in its efficiency and performance characteristics:
30
+
31
+ - **Compact Architecture**: Achieves impressive quality with <100M parameters
32
+ - **Limited Training Data**: Trained on only 22 hours of single-speaker audio
33
+ - **Transfer Learning**: Successfully fine-tuned from LibriTTS multi-speaker model to single-speaker Arabic
34
+ - **Resource Efficient**: Good quality achieved despite limited computational resources
35
+
36
+ Note: According to the StyleTTS2 authors, performance should improve further when training a single-speaker model from scratch rather than fine-tuning. This wasn't attempted in our case due to computational resource constraints, suggesting potential for even better results with more extensive training.
37
+
38
+
39
+ ## Model Details
40
+
41
+ ### Model Description
42
+
43
+ This model is a modified version of StyleTTS2, specifically adapted for Arabic text-to-speech synthesis. It incorporates a custom-trained PL-BERT model for Arabic language understanding and removes the WavLM adversarial training component (which was primarily designed for English).
44
+
45
+ - **Developed by:** Fadi (GitHub: Fadi987)
46
+ - **Model type:** Text-to-Speech (StyleTTS2 architecture)
47
+ - **Language(s):** Arabic
48
+ - **Finetuned from model:** [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)
49
+
50
+ ### Model Sources
51
+
52
+ - **Repository:** [Fadi987/StyleTTS2](https://github.com/Fadi987/StyleTTS2)
53
+ - **Paper:** [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
54
+ - **PL-BERT Model:** [fadi77/pl-bert](https://huggingface.co/fadi77/pl-bert)
55
+
56
+ ## Uses
57
+
58
+ ### Direct Use
59
+
60
+ The model can be used for generating Arabic speech from text. To use the model:
61
+
62
+ 1. Clone the StyleTTS2 repository:
63
+ ```bash
64
+ git clone https://github.com/Fadi987/StyleTTS2
65
+ cd StyleTTS2
66
+ ```
67
+
68
+ 2. Install `espeak-ng` for phonemization backend:
69
+ ```bash
70
+ # For macOS
71
+ brew install espeak-ng
72
+
73
+ # For Ubuntu/Debian
74
+ sudo apt-get install espeak-ng
75
+
76
+ # For Windows
77
+ # Download and install espeak-ng from: https://github.com/espeak-ng/espeak-ng/releases
78
+ ```
79
+
80
+ 3. Install Python dependencies:
81
+ ```bash
82
+ pip install -r requirements.txt
83
+ ```
84
+
85
+ 4. Download the `model.pth` and `config.yml` files from this repository
86
+
87
+ 5. Run inference using:
88
+ ```bash
89
+ python inference.py --config config.yml --model model.pth --text "الإِتْقَانُ يَحْتَاجُ إِلَى الْعَمَلِ وَالْمُثَابَرَة"
90
+ ```
91
+
92
+ Make sure to use properly diacritized Arabic text for best results.
93
+
94
+ ### Out-of-Scope Use
95
+
96
+ The model is specifically designed for Arabic text-to-speech synthesis and may not perform well for:
97
+ - Other languages
98
+ - Heavy dialect variations
99
+ - Non-diacritized Arabic text
100
+
101
+ ## Training Details
102
+
103
+ ### Training Data
104
+
105
+ - Training was performed on approximately 22 hours of Arabic audiobook data
106
+ - Dataset: [fadi77/arabic-audiobook-dataset-24khz](https://huggingface.co/datasets/fadi77/arabic-audiobook-dataset-24khz)
107
+ - The PL-BERT component was trained on fully diacritized Wikipedia Arabic text
108
+
109
+ ### Training Hyperparameters
110
+
111
+ - **Number of epochs:** 20
112
+ - **Diffusion training:** Started from epoch 5
113
+
114
+ ### Objectives
115
+ - **Training objectives:** All original StyleTTS2 objectives maintained, except WavLM adversarial training
116
+ - **Validation objectives:** Identical to original StyleTTS2 validation process
117
+
118
+ ### Compute Infrastructure
119
+ - **Hardware Type:** NVIDIA H100 GPU
120
+
121
+ ### Notable Modifications from Original StyleTTS2 in Architecture and Objectives
122
+ The architecture of the model follows that of StyleTTS2 with the following exceptions:
123
+ - Removed WavLM adversarial training component
124
+ - Custom PL-BERT trained for Arabic language
125
+
126
+
127
+ ## Citation
128
+
129
+ **BibTeX:**
130
+ ```bibtex
131
+ @article{styletts2,
132
+ title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
133
+ author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
134
+ journal={arXiv preprint arXiv:2306.07691},
135
+ year={2023}
136
+ }
137
+ ```
138
+
139
+ ## Model Card Contact
140
+
141
+ GitHub: [@Fadi987](https://github.com/Fadi987)
142
+ Hugging Face: [@fadi77](https://huggingface.co/fadi77)
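Reviewer note: this README asks for properly diacritized input and lists non-diacritized text as out of scope. A rough pre-flight check could look like the sketch below, which compares the number of Arabic diacritic marks (U+064B to U+0652) with the number of Arabic letters (U+0621 to U+064A). The function names and the `0.5` threshold are assumptions for illustration, not something the model repository provides.

```python
def diacritic_ratio(text: str) -> float:
    """Ratio of Arabic diacritic marks to Arabic letters (0.0 if there are no letters)."""
    # Letter range U+0621..U+064A; it also includes tatweel (U+0640), which is
    # acceptable for a rough heuristic. Mark range U+064B..U+0652 covers
    # tanwin, fatha, damma, kasra, shadda, and sukun.
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    marks = sum(1 for ch in text if "\u064B" <= ch <= "\u0652")
    return marks / letters if letters else 0.0


def looks_diacritized(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: treat text as diacritized if marks are common relative to letters."""
    # The threshold is a hypothetical value; tune it on real inputs.
    return diacritic_ratio(text) >= threshold
```

A check like this only flags obviously bare text; it cannot tell whether the diacritics present are linguistically correct.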
models/ar/StyleTTS2-LibriTTS-arabic/config.yml ADDED
@@ -0,0 +1,114 @@
1
+ log_dir: "/style_tts2/Models/FineTune.AudioBook"
2
+ log_interval: 10
3
+ device: "cuda"
4
+ epochs: 25 # number of fine-tuning epochs
5
+ batch_size: 6
6
+ max_len: 300 # maximum number of frames
7
+ pretrained_model_repo: "yl4579/StyleTTS2-LibriTTS"
8
+ pretrained_model_filename: "Models/LibriTTS/epochs_2nd_00020.pth"
9
+ second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
10
+ load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters
11
+
12
+ F0_path: "/root/Utils/JDC/bst.t7"
13
+ ASR_config: "/root/Utils/ASR/config.yml"
14
+ ASR_path: "/root/Utils/ASR/epoch_00080.pth"
15
+ PLBERT_repo_id: "fadi77/pl-bert"
16
+ PLBERT_dirname: "models/mlm_only_with_diacritics"
17
+
18
+ data_params:
19
+ train_data: "Data/youtube_train_list.txt"
20
+ val_data: "Data/youtube_val_list.txt"
21
+ root_path: "Youtube/wavs"
22
+ OOD_data: "Data/youtube_train_list.txt"
23
+ min_length: 50 # sample until texts with this size are obtained for OOD texts
24
+
25
+ preprocess_params:
26
+ sr: 24000
27
+ spect_params:
28
+ n_fft: 2048
29
+ win_length: 1200
30
+ hop_length: 300
31
+
32
+ model_params:
33
+ multispeaker: false
34
+
35
+ dim_in: 64
36
+ hidden_dim: 512
37
+ max_conv_dim: 512
38
+ n_layer: 3
39
+ n_mels: 80
40
+
41
+ n_token: 178 # number of phoneme tokens
42
+ max_dur: 50 # maximum duration of a single phoneme
43
+ style_dim: 128 # style vector size
44
+
45
+ dropout: 0.2
46
+
47
+ # config for decoder
48
+ decoder:
49
+ type: 'hifigan' # either hifigan or istftnet
50
+ resblock_kernel_sizes: [3,7,11]
51
+ upsample_rates : [10,5,3,2]
52
+ upsample_initial_channel: 512
53
+ resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
54
+ upsample_kernel_sizes: [20,10,6,4]
55
+
56
+ # speech language model config
57
+ slm:
58
+ model: 'microsoft/wavlm-base-plus'
59
+ sr: 16000 # sampling rate of SLM
60
+ hidden: 768 # hidden size of SLM
61
+ nlayers: 13 # number of layers of SLM
62
+ initial_channel: 64 # initial channels of SLM discriminator head
63
+
64
+ # style diffusion model config
65
+ diffusion:
66
+ embedding_mask_proba: 0.1
67
+ # transformer config
68
+ transformer:
69
+ num_layers: 3
70
+ num_heads: 8
71
+ head_features: 64
72
+ multiplier: 2
73
+
74
+ # diffusion distribution config
75
+ dist:
76
+ sigma_data: 0.2 # placeholder value, used when estimate_sigma_data is set to false
77
+ estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
78
+ mean: -3.0
79
+ std: 1.0
80
+
81
+ loss_params:
82
+ lambda_mel: 5. # mel reconstruction loss
83
+ lambda_gen: 1. # generator loss
84
+ lambda_slm: 1. # slm feature matching loss
85
+
86
+ lambda_mono: 1. # monotonic alignment loss (TMA)
87
+ lambda_s2s: 1. # sequence-to-sequence loss (TMA)
88
+
89
+ lambda_F0: 1. # F0 reconstruction loss
90
+ lambda_norm: 1. # norm reconstruction loss
91
+ lambda_dur: 1. # duration loss
92
+ lambda_ce: 20. # duration predictor probability output CE loss
93
+ lambda_sty: 1. # style reconstruction loss
94
+ lambda_diff: 1. # score matching loss
95
+
96
+ # Note: Current values for training are only adequate for second stage finetuning.
97
+ diffusion_training_epoch: 5
98
+ joint_training_epoch: 100
99
+
100
+ # Note: Current values for learning rates are very low. This is only adequate for second stage finetuning.
101
+ optimizer_params:
102
+ lr: 0.0001 # general learning rate
103
+ bert_lr: 0.00001 # learning rate for PLBERT
104
+ ft_lr: 0.0001 # learning rate for acoustic modules
105
+
106
+ slmadv_params:
107
+ min_len: 400 # minimum length of samples
108
+ max_len: 500 # maximum length of samples
109
+ batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
110
+ skip_update: 10 # update the discriminator once every this many generator updates
111
+ thresh: 5 # gradient norm above which the gradient is scaled
112
+ scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
113
+ sig: 1.5 # sigma for differentiable duration modeling
114
+
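Reviewer note: several values in this config are linked by simple arithmetic, which is easy to check. The sketch below takes `sr`, `hop_length`, `win_length`, `max_len`, and `upsample_rates` from the config above and derives the frame rate, the window and hop durations, the longest training clip, and the product of the decoder upsampling rates. The helper name `frame_stats` is hypothetical, and the sketch makes no claim about how StyleTTS2 wires these values together internally.

```python
from math import prod


def frame_stats(sr, hop_length, win_length, max_len, upsample_rates):
    """Derive timing figures from the spectrogram and decoder settings."""
    return {
        "frames_per_second": sr / hop_length,           # mel frames per second of audio
        "hop_ms": 1000 * hop_length / sr,               # time step between frames
        "window_ms": 1000 * win_length / sr,            # analysis window duration
        "max_clip_seconds": max_len * hop_length / sr,  # max_len is given in frames
        "decoder_upsampling": prod(upsample_rates),     # product of upsample_rates
    }


stats = frame_stats(
    sr=24000,
    hop_length=300,
    win_length=1200,
    max_len=300,
    upsample_rates=[10, 5, 3, 2],
)
for name, value in stats.items():
    print(f"{name}: {value}")
```

With these settings the frame rate is 80 frames per second, so `max_len: 300` corresponds to 3.75 seconds of audio, and the upsampling rates multiply to 300, the same number as `hop_length`.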
models/ar/StyleTTS2-LibriTTS-arabic/model.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59d2323412f0c55c774b5675b45e5c12659c0d9e0f9e7012eecc6b7dd845b132
3
+ size 2201968238
models/ar/StyleTTS2-LibriTTS-arabic/source.txt ADDED
@@ -0,0 +1 @@
1
+ https://huggingface.co/fadi77/StyleTTS2-LibriTTS-arabic
models/ar/StyleTTS2-LibriTTS-arabic/synthesized_audio.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f60e90523d734eff1b9f4b95cca49f22277df5cb4acd0bd347fde18f1c3b0469
3
+ size 1795058
models/en/StyleTTS2-LibriTTS/.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
models/en/StyleTTS2-LibriTTS/Models/config.yml ADDED
@@ -0,0 +1,21 @@
1
+ {ASR_config: Utils/ASR/config.yml, ASR_path: Utils/ASR/epoch_00080.pth, F0_path: Utils/JDC/bst.t7,
2
+ PLBERT_dir: Utils/PLBERT/, batch_size: 8, data_params: {OOD_data: Data/OOD_texts.txt,
3
+ min_length: 50, root_path: '', train_data: Data/train_list.txt, val_data: Data/val_list.txt},
4
+ device: cuda, epochs_1st: 40, epochs_2nd: 25, first_stage_path: first_stage.pth,
5
+ load_only_params: false, log_dir: Models/LibriTTS, log_interval: 10, loss_params: {
6
+ TMA_epoch: 4, diff_epoch: 0, joint_epoch: 0, lambda_F0: 1.0, lambda_ce: 20.0,
7
+ lambda_diff: 1.0, lambda_dur: 1.0, lambda_gen: 1.0, lambda_mel: 5.0, lambda_mono: 1.0,
8
+ lambda_norm: 1.0, lambda_s2s: 1.0, lambda_slm: 1.0, lambda_sty: 1.0}, max_len: 300,
9
+ model_params: {decoder: {resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3,
10
+ 5]], resblock_kernel_sizes: [3, 7, 11], type: hifigan, upsample_initial_channel: 512,
11
+ upsample_kernel_sizes: [20, 10, 6, 4], upsample_rates: [10, 5, 3, 2]}, diffusion: {
12
+ dist: {estimate_sigma_data: true, mean: -3.0, sigma_data: 0.19926648961191362,
13
+ std: 1.0}, embedding_mask_proba: 0.1, transformer: {head_features: 64, multiplier: 2,
14
+ num_heads: 8, num_layers: 3}}, dim_in: 64, dropout: 0.2, hidden_dim: 512,
15
+ max_conv_dim: 512, max_dur: 50, multispeaker: true, n_layer: 3, n_mels: 80, n_token: 178,
16
+ slm: {hidden: 768, initial_channel: 64, model: microsoft/wavlm-base-plus, nlayers: 13,
17
+ sr: 16000}, style_dim: 128}, optimizer_params: {bert_lr: 1.0e-05, ft_lr: 1.0e-05,
18
+ lr: 0.0001}, preprocess_params: {spect_params: {hop_length: 300, n_fft: 2048,
19
+ win_length: 1200}, sr: 24000}, pretrained_model: Models/LibriTTS/epoch_2nd_00002.pth,
20
+ save_freq: 1, second_stage_load_pretrained: true, slmadv_params: {batch_percentage: 0.5,
21
+ iter: 20, max_len: 500, min_len: 400, scale: 0.01, sig: 1.5, thresh: 5}}
models/en/StyleTTS2-LibriTTS/Models/epochs_2nd_00020.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1164ffe19a17449d2c722234cecaf2836b35a698fb8ffd42562d2663657dca0a
3
+ size 771390526
models/en/StyleTTS2-LibriTTS/README.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ language:
+ - en
+ - multilingual
+ tags:
+ - text-to-speech
+ - speech-synthesis
+ - pytorch
+ - styletts2
+ - speaches
+ - neural-tts
+ - voice-cloning
+ pipeline_tag: text-to-speech
+ library_name: pytorch
+ license: mit
+ datasets:
+ - LibriTTS
+ metrics:
+ - naturalness
+ - similarity
+ widget:
+ - text: "Hello, this is a sample of StyleTTS2 speech synthesis."
+   example_title: "English Sample"
+ - text: "StyleTTS2 can synthesize high-quality speech with style control."
+   example_title: "Style Control Sample"
+ ---
+
+ # StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
+
+ StyleTTS 2 is a text-to-speech model that combines style diffusion with adversarial training against large speech language models (SLMs), with the goal of human-level speech synthesis. It builds on the original StyleTTS and improves naturalness and speaker similarity.
+
+ ## Model Description
+
+ - **Model Type**: Neural Text-to-Speech (TTS)
+ - **Language(s)**: English (primary), with support for 18+ languages
+ - **License**: MIT
+ - **Paper**: [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
+ - **Sample Rate**: 24,000 Hz
+ - **Architecture**: Style diffusion with adversarial training
+
+ ## Features
+
+ - **High-Quality Synthesis**: Targets human-level naturalness in speech synthesis
+ - **Style Control**: Style transfer and voice cloning from reference audio
+ - **Multi-Language Support**: Primary English model with support for 18+ additional languages
+ - **Voice Cloning**: Can clone voices from reference audio samples
+ - **Diffusion-Based**: Uses a diffusion model to sample speaking styles; the waveform itself is produced by a HiFi-GAN decoder
+
+ ## Usage
+
+ This model is designed for text-to-speech synthesis with the following capabilities:
+
+ 1. **Multi-Voice Synthesis**: Generate speech using preset voice styles
+ 2. **Voice Cloning**: Clone voices from reference audio samples
+ 3. **Style Control**: Adjust synthesis parameters for different styles
+ 4. **Multi-Language**: Support for various languages with English-accented pronunciation
+
+ ### Parameters
+
+ - `alpha` (0.0-1.0): Style blending factor (default: 0.3)
+ - `beta` (0.0-1.0): Style mixing factor (default: 0.7)
+ - `diffusion_steps` (3-20): Number of diffusion steps for quality (default: 5)
+ - `embedding_scale` (1.0-10.0): Embedding scale factor (default: 1.0)
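The README does not spell out how `alpha` and `beta` act on the style vector. One plausible reading, sketched below, is a linear interpolation between the style sampled by the diffusion model and the style extracted from the reference audio. The sketch assumes a 256-dimensional style vector split into two halves of `style_dim: 128`, with `alpha` weighting the first (acoustic) half and `beta` the second (prosodic) half; the function name `blend_style` and the half ordering are assumptions for illustration, not a documented API of this model.

```python
from typing import List, Sequence

STYLE_DIM = 128  # matches style_dim in the config; the split order is an assumption


def blend_style(
    predicted: Sequence[float],
    reference: Sequence[float],
    alpha: float = 0.3,
    beta: float = 0.7,
    style_dim: int = STYLE_DIM,
) -> List[float]:
    """Interpolate between a diffusion-sampled style and a reference style.

    Weight 0.0 keeps the reference half unchanged; weight 1.0 uses only the
    predicted half. Hypothetical helper, not part of any released package.
    """
    if len(predicted) != 2 * style_dim or len(reference) != 2 * style_dim:
        raise ValueError("style vectors must have length 2 * style_dim")
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0.0, 1.0]")

    def lerp(weight: float, new: float, old: float) -> float:
        return weight * new + (1.0 - weight) * old

    first_half = [
        lerp(alpha, p, r)
        for p, r in zip(predicted[:style_dim], reference[:style_dim])
    ]
    second_half = [
        lerp(beta, p, r)
        for p, r in zip(predicted[style_dim:], reference[style_dim:])
    ]
    return first_half + second_half


# Toy vectors: predicted style is all ones, reference style is all zeros.
blended = blend_style([1.0] * 256, [0.0] * 256)
print(blended[0], blended[STYLE_DIM])
```

Under this reading, lowering `alpha` keeps the output closer to the reference speaker, while raising `beta` lets the text drive more of the second half of the style vector.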
+
+ ## Training Data
+
+ - **Primary Dataset**: LibriTTS
+ - **Languages**: English (primary) + 18 additional languages
+ - **Training Approach**: Style diffusion with adversarial training using large speech language models
+
+ ## Performance
+
+ According to the StyleTTS 2 paper, the model reaches human-level performance in:
+ - **Naturalness**: Rated comparable to human speech in listening tests
+ - **Similarity**: High-fidelity voice cloning and style transfer
+ - **Quality**: Higher audio quality than the earlier TTS models it was compared against
+
+ ## Limitations
+
+ - **Compute Requirements**: Requires significant computational resources for inference
+ - **English-First**: Optimized for English; other languages may have accented pronunciation
+ - **Context Dependency**: Performance varies with input text length and complexity
+
+ ## Citation
+
+ ```bibtex
+ @article{li2024styletts2,
+ title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
+ author={Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
+ journal={arXiv preprint arXiv:2306.07691},
+ year={2023}
+ }
+ ```
+
+ ## Links
+
+ - Paper: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
+ - Samples: [https://styletts2.github.io/](https://styletts2.github.io/)
+ - Code: [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
+ - License: MIT License
models/en/StyleTTS2-LibriTTS/source.txt ADDED
@@ -0,0 +1 @@
+ https://huggingface.co/jakezp/StyleTTS2-LibriTTS