arrandi committed on
Commit
df93da0
·
1 Parent(s): 4839b9e

Initial upload: StyleTTS2 Basque multispeaker model

.gitattributes CHANGED
@@ -23,13 +23,15 @@
23
  *.pth filter=lfs diff=lfs merge=lfs -text
24
  *.rar filter=lfs diff=lfs merge=lfs -text
25
  *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
  *.tar filter=lfs diff=lfs merge=lfs -text
 
29
  *.tflite filter=lfs diff=lfs merge=lfs -text
30
  *.tgz filter=lfs diff=lfs merge=lfs -text
31
  *.wasm filter=lfs diff=lfs merge=lfs -text
 
32
  *.xz filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
23
  *.pth filter=lfs diff=lfs merge=lfs -text
24
  *.rar filter=lfs diff=lfs merge=lfs -text
25
  *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ *.t7 filter=lfs diff=lfs merge=lfs -text
 
27
  *.tar filter=lfs diff=lfs merge=lfs -text
28
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
29
  *.tflite filter=lfs diff=lfs merge=lfs -text
30
  *.tgz filter=lfs diff=lfs merge=lfs -text
31
  *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.wav filter=lfs diff=lfs merge=lfs -text
33
  *.xz filter=lfs diff=lfs merge=lfs -text
34
  *.zip filter=lfs diff=lfs merge=lfs -text
35
  *.zst filter=lfs diff=lfs merge=lfs -text
36
  *tfevents* filter=lfs diff=lfs merge=lfs -text
37
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
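The rules added here (`*.t7`, `*.wav`) route the new checkpoints and audio samples through Git LFS. A rough way to see which paths the patterns catch is sketched below. It approximates gitattributes matching with Python's `fnmatch` (basename match for slash-free patterns, full-path match otherwise), which is an assumption and not Git's exact algorithm; the helper name `is_lfs_tracked` is illustrative.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Subset of the LFS patterns listed in the .gitattributes diff.
LFS_PATTERNS = ["*.pth", "*.safetensors", "*.t7", "*.tar", "*.tar.*", "*.wav", "saved_model/**/*"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: slash-free patterns are matched against the basename,
    patterns containing a slash against the whole relative path."""
    for pattern in patterns:
        target = path if "/" in pattern else basename(path)
        if fnmatchcase(target, pattern):
            return True
    return False


for name in ["epoch_2nd_00030.pth", "step_4000000.t7", "sample_antton.wav", "README.md"]:
    print(name, is_lfs_tracked(name))
```

For authoritative results in a real checkout, `git check-attr filter -- <path>` reports what Git itself decides.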
README.md CHANGED
@@ -1,3 +1,194 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language: eu
3
+ license: mit
4
+ tags:
5
+ - text-to-speech
6
+ - basque
7
+ - styletts2
8
+ - multispeaker
9
+ ---
10
+
11
+ # StyleTTS2 — Basque Multispeaker TTS
12
+
13
+ This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque synthesis. It was trained from scratch on the multispeaker Basque [Sonora](https://zenodo.org/records/17952596) speech corpus and produces good-quality Basque speech; two samples are given under Examples.
14
+
15
+ Examples (playable):
16
+
17
+ - **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."
18
+
19
+ <audio controls src="sample_antton.wav">Your browser does not support the audio element.</audio>
20
+
21
+ - **Sample 2** — "Herriko errekan bakarrik korrika."
22
+
23
+ <audio controls src="sample_maider.wav">Your browser does not support the audio element.</audio>
24
+
25
+ Main modifications:
26
+ - [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with WordPiece tokenizer for phonemized Basque text.
27
+ - ASR-eu: ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
28
+ - Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate the IPA phonemes used for training. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the phonemizer code itself is in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
29
+
30
+
31
+
32
+
33
+ ## Model details
34
+
35
+ | | |
36
+ |---|---|
37
+ | Architecture | StyleTTS2 (from scratch) |
38
+ | Language | Basque (`eu`) |
39
+ | Speakers | Multispeaker (two speakers) |
40
+ | Text input | Basque IPA phonemes |
41
+ | Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
42
+ | Sample rate | 24 000 Hz |
43
+ | Decoder | HiFiGAN |
44
+
45
+ ## Training dataset
46
+
47
+ [Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset.
48
+ - Number of speakers: two
49
+ - Audio available: 13,500 utterances per speaker, 34 hours and 18 minutes in total.
50
+ - Dataset division: We used 100 samples for validation and 500 for testing.
51
+ - OOD dataset: We use text from a different dataset as the out-of-distribution (OOD) set.
52
+
53
+ ## Training
54
+
55
+ Summary of the main training parameters (from `config_basque_multispeaker_phoneme_wavlm_800.yml`):
56
+
57
+ - **Device:** cuda
58
+ - **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
59
+ - **Batch:** batch_size = 2
60
+ - **Max length:** max_len = 500
61
+ - **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
62
+ - **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
63
+ - **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
64
+ - **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma_data = 0.2 is the fallback used only when estimation is off)
65
+ - **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
66
+
67
+
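To put `max_len` into perspective: upstream StyleTTS2 measures it in mel frames, and the W&B note in the config equates 800 frames with 10 s, which matches `hop_length / sr`. The sketch below shows that conversion. The function name is illustrative, and the frame interpretation is taken from upstream rather than stated in this README.

```python
SR = 24_000        # sr from the training summary
HOP_LENGTH = 300   # hop_length from the training summary


def frames_to_seconds(n_frames: int, hop_length: int = HOP_LENGTH, sr: int = SR) -> float:
    """Duration covered by n_frames mel frames."""
    return n_frames * hop_length / sr


print(SR / HOP_LENGTH)          # mel frames per second -> 80.0
print(frames_to_seconds(500))   # cap implied by max_len = 500 -> 6.25
print(frames_to_seconds(800))   # the 800-frame setting from the config notes -> 10.0
```

Under this reading, training clips are capped at about 6.25 s with `max_len = 500`.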
68
+ ## Files in this repository
69
+
70
+ | File | Description |
71
+ |---|---|
72
+ | `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
73
+ | `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
74
+ | `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
75
+ | `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |
76
+
77
+ > **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.
78
+
79
+ ## Setup
80
+
81
+ ```bash
82
+ # 1. Clone the code repository
83
+ git clone https://github.com/AArriandiaga/StyleTTS2_basque
84
+ cd StyleTTS2_basque
85
+
86
+ # 2. Install dependencies
87
+ pip install -r requirements.txt
88
+
89
+ # 3. Download model weights from this HF repo and place them:
90
+ mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
91
+ # Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
92
+ wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
93
+
94
+ # using huggingface_hub:
95
+ python - <<'EOF'
96
+ from huggingface_hub import hf_hub_download
97
+ import shutil
98
+
99
+ repo = "HiTZ/styletts2-basque"
100
+ files = {
101
+     "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
102
+     "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
103
+     "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
104
+     "step_4000000.t7": "Utils/PLBERT_phoneme/step_4000000.t7",
105
+ }
106
+ # bst.t7 comes from the original StyleTTS2 repo — download separately:
107
+ # https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
108
+ for hf_name, local_path in files.items():
109
+     src = hf_hub_download(repo_id=repo, filename=hf_name)
110
+     shutil.copy(src, local_path)
111
+     print(f"✓ {local_path}")
112
+ EOF
113
+ ```
114
+
115
+ ## Inference
116
+
117
+ **CLI:**
118
+ ```bash
119
+ python inference.py \
120
+     --config Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
121
+     --model Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
122
+     --ref Demo/ref_antton.wav \
123
+     --text "Kaixo, zelan zaude?" \
124
+     --output output/kaixo.wav
125
+ ```
126
+
127
+ **Python API:**
128
+ ```python
129
+ from inference import Synthesizer
130
+
131
+ synth = Synthesizer(
132
+     config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
133
+     checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
134
+     default_ref='Demo/ref_antton.wav',
135
+ )
136
+
137
+ wav = synth.run("Kaixo, zelan zaude?")
138
+ synth.save(wav, "output/kaixo.wav")
139
+
140
+ # Different speaker
141
+ wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
142
+ synth.save(wav2, "output/arratsalde.wav")
143
+ ```
144
+
145
+ Key parameters for `run()`:
146
+
147
+ | Parameter | Default | Description |
148
+ |---|---|---|
149
+ | `ref` | constructor default | Reference WAV for speaker style |
150
+ | `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
151
+ | `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
152
+ | `diffusion_steps` | 5 | Quality vs. speed trade-off |
153
+ | `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
154
+
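The `alpha` and `beta` descriptions are consistent with a linear interpolation between the style vector computed from the reference audio and the one sampled by the diffusion model, which is how the upstream StyleTTS2 demo code mixes them. The sketch below illustrates that reading on plain Python lists. `mix_style` is a hypothetical helper, not part of `inference.py`, and the split of the style vector into a timbre half and a prosody half (128 dimensions each, matching `style_dim`) is assumed from upstream.

```python
def lerp(ref, sampled, weight):
    """Per-element blend: weight = 0 keeps the reference, weight = 1 keeps the sample."""
    return [weight * s + (1.0 - weight) * r for r, s in zip(ref, sampled)]


def mix_style(ref_style, sampled_style, alpha=0.3, beta=0.7, style_dim=128):
    """Blend the first style_dim entries with alpha and the remainder with beta."""
    timbre = lerp(ref_style[:style_dim], sampled_style[:style_dim], alpha)
    prosody = lerp(ref_style[style_dim:], sampled_style[style_dim:], beta)
    return timbre + prosody


# Toy vectors: the reference is all zeros, the sampled style is all ones,
# so each mixed value shows how much of the sampled style was kept.
ref = [0.0] * 256
sampled = [1.0] * 256
mixed = mix_style(ref, sampled)
print(mixed[0], mixed[128])
```

With the defaults, the timbre half stays close to the reference speaker while the prosody half leans toward the sampled style.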
155
+ ## Reference speakers
156
+
157
+ Two reference audio files are included in the repo under `Demo/`:
158
+ - `ref_antton.wav` — male speaker
159
+ - `ref_maider.wav` — female speaker
160
+
161
+
162
+ All credit goes to the authors of StyleTTS2.
163
+
164
+ ## Citation
165
+
166
+ ```bibtex
167
+ @inproceedings{li2023styletts2,
168
+ title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
169
+ author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
170
+ booktitle = {Advances in Neural Information Processing Systems},
171
+ year = {2023},
172
+ }
173
+ ```
174
+
175
+ ## Additional Information
176
+
177
+
178
+ ### Author
179
+
180
+ Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU
181
+
182
+ ### Contact
183
+ For further information, please send an email to <inma.hernaez@ehu.eus>.
184
+
185
+ ### Copyright
186
+ Copyright (c) 2026 by Aholab, HiTZ.
187
+
188
+ ### License
189
+
190
+ [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
191
+
192
+
193
+ ### Funding
194
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
config_basque_multispeaker_phoneme_wavlm.yml ADDED
@@ -0,0 +1,123 @@
1
+ log_dir: "Models/Basque_Multispeaker_Phoneme_wavlm_normal"
2
+ first_stage_path: "first_stage.pth"
3
+ save_freq: 1
4
+ log_interval: 10
5
+ device: "cuda"
6
+ epochs_1st: 50 # Standard schedule like original config.yml
7
+ epochs_2nd: 30 # Standard schedule like original config.yml
8
+ batch_size: 2 # MEMORY OPTIMIZATION
9
+ max_len: 500
10
+
11
+ pretrained_model: ""
12
+ second_stage_load_pretrained: false
13
+ load_only_params: true # set to true if you do not want to load epoch numbers and optimizer parameters
14
+
15
+ F0_path: "Utils/JDC/bst.t7"
16
+ ASR_config: "Utils/ASR_basque/config.yml"
17
+ ASR_path: "Utils/ASR_basque/epoch_00200.pth"
18
+ ASR_module: "ASR_basque"
19
+ PLBERT_dir: 'Utils/PLBERT_phoneme/'
20
+
21
+ # Wandb configuration
22
+ wandb:
23
+   project: "StyleTTS2-Basque"
24
+   group: "basque_multispeaker_phoneme_albert_wavlm_800"
25
+   tags: ["basque", "multispeaker", "phoneme", "albert", "wavlm", "max_len_800"]
26
+   notes: "Multispeaker config: AlBERT-phoneme + WavLM + short + max_len=800 (10s)"
27
+
28
+ data_params:
29
+   train_data: "Data/train_list_multispeaker.cleaned.txt"
30
+   val_data: "Data/val_list_multispeaker.cleaned.txt" # use the multispeaker validation split (<=8s recommended)
31
+   test_data: "Data/test_list_multispeaker.cleaned.txt"
32
+   root_path: "/scratch/anderarrigandiaga/data/tts/eu/sonora/"
33
+   OOD_data: "Data/OOD_eu.cleaned.txt" # optional OOD set (kept as example)
34
+   min_length: 50 # sample until texts with this size are obtained for OOD texts
35
+
36
+ preprocess_params:
37
+   sr: 24000
38
+   spect_params:
39
+     n_fft: 2048
40
+     win_length: 1200
41
+     hop_length: 300
42
+
43
+ model_params:
44
+   multispeaker: true
45
+
46
+   dim_in: 64
47
+   hidden_dim: 512
48
+   max_conv_dim: 512
49
+   n_layer: 3
50
+   n_mels: 80
51
+
52
+   n_token: 178 # number of phoneme tokens
53
+   max_dur: 50 # maximum duration of a single phoneme
54
+   style_dim: 128 # style vector size
55
+
56
+   dropout: 0.2
57
+
58
+   # config for decoder
59
+   decoder:
60
+     type: 'hifigan' # either hifigan or istftnet
61
+     resblock_kernel_sizes: [3,7,11]
62
+     upsample_rates: [10,5,3,2]
63
+     upsample_initial_channel: 512
64
+     resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
65
+     upsample_kernel_sizes: [20,10,6,4]
66
+
67
+   # speech language model config
68
+   slm:
69
+     model: 'microsoft/wavlm-base-plus'
70
+     sr: 16000 # sampling rate of SLM
71
+     hidden: 768 # hidden size of SLM
72
+     nlayers: 13 # number of layers of SLM
73
+     initial_channel: 64 # initial channels of SLM discriminator head
74
+
75
+   # style diffusion model config
76
+   diffusion:
77
+     embedding_mask_proba: 0.1
78
+     # transformer config
79
+     transformer:
80
+       num_layers: 3
81
+       num_heads: 8
82
+       head_features: 64
83
+       multiplier: 2
84
+
85
+     # diffusion distribution config
86
+     dist:
87
+       sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
88
+       estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
89
+       mean: -3.0
90
+       std: 1.0
91
+
92
+ loss_params:
93
+   lambda_mel: 5. # mel reconstruction loss
94
+   lambda_gen: 1. # generator loss
95
+   lambda_slm: 1. # slm feature matching loss
96
+
97
+   lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
98
+   lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
99
+   TMA_epoch: 5 # TMA starting epoch (1st stage)
100
+
101
+   lambda_F0: 1. # F0 reconstruction loss (2nd stage)
102
+   lambda_norm: 1. # norm reconstruction loss (2nd stage)
103
+   lambda_dur: 1. # duration loss (2nd stage)
104
+   lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
105
+   lambda_sty: 1. # style reconstruction loss (2nd stage)
106
+   lambda_diff: 1. # score matching loss (2nd stage)
107
+
108
+   diff_epoch: 10 # style diffusion starting epoch (2nd stage)
109
+   joint_epoch: 15 # joint training starting epoch (2nd stage)
110
+
111
+ optimizer_params:
112
+   lr: 0.0001 # general learning rate
113
+   bert_lr: 0.00001 # learning rate for PLBERT
114
+   ft_lr: 0.00001 # learning rate for acoustic modules
115
+
116
+ slmadv_params:
117
+   min_len: 400 # minimum length of samples
118
+   max_len: 500 # maximum length of samples
119
+   batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
120
+   iter: 20 # update the discriminator every this iterations of generator update
121
+   thresh: 5 # gradient norm above which the gradient is scaled
122
+   scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
123
+   sig: 1.5 # sigma for differentiable duration modeling
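One consistency property worth checking in this config: for a HiFi-GAN-style decoder, the product of `upsample_rates` normally equals the spectrogram `hop_length`, so that each mel frame expands to exactly one hop of waveform samples. The sketch below verifies this with values copied from the config. It is an illustrative check, not code from the training repository, and `samples_per_frame` is a hypothetical helper.

```python
from math import prod

# Values copied from the config; only the keys needed for the check are kept.
config = {
    "preprocess_params": {"sr": 24000, "spect_params": {"hop_length": 300}},
    "model_params": {"decoder": {"type": "hifigan", "upsample_rates": [10, 5, 3, 2]}},
}


def samples_per_frame(cfg: dict) -> int:
    """Total upsampling factor of the decoder (product of all stages)."""
    return prod(cfg["model_params"]["decoder"]["upsample_rates"])


hop = config["preprocess_params"]["spect_params"]["hop_length"]
factor = samples_per_frame(config)
print(factor, hop, factor == hop)
```

If either `hop_length` or `upsample_rates` is changed for a new dataset, rerunning this kind of check is a cheap way to catch a length mismatch before training.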
epoch_00200.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9df6632c8b1f7dd628696bf6326005422bfa5c4c49a74de5c59369fe7bf34056
3
+ size 94573449
epoch_2nd_00030.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fb56c2bf275f9c60a052cc71412c1cc0752a0c2b744bbc9bae6a77e0a47c6f6c
3
+ size 2135548572
sample_antton.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:909d09a2a8454ff0a065f544f5307904eb3d72b993cdb2c55a67da129f94f6af
3
+ size 265144
sample_maider.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e4686b404895f174052859f55b6a4184fc9442c469b594782d39a76b1ba48bf
3
+ size 129544
step_4000000.t7 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cd5f5e669db09e598da990fe4e8897128bd8f7ffa15b877151b15b7521565d4a
3
+ size 533867882
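Each binary added in this commit is stored as a Git LFS pointer with three fields: the spec version, a SHA-256 object id, and the size in bytes. After downloading the real file, it can be checked against its pointer. The sketch below assumes the pointer text has been copied into a string without the leading `+` diff markers; `parse_lfs_pointer` and `matches_pointer` are illustrative names.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Compare a local file's size and hash with the values in the pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte checkpoints do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer for sample_maider.wav, copied from the diff.
pointer = parse_lfs_pointer("""version https://git-lfs.github.com/spec/v1
oid sha256:1e4686b404895f174052859f55b6a4184fc9442c469b594782d39a76b1ba48bf
size 129544
""")
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("sample_maider.wav", pointer)` on the downloaded file should then return `True` if the download is intact.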