---
language: eu
license: apache-2.0
tags:
  - text-to-speech
  - basque
  - styletts2
  - multispeaker
---

# StyleTTS2 — Basque Multispeaker TTS

This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque synthesis. It was trained from scratch on the Basque multispeaker [Sonora](https://zenodo.org/records/17952596) speech corpus and produces good-quality Basque speech.
 
Examples (playable):

- **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_antton.wav">Your browser does not support the audio element.</audio>

- **Sample 2** — "Herriko errekan bakarrik korrika."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_maider.wav">Your browser does not support the audio element.</audio>

Main modifications:
- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): a PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
- ASR-eu: an ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: we used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate the IPA phonemes for training. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the code itself is in the `phonemizer` directory. We collapsed multi-character phonemes into single characters for better grapheme–phoneme alignment.
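
The collapsing step mentioned in the last item can be pictured with a small sketch. This is only an illustration of the idea: the mapping table and the `collapse_phonemes` helper are hypothetical and are not taken from the Aholab phonemizer, whose actual symbol inventory may differ.

```python
# Illustrative mapping only: the real symbol inventory used for training is
# defined by the phonemizer code and is not reproduced here.
MULTI_TO_SINGLE = {
    "tʃ": "ʧ",  # two code points -> one ligature character
    "ts": "ʦ",
}


def collapse_phonemes(ipa: str, table: dict[str, str] = MULTI_TO_SINGLE) -> str:
    """Replace each multi-character phoneme with its single-character symbol."""
    # Try longer keys first so overlapping sequences resolve predictably.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No multi-character phoneme starts here: copy the character as-is.
            out.append(ipa[i])
            i += 1
    return "".join(out)


collapsed = collapse_phonemes("etʃe")
print(collapsed, len(collapsed))  # eʧe 3
```

After collapsing, every phoneme occupies exactly one position in the input string, so the number of input symbols equals the number of phonemes.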




## Model details

| Property | Value |
|---|---|
| Architecture | StyleTTS2 (trained from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |

## Training dataset

[Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset.
- Number of speakers: 2
- Audio: 13,500 utterances per speaker, totalling 34 hours and 18 minutes.
- Dataset split: we used 100 samples for validation and 500 for testing.
- OOD dataset: we used a separate text dataset as the out-of-distribution (OOD) set.

## Training

Summary of the main training parameters (from `config_basque_multispeaker_phoneme_wavlm_800.yml`):

- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 2 
- **Max length:** max_len = 500 
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
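
To make the audio settings above more concrete, the following sketch converts them into time units. It assumes that `max_len` is counted in mel-spectrogram frames, as in the upstream StyleTTS2 configs; that interpretation is not stated in this card.

```python
SR = 24000      # sr: samples per second
HOP = 300       # hop_length: samples between consecutive frames
WIN = 1200      # win_length: samples per analysis window
MAX_LEN = 500   # max_len: assumed here to be a number of mel frames

frames_per_second = SR / HOP           # mel frames produced per second of audio
hop_ms = 1000 * HOP / SR               # time step between frames, in milliseconds
win_ms = 1000 * WIN / SR               # analysis window length, in milliseconds
max_clip_seconds = MAX_LEN * HOP / SR  # longest training clip under the assumption

print(frames_per_second, hop_ms, win_ms, max_clip_seconds)  # 80.0 12.5 50.0 6.25
```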


## Files in this repository

| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |

> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.

## Setup

```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque

# 2. Install dependencies
pip install -r requirements.txt

# 3. Create the directories that will hold the model weights
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC

# 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific)
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7

# 5. Download the Basque weights from this HF repo with huggingface_hub
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/StyleTTS2-eu"
files = {
    "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "epoch_00200.pth":     "Utils/ASR_basque/epoch_00200.pth",
    "step_4000000.t7":     "Utils/PLBERT_phoneme/step_4000000.t7",
}
# bst.t7 is not hosted in this repo; the wget command above fetches it from
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```
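
Before running inference, it can help to confirm that every file ended up where the code expects it. The following is a minimal sketch, using only the paths listed in the file table and setup script above; the `missing_files` helper is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Target paths taken from the file table and the setup script.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root: str = ".") -> list[str]:
    """Return the expected files that are absent or empty under `root`."""
    base = Path(root)
    missing = []
    for rel in EXPECTED_FILES:
        path = base / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing or empty files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```

Run it from the root of the cloned `StyleTTS2_basque` directory. An empty file is reported as missing because it usually indicates an interrupted download.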

## Inference

**CLI:**
```bash
python inference.py \
    --config  Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
    --model   Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
    --ref     Demo/ref_antton.wav \
    --text    "Kaixo, zelan zaude?" \
    --output  output/kaixo.wav
```

**Python API:**
```python
from inference import Synthesizer

synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton.wav',
)

wav = synth.run("Kaixo, zelan zaude?")
synth.save(wav, "output/kaixo.wav")

# Different speaker
wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
synth.save(wav2, "output/arratsalde.wav")
```

Key parameters for `run()`:

| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
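
The `alpha` and `beta` descriptions suggest a linear interpolation between the style computed from the reference audio and the style sampled by the diffusion model, which is how the upstream StyleTTS2 inference code combines them. The sketch below assumes this repository's `Synthesizer` follows the same scheme; `mix_style` and the toy vectors are hypothetical and only illustrate the arithmetic.

```python
def mix_style(reference, sampled, weight):
    """Linear interpolation: weight=0 gives the reference, weight=1 the sampled vector."""
    if len(reference) != len(sampled):
        raise ValueError("reference and sampled must have the same length")
    return [(1.0 - weight) * r + weight * s for r, s in zip(reference, sampled)]


# Toy 2-dimensional vectors; the real style vectors have style_dim = 128.
ref_timbre, sampled_timbre = [0.0, 1.0], [1.0, 0.0]
ref_prosody, sampled_prosody = [0.0, 1.0], [1.0, 0.0]

timbre = mix_style(ref_timbre, sampled_timbre, weight=0.3)     # alpha default
prosody = mix_style(ref_prosody, sampled_prosody, weight=0.7)  # beta default
print(timbre, prosody)
```

With the defaults, the timbre stays closer to the reference speaker while the prosody leans towards the sampled style.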

## Reference speakers

Two reference recordings are included in the repo under `Demo/`:
- `ref_antton.wav` — male speaker
- `ref_maider.wav` — female speaker


All credit goes to the authors of StyleTTS2.

## Citation

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```

## Additional Information


### Author

[Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU

### Contact
For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright
Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)


### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.