---
language: eu
license: apache-2.0
tags:
  - text-to-speech
  - basque
  - styletts2
  - multispeaker
  - emotional
---

# StyleTTS2 — Basque Multispeaker Emotional TTS

This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture, adapted for **emotional Basque speech synthesis**. The model supports three emotional styles: neutral, happy (poza), and sad (tristura).

Examples (playable):

- **Sample 1 — Antton (Neutral)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_neutral.wav">Your browser does not support the audio element.</audio>

- **Sample 1 — Antton (Happy)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_pozik.wav">Your browser does not support the audio element.</audio>

- **Sample 1 — Antton (Sad)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_triste.wav">Your browser does not support the audio element.</audio>

- **Sample 2 — Maider (Neutral)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_neutral.wav">Your browser does not support the audio element.</audio>

- **Sample 2 — Maider (Happy)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_pozik.wav">Your browser does not support the audio element.</audio>

- **Sample 2 — Maider (Sad)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."

  <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_triste.wav">Your browser does not support the audio element.</audio>

Main modifications:
- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with WordPiece tokenizer for phonemized Basque text.
- ASR-eu: ASR model trained with a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: IPA phonemes for training were generated with code developed by [Aholab](https://aholab.ehu.eus/aholab/), which is available in the `phonemizer` directory. A demo of the Basque phonemizer is at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Multi-character phonemes were collapsed into single-character symbols for better grapheme–phoneme alignment.
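
The phoneme-collapsing step mentioned above can be sketched as a longest-match substitution. This is only an illustration: the `COLLAPSE` table and the `collapse_phonemes` helper are hypothetical, and the symbol pairs are not taken from the code in the `phonemizer` directory.

```python
# Hypothetical mapping: these symbol pairs are illustrative only and are not
# the actual table used by the repository's phonemizer.
COLLAPSE = {
    "tʃ": "C",
    "ts": "c",
    "rr": "R",
}


def collapse_phonemes(ipa: str, table: dict = COLLAPSE) -> str:
    """Replace each multi-character phoneme with its single-character stand-in."""
    keys = sorted(table, key=len, reverse=True)  # try longer symbols first
    out = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No multi-character phoneme starts here: keep the character as-is.
            out.append(ipa[i])
            i += 1
    return "".join(out)


print(collapse_phonemes("etʃe"))
```

After collapsing, every phoneme occupies exactly one character, so the length of the string equals the number of phonemes, which keeps token positions simple to align.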

## Emotions

The original [dataset](https://zenodo.org/records/18804769) contains four emotion categories. This model was trained on a subset of three emotions — Neutral, Happy, and Sad — as listed below.

| Emotion | Basque Tag | Description |
|---------|------------|-------------|
| Neutral | `neu` | Neutral/calm delivery |
| Happy | `poz` | Happy/expressive delivery (Poza) |
| Sad | `tri` | Sad/contemplative delivery (Tristura) |

## Model details

| | |
|---|---|
| Architecture | StyleTTS2 (trained from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers: Antton, Maider) |
| Emotions | Neutral, Happy (Poza), Sad (Tristura) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |

## Training dataset

[HiTZ-Aholab emotional speech synthesis dataset in Basque](https://zenodo.org/records/18804769) — emotional speech corpus.

- **Number of speakers:** two (Antton, Maider)
- **Audio:** 16,000 utterances per speaker, totalling approximately 43 hours and 58 minutes
  - Maider: ~21h 22min
  - Antton: ~22h 36min
- **Emotions:** four categories (4,000 utterances per emotion per speaker) — Poza (joy), Haserre (anger), Harridura (surprise), Tristura (sadness)
  - *Note: although the dataset contains four emotions, this model was trained on a balanced subset of three: Neutral, Happy (Poza), Sad (Tristura) — with the same number of samples per emotion.*
- **Dataset split:** 100 samples for validation, 600 for testing (300 per speaker)

## Training

Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml`):

- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 1
- **Max length:** max_len = 500
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
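
To make the audio settings above more concrete, the following sketch derives the frame timing they imply. It assumes, as in the upstream StyleTTS2 configuration, that `max_len` is counted in mel frames; that interpretation is not stated in this card.

```python
# Values copied from the training summary above.
SR = 24_000         # sr: sample rate in Hz
N_FFT = 2_048       # n_fft
WIN_LENGTH = 1_200  # win_length in samples
HOP_LENGTH = 300    # hop_length in samples
MAX_LEN = 500       # max_len, ASSUMED to be a number of mel frames

hop_ms = 1000 * HOP_LENGTH / SR               # time step between mel frames
win_ms = 1000 * WIN_LENGTH / SR               # analysis window duration
frames_per_second = SR / HOP_LENGTH           # mel frame rate
max_clip_seconds = MAX_LEN * HOP_LENGTH / SR  # longest training segment
freq_bins = N_FFT // 2 + 1                    # linear-frequency bins before the mel filterbank

print(f"hop = {hop_ms} ms, window = {win_ms} ms")
print(f"{frames_per_second} frames/s, max segment = {max_clip_seconds} s")
print(f"{freq_bins} frequency bins")
```

Under that assumption, each mel frame advances 12.5 ms and a training segment is capped at 6.25 s of audio.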

## Files in this repository

| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_3580000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |

> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.

## Setup

```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque

# 2. Install dependencies
pip install -r requirements.txt

# 3. Create the target directories for the weights
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_emo Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC

# 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific)
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7

# 5. Download the model weights from this HF repo with huggingface_hub and copy them into place
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/StyleTTS2-eu_emo"
files = {
    "config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "epoch_00200.pth":     "Utils/ASR_basque/epoch_00200.pth",
    "step_3580000.t7":     "Utils/PLBERT_phoneme/step_3580000.t7",
}
# bst.t7 is not hosted in this repo; it is fetched by the wget command above
# (source: https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC)
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```
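
Before running inference, it may help to confirm that every file ended up where the setup steps put it. The following is a minimal sketch; `missing_files` is a hypothetical helper, not part of the repository, and the paths are the ones used in the setup commands above.

```python
from pathlib import Path

# Target paths from the setup steps above, relative to the repository root.
EXPECTED = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_3580000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist as files under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]


# Run from the StyleTTS2_basque checkout; prints nothing when all files exist.
for rel in missing_files():
    print(f"missing: {rel}")
```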

## Inference

**CLI:**
```bash
python inference.py \
    --config  Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml \
    --model   Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth \
    --ref     Demo/ref_antton_poz.wav \
    --text    "Kaixo, zelan zaude?" \
    --output  output/kaixo.wav
```

**Python API:**
```python
from inference import Synthesizer

synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton_neu.wav',
)

# Neutral emotion
wav = synth.run("Kaixo, zelan zaude?", ref='Demo/ref_antton_neu.wav')
synth.save(wav, "output/kaixo_neu.wav")

# Happy emotion (using poza reference)
wav2 = synth.run("Zorioneko gara!", ref='Demo/ref_antton_poz.wav')
synth.save(wav2, "output/kaixo_poz.wav")

# Sad emotion (using tristura reference)
wav3 = synth.run("Hau oso tristea da.", ref='Demo/ref_antton_tri.wav')
synth.save(wav3, "output/kaixo_tri.wav")
```

Key parameters for `run()`:

| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker & emotion style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
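
The effect of `alpha` and `beta` can be illustrated with a small interpolation sketch. It follows the mixing used in the upstream StyleTTS2 demo code, where the first `style_dim` values of the style vector carry timbre and the remaining values carry prosody. Whether `Synthesizer.run()` applies exactly this formula is an assumption, and `mix_styles` is a hypothetical helper.

```python
STYLE_DIM = 128  # style_dim from the training config


def mix_styles(sampled, reference, alpha=0.3, beta=0.7, style_dim=STYLE_DIM):
    """Blend a sampled style vector with the style of the reference audio.

    Both inputs hold 2 * style_dim floats: the first half is treated as the
    timbre style, the second half as the prosody style.
    """
    if len(sampled) != 2 * style_dim or len(reference) != 2 * style_dim:
        raise ValueError("expected vectors of length 2 * style_dim")
    # alpha = 0 keeps the reference timbre, alpha = 1 keeps the sampled one.
    timbre = [alpha * s + (1 - alpha) * r
              for s, r in zip(sampled[:style_dim], reference[:style_dim])]
    # beta = 0 keeps the reference prosody, beta = 1 keeps the sampled one.
    prosody = [beta * s + (1 - beta) * r
               for s, r in zip(sampled[style_dim:], reference[style_dim:])]
    return timbre + prosody


sampled = [1.0] * (2 * STYLE_DIM)
reference = [0.0] * (2 * STYLE_DIM)
mixed = mix_styles(sampled, reference)
print(mixed[0], mixed[-1])  # timbre weight, prosody weight of the sampled style
```

With the defaults, timbre stays close to the reference speaker while prosody leans towards the style sampled for the input text.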

## Reference speakers

Six reference audios are included in the repo under `Demo/`, covering both speakers and all three emotions:

| Speaker | Neutral | Happy | Sad |
|---------|---------|-------|-----|
| Antton (male) | `ref_antton_neu.wav` | `ref_antton_poz.wav` | `ref_antton_tri.wav` |
| Maider (female) | `ref_maider_neu.wav` | `ref_maider_poz.wav` | `ref_maider_tri.wav` |
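
Because the reference files follow a regular `ref_<speaker>_<tag>.wav` pattern, the path for a given speaker and emotion can be built programmatically. The sketch below assumes the file names in the table above; `reference_path` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Emotion tags and speaker names as listed in this card.
EMOTION_TAGS = {"neutral": "neu", "happy": "poz", "sad": "tri"}
SPEAKERS = ("antton", "maider")


def reference_path(speaker, emotion, demo_dir="Demo"):
    """Build the path of the reference WAV for a speaker/emotion pair."""
    speaker = speaker.lower()
    emotion = emotion.lower()
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return Path(demo_dir) / f"ref_{speaker}_{EMOTION_TAGS[emotion]}.wav"


print(reference_path("Maider", "sad").as_posix())
```

The returned path can then be passed as the `ref` argument shown in the inference examples.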

All credit goes to the authors of StyleTTS2.

## Citation

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```

## Additional Information

### Authors

- [Ander Arriandiaga](https://huggingface.co/arrandi) — Aholab (HiTZ), EHU
- [Inmaculada Hernáez Rioja](mailto:inma.hernaez@ehu.eus) — Aholab (HiTZ), EHU

### Contact

For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.