---
language: eu
license: apache-2.0
tags:
- text-to-speech
- basque
- styletts2
- multispeaker
- emotional
---
# StyleTTS2 — Basque Multispeaker Emotional TTS
This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture, adapted for **emotional Basque speech synthesis**. The model supports three emotional styles: neutral, happy (poza), and sad (tristura).
Examples (playable):
- **Sample 1 — Antton (Neutral)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_neutral.wav">Your browser does not support the audio element.</audio>
- **Sample 1 — Antton (Happy)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_pozik.wav">Your browser does not support the audio element.</audio>
- **Sample 1 — Antton (Sad)** — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_triste.wav">Your browser does not support the audio element.</audio>
- **Sample 2 — Maider (Neutral)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_neutral.wav">Your browser does not support the audio element.</audio>
- **Sample 2 — Maider (Happy)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_pozik.wav">Your browser does not support the audio element.</audio>
- **Sample 2 — Maider (Sad)** — "Gure patua hau izatea litekeena da, baina okerra deritzot."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_triste.wav">Your browser does not support the audio element.</audio>
Main modifications:
- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
- ASR-eu: ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Likewise, the code used to generate IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
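
The collapsing step mentioned in the last item can be pictured as a longest-match replacement over the phoneme string. The sketch below is only an illustration: the mapping table and the `collapse_phonemes` helper are hypothetical, and the symbols actually used are defined by the code in the `phonemizer` directory.

```python
# Hypothetical mapping from multi-character phonemes to single characters.
# The real table is defined by the code in the `phonemizer` directory.
COLLAPSE_MAP = {
    "tʃ": "ʧ",
    "ts": "ʦ",
}


def collapse_phonemes(ipa: str, mapping: dict = COLLAPSE_MAP) -> str:
    """Replace each multi-character phoneme with its single-character symbol."""
    # Try longer keys first so overlapping sequences resolve predictably.
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No multi-character phoneme starts here: keep the character.
            out.append(ipa[i])
            i += 1
    return "".join(out)


collapsed = collapse_phonemes("etʃe")
print(collapsed, len(collapsed))
```

After collapsing, each phoneme occupies exactly one position in the string, so the text aligner sees one token per phoneme.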
## Emotions
The original [dataset](https://zenodo.org/records/18804769) contains four emotion categories. This model was trained on a subset of three emotions — Neutral, Happy, and Sad — as listed below.
| Emotion | Basque Tag | Description |
|---------|------------|-------------|
| Neutral | `neu` | Neutral/calm delivery |
| Happy | `poz` | Happy/expressive delivery (Poza) |
| Sad | `tri` | Sad/contemplative delivery (Tristura) |
## Model details
| | |
|---|---|
| Architecture | StyleTTS2 (trained from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers: Antton, Maider) |
| Emotions | Neutral, Happy (Poza), Sad (Tristura) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |
## Training dataset
[HiTZ-Aholab emotional speech synthesis dataset in Basque](https://zenodo.org/records/18804769) — emotional speech corpus.
- **Number of speakers:** two (Antton, Maider)
- **Audio:** 16,000 utterances per speaker, totalling approximately 43 hours and 58 minutes
  - Maider: ~21h 22min
  - Antton: ~22h 36min
- **Emotions:** four categories (4,000 utterances per emotion per speaker): Poza (joy), Haserre (anger), Harridura (surprise), Tristura (sadness)
  - *Note: although the dataset contains four emotions, this model was trained on a balanced subset of three, Neutral, Happy (Poza), and Sad (Tristura), with the same number of samples per emotion.*
- **Dataset split:** 100 samples for validation, 600 for testing (300 per speaker)
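
As a quick sanity check on the figures above, the per-speaker durations add up to the stated total, and dividing by the utterance count gives a rough average utterance length. The average is derived here from the approximate numbers listed in this card; it is not a statistic reported by the dataset authors.

```python
from datetime import timedelta

# Approximate per-speaker durations as listed in this card.
durations = {
    "Maider": timedelta(hours=21, minutes=22),
    "Antton": timedelta(hours=22, minutes=36),
}
utterances_per_speaker = 16_000

total = sum(durations.values(), timedelta())
total_minutes = int(total.total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)

total_utterances = utterances_per_speaker * len(durations)
avg_seconds = total.total_seconds() / total_utterances

print(f"total: {hours} h {minutes} min over {total_utterances} utterances")
print(f"average utterance length: about {avg_seconds:.2f} s")
```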
## Training
Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml`):
- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 1
- **Max length:** max_len = 500
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
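
The audio settings above translate into concrete time scales. The short calculation below is a sketch that assumes `max_len` is counted in mel frames, as in the upstream StyleTTS2 configuration; if this repository counts it differently, the last figure does not apply.

```python
# Values copied from the training summary above.
SR = 24_000          # sample rate in Hz
HOP_LENGTH = 300     # samples between consecutive mel frames
WIN_LENGTH = 1_200   # analysis window in samples
MAX_LEN = 500        # assumed to be a number of mel frames

hop_ms = 1000 * HOP_LENGTH / SR              # time step between frames
win_ms = 1000 * WIN_LENGTH / SR              # analysis window duration
frames_per_second = SR / HOP_LENGTH          # mel frame rate
max_clip_seconds = MAX_LEN * HOP_LENGTH / SR # longest training clip

print(f"hop: {hop_ms} ms, window: {win_ms} ms")
print(f"frame rate: {frames_per_second} frames/s")
print(f"max training clip: {max_clip_seconds} s")
```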
## Files in this repository
| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_3580000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |
> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.
## Setup
```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque
# 2. Install dependencies
pip install -r requirements.txt
# 3. Create the target directories
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_emo Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
# 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific)
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
# 5. Download the model weights from this HF repo with huggingface_hub and copy them into place
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/StyleTTS2-eu_emo"
# Maps each file name in the HF repo to its destination in the code repository.
files = {
    "config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
    "step_3580000.t7": "Utils/PLBERT_phoneme/step_3580000.t7",
}
for hf_name, local_path in files.items():
    # hf_hub_download returns the path of the file in the local HF cache.
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```
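
After running the steps above, it can help to confirm that every file ended up where the inference commands expect it. The following is a minimal sketch using only the standard library; the `missing_files` helper is hypothetical and not part of the repository, and the paths are the destinations used in the Setup script.

```python
from pathlib import Path

# Destinations used by the Setup script, relative to the repository root.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_3580000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root: str = ".") -> list:
    """Return the expected files that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


# Run from the root of the cloned code repository.
for rel in missing_files():
    print(f"missing: {rel}")
```

An empty output means all five files are in place.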
## Inference
**CLI:**
```bash
python inference.py \
  --config Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml \
  --model Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth \
  --ref Demo/ref_antton_poz.wav \
  --text "Kaixo, zelan zaude?" \
  --output output/kaixo.wav
```
**Python API:**
```python
from inference import Synthesizer
synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton_neu.wav',
)
# Neutral emotion
wav = synth.run("Kaixo, zelan zaude?", ref='Demo/ref_antton_neu.wav')
synth.save(wav, "output/kaixo_neu.wav")
# Happy emotion (using poza reference)
wav2 = synth.run("Zorioneko gara!", ref='Demo/ref_antton_poz.wav')
synth.save(wav2, "output/kaixo_poz.wav")
# Sad emotion (using tristura reference)
wav3 = synth.run("Hau oso tristea da.", ref='Demo/ref_antton_tri.wav')
synth.save(wav3, "output/kaixo_tri.wav")
```
Key parameters for `run()`:
| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker & emotion style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
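
To make the `alpha` and `beta` rows concrete: in the upstream StyleTTS2 inference code, each weight linearly interpolates between the style vector taken from the reference audio and the one sampled by the diffusion model. The sketch below assumes this repository does the same; `mix_style` is a hypothetical helper working on plain lists, not a function from `inference.py`.

```python
def mix_style(reference, sampled, weight):
    """Linear interpolation: 0 keeps the reference, 1 keeps the sampled style."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [weight * s + (1.0 - weight) * r for r, s in zip(reference, sampled)]


# Toy two-dimensional style vectors, for illustration only.
ref_style = [1.0, 0.0]
sampled_style = [0.0, 1.0]

timbre = mix_style(ref_style, sampled_style, 0.3)   # alpha default
prosody = mix_style(ref_style, sampled_style, 0.7)  # beta default
print([round(v, 2) for v in timbre])
print([round(v, 2) for v in prosody])
```

With the defaults, timbre stays closer to the reference speaker (`alpha = 0.3`) while prosody leans towards the sampled style (`beta = 0.7`).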
## Reference speakers
Six reference audio files are included in the repo under `Demo/`, covering both speakers and all three emotions:
| Speaker | Neutral | Happy | Sad |
|---------|---------|-------|-----|
| Antton (male) | `ref_antton_neu.wav` | `ref_antton_poz.wav` | `ref_antton_tri.wav` |
| Maider (female) | `ref_maider_neu.wav` | `ref_maider_poz.wav` | `ref_maider_tri.wav` |
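
Since the file names follow the pattern `ref_<speaker>_<tag>.wav`, with the tags from the Emotions table, the reference path can be built programmatically. The `reference_path` helper below is a hypothetical convenience, not part of the repository.

```python
from pathlib import Path

SPEAKERS = ("antton", "maider")
# Emotion tags as listed in the Emotions table.
EMOTION_TAGS = {"neutral": "neu", "happy": "poz", "sad": "tri"}


def reference_path(speaker: str, emotion: str, demo_dir: str = "Demo") -> Path:
    """Build the path of the reference WAV for a speaker and an emotion."""
    speaker = speaker.lower()
    emotion = emotion.lower()
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker}")
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion: {emotion}")
    return Path(demo_dir) / f"ref_{speaker}_{EMOTION_TAGS[emotion]}.wav"


# The resulting path can be passed as the `ref` argument of `synth.run()`.
print(reference_path("Maider", "sad").as_posix())
```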
All credit goes to the authors of StyleTTS2.
## Citation
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
booktitle = {Advances in Neural Information Processing Systems},
year = {2023},
}
```
## Additional Information
### Authors
- [Ander Arriandiaga](https://huggingface.co/arrandi) — Aholab (HiTZ), EHU
- [Inmaculada Hernáez Rioja](mailto:inma.hernaez@ehu.eus) — Aholab (HiTZ), EHU
### Contact
For further information, please send an email to <inma.hernaez@ehu.eus>.
### Copyright
Copyright (c) 2026 by Aholab, HiTZ.
### License
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.