---
language: eu
license: apache-2.0
tags:
- text-to-speech
- basque
- styletts2
- multispeaker
---
# StyleTTS2 — Basque Multispeaker TTS
This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque speech synthesis. It was trained from scratch on the multispeaker Basque [Sonora](https://zenodo.org/records/17952596) speech corpus. The audio samples below give an impression of the synthesis quality.
Examples (playable):
- **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_antton.wav">Your browser does not support the audio element.</audio>
- **Sample 2** — "Herriko errekan bakarrik korrika."
<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_maider.wav">Your browser does not support the audio element.</audio>
Main modifications:
- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
- ASR-eu: ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the code used to generate the IPA phonemes is in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
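
The phoneme-collapsing step mentioned above can be sketched as follows. This is only an illustration: the mapping table and the function name `collapse_phonemes` are hypothetical, and the symbols actually used by the `phonemizer` code in this repository may differ.

```python
# Hypothetical mapping from multi-character IPA phonemes to single characters.
# The real table used by the Basque phonemizer is not shown in this card.
COLLAPSE_MAP = {
    "t\u0361\u0283": "\u02a7",  # t + tie bar + ʃ  ->  ʧ
    "t\u0361s": "\u02a6",       # t + tie bar + s  ->  ʦ
}


def collapse_phonemes(ipa: str, mapping: dict = COLLAPSE_MAP) -> str:
    """Replace each multi-character phoneme with its single-character stand-in.

    Longer sequences are replaced first so that a shorter key can never
    split a longer one.
    """
    for multi in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(multi, mapping[multi])
    return ipa


if __name__ == "__main__":
    word = "e" + "t\u0361\u0283" + "e"      # five code points
    collapsed = collapse_phonemes(word)     # three code points
    print(len(word), len(collapsed))
```

After collapsing, every phoneme occupies exactly one character, so character positions line up one-to-one with phoneme positions.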
## Model details
| | |
|---|---|
| Architecture | StyleTTS2 (from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |
## Training dataset
[Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset.
- Number of speakers: two
- Audio: 13,500 utterances per speaker, totalling 34 hours and 18 minutes.
- Dataset split: We used 100 samples for validation and 500 for testing.
- OOD dataset: We used a separate text dataset as the out-of-distribution (OOD) dataset.
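
As a rough sanity check on these figures, the mean utterance length can be derived with a few lines of arithmetic. The sketch assumes that the 34 h 18 min total covers both speakers together; the card does not state this explicitly.

```python
SPEAKERS = 2
UTTERANCES_PER_SPEAKER = 13_500
# Assumption: the stated duration is the total over both speakers.
TOTAL_SECONDS = 34 * 3600 + 18 * 60

total_utterances = SPEAKERS * UTTERANCES_PER_SPEAKER
mean_utterance_s = TOTAL_SECONDS / total_utterances

print(f"utterances: {total_utterances}")
print(f"mean length: {mean_utterance_s:.2f} s")
```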
## Training
Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_800.yml`):
- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 2
- **Max length:** max_len = 500
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
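
The audio settings above translate into frame timings as sketched below. The interpretation of `max_len` as a count of mel frames follows the upstream StyleTTS2 configuration and is an assumption here.

```python
SR = 24_000        # sample rate in Hz
N_FFT = 2_048
WIN_LENGTH = 1_200
HOP_LENGTH = 300
MAX_LEN = 500      # assumption: measured in mel frames, as in upstream StyleTTS2

frames_per_second = SR / HOP_LENGTH        # mel frames produced per second of audio
hop_ms = 1000 * HOP_LENGTH / SR            # time step between consecutive frames
win_ms = 1000 * WIN_LENGTH / SR            # analysis window duration
freq_bins = N_FFT // 2 + 1                 # STFT bins before the 80-band mel projection
max_clip_s = MAX_LEN * HOP_LENGTH / SR     # longest training segment under the assumption

print(frames_per_second, hop_ms, win_ms, freq_bins, max_clip_s)
```

Under that assumption, training segments are capped at roughly six seconds of audio.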
## Files in this repository
| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |
> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.
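
Before running inference, it can help to confirm that every file is where the table and note above say it should be. The following is a minimal sketch; the helper name `missing_files` is illustrative and not part of the repository.

```python
from pathlib import Path

# Target locations taken from the table and the JDC note above.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/"
    "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root="."):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```

Run it from the root of the cloned code repository.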
## Setup
```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download model weights from this HF repo and place them:
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
# Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
# using huggingface_hub:
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil
repo = "HiTZ/StyleTTS2-eu"  # repo ID as it appears in the sample audio URLs of this card
files = {
    "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
    "step_4000000.t7": "Utils/PLBERT_phoneme/step_4000000.t7",
}
# bst.t7 comes from the original StyleTTS2 repo — download separately:
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```
## Inference
**CLI:**
```bash
python inference.py \
  --config Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
  --model Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
  --ref Demo/ref_antton.wav \
  --text "Kaixo, zelan zaude?" \
  --output output/kaixo.wav
```
**Python API:**
```python
from inference import Synthesizer
synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton.wav',
)
wav = synth.run("Kaixo, zelan zaude?")
synth.save(wav, "output/kaixo.wav")
# Different speaker
wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
synth.save(wav2, "output/arratsalde.wav")
```
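
For several sentences at once, the two calls shown above can be wrapped in a small loop. The helper below is a sketch, not part of the repository: `synthesize_batch` and the `utt_NNN.wav` naming are illustrative, and the only API it relies on is `run(text, ref=...)` and `save(wav, path)` as used in the example above.

```python
from pathlib import Path


def synthesize_batch(synth, sentences, out_dir, ref=None):
    """Synthesize each sentence and write it to out_dir/utt_NNN.wav.

    `synth` is any object exposing run(text, ref=...) and save(wav, path),
    like the Synthesizer instance in the example above.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Pass `ref` only when given, so the constructor default is used otherwise.
    kwargs = {} if ref is None else {"ref": ref}
    paths = []
    for index, text in enumerate(sentences):
        wav = synth.run(text, **kwargs)
        path = out_dir / f"utt_{index:03d}.wav"
        synth.save(wav, str(path))
        paths.append(path)
    return paths
```

With the `synth` object from the example, a call such as `synthesize_batch(synth, ["Kaixo, zelan zaude?", "Arratsalde on!"], "output", ref="Demo/ref_maider.wav")` would write one file per sentence.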
Key parameters for `run()`:
| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
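
The "0 = reference, 1 = sampled" convention for `alpha` and `beta` corresponds to a linear interpolation between two style vectors. The sketch below illustrates that convention with plain lists; it mirrors how upstream StyleTTS2 inference blends styles, but whether `run()` does exactly this internally is an assumption, and `mix_style` is a hypothetical name.

```python
def mix_style(reference, sampled, weight):
    """Blend two style vectors: weight=0 keeps the reference, weight=1 the sampled style."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * s + (1.0 - weight) * r for r, s in zip(reference, sampled)]


if __name__ == "__main__":
    ref_style = [0.0, 2.0]       # toy stand-in for the style taken from the reference WAV
    sampled_style = [1.0, 4.0]   # toy stand-in for the style sampled by the diffusion model
    print(mix_style(ref_style, sampled_style, 0.0))
    print(mix_style(ref_style, sampled_style, 0.5))
    print(mix_style(ref_style, sampled_style, 1.0))
```

With the defaults in the table, timbre stays close to the reference (`alpha` = 0.3) while prosody leans toward the sampled style (`beta` = 0.7).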
## Reference speakers
Two reference audio files are included in the repo under `Demo/`:
- `ref_antton.wav` — male speaker
- `ref_maider.wav` — female speaker
All credit goes to the authors of StyleTTS2.
## Citation
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
booktitle = {Advances in Neural Information Processing Systems},
year = {2023},
}
```
## Additional Information
### Author
Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU
### Contact
For further information, please send an email to <inma.hernaez@ehu.eus>.
### Copyright
Copyright (c) 2026 by Aholab, HiTZ.
### License
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.