---
language: eu
license: apache-2.0
tags:
- text-to-speech
- basque
- styletts2
- multispeaker
---
# StyleTTS2 — Basque Multispeaker TTS
This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque synthesis. It was trained from scratch on the Basque multispeaker [Sonora](https://zenodo.org/records/17952596) speech corpus.
Example sentences:
- **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."
- **Sample 2** — "Herriko errekan bakarrik korrika."
Main modifications with respect to the original StyleTTS2:
- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): a PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
- ASR-eu: an ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. This code is in the `phonemizer` directory, and a demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). We collapsed multi-character phonemes into single-character symbols for better grapheme–phoneme alignment.
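
The collapsing of multi-character phonemes can be illustrated with a minimal sketch. The mapping below is hypothetical: the table actually used is in the `phonemizer` directory, and both the IPA sequences and the replacement characters here were chosen only for illustration. The function name `collapse_phonemes` is likewise an assumption.

```python
# Hypothetical mapping from multi-character IPA sequences to single
# characters. The real table lives in the `phonemizer` directory.
COLLAPSE_MAP = {
    "tʃ": "ʧ",  # affricate written as two characters -> one ligature
    "ts": "ʦ",
    "dʒ": "ʤ",
}


def collapse_phonemes(ipa: str, mapping: dict = COLLAPSE_MAP) -> str:
    """Replace each multi-character phoneme with its single-character symbol."""
    # Longer sequences are replaced first so that a short key can never
    # consume part of a longer one.
    for seq in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(seq, mapping[seq])
    return ipa


# Illustrative transcription of "etxe" (house): four characters, three phonemes.
print(collapse_phonemes("etʃe"))
```

After collapsing, the string length equals the number of phonemes, so each phoneme occupies exactly one input position.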
## Model details
| | |
|---|---|
| Architecture | StyleTTS2 (from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |
## Training dataset
[Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset.
- Number of speakers: two
- Audio: 13,500 utterances per speaker, totalling 34 hours and 18 minutes.
- Dataset split: We used 100 samples for validation and 500 for testing.
- OOD dataset: A separate text dataset serves as the out-of-distribution (OOD) set.
## Training
Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_800.yml`):
- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 2
- **Max length:** max_len = 500
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
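
To make the audio settings more concrete, the short calculation below derives the frame rate, window duration, and maximum training clip length from the values listed above. It assumes that `max_len` is measured in mel frames, as in the upstream StyleTTS2 configuration; this was not verified against this repository's training code.

```python
# Values copied from the training summary above.
SR = 24_000
N_FFT = 2_048
WIN_LENGTH = 1_200
HOP_LENGTH = 300
MAX_LEN = 500  # assumed to be a number of mel frames (upstream convention)

frames_per_second = SR / HOP_LENGTH            # 80.0 mel frames per second
hop_ms = 1000 * HOP_LENGTH / SR                # 12.5 ms between frames
win_ms = 1000 * WIN_LENGTH / SR                # 50.0 ms analysis window
max_clip_seconds = MAX_LEN * HOP_LENGTH / SR   # 6.25 s per training clip
freq_bins = N_FFT // 2 + 1                     # 1025 STFT bins before the mel filterbank

print(f"{frames_per_second:.1f} frames/s, hop {hop_ms} ms, window {win_ms} ms")
print(f"max training clip: {max_clip_seconds} s, STFT bins: {freq_bins}")
```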
## Files in this repository
| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |
> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.
## Setup
```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque
# 2. Install dependencies
pip install -r requirements.txt
# 3. Create the target directories
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
# 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific)
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
# 5. Download the model weights from this HF repo with huggingface_hub and copy them into place
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/styletts2-basque"
# Maps each file name on the Hub to its destination in the code repository.
files = {
    "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
    "step_4000000.t7": "Utils/PLBERT_phoneme/step_4000000.t7",
}
# bst.t7 is not hosted in this HF repo; it comes from the original StyleTTS2 repo:
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)  # cached download
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```
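
Before running inference, it can help to confirm that every file ended up where the code expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the paths are the ones listed in the table and note above.

```python
from pathlib import Path

# Expected layout, taken from the "Files in this repository" table and the
# note about the JDC F0 extractor.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root="."):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```

Run it from the root of the cloned `StyleTTS2_basque` directory.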
## Inference
**CLI:**
```bash
python inference.py \
    --config Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
    --model Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
    --ref Demo/ref_antton.wav \
    --text "Kaixo, zelan zaude?" \
    --output output/kaixo.wav
```
**Python API:**
```python
from inference import Synthesizer

synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton.wav',
)

# Synthesize with the default reference speaker
wav = synth.run("Kaixo, zelan zaude?")
synth.save(wav, "output/kaixo.wav")

# Different speaker
wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
synth.save(wav2, "output/arratsalde.wav")
```
Key parameters for `run()`:
| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
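
The effect of `alpha` and `beta` can be sketched as a linear interpolation between two style vectors: one computed from the reference audio and one sampled by the diffusion model. In the upstream StyleTTS2 inference code, the first `style_dim` entries of the style vector carry the timbre part and the remaining entries carry the prosody part. The sketch below assumes that `run()` follows the same scheme, which was not verified against this repository; the function name `mix_styles` is hypothetical.

```python
STYLE_DIM = 128  # style_dim from the training configuration


def mix_styles(ref_style, sampled_style, alpha=0.3, beta=0.7, style_dim=STYLE_DIM):
    """Blend reference and sampled style vectors, each of length 2 * style_dim.

    A weight of 0 keeps the reference style; a weight of 1 keeps the sampled one.
    """
    if len(ref_style) != 2 * style_dim or len(sampled_style) != 2 * style_dim:
        raise ValueError("style vectors must have length 2 * style_dim")

    def lerp(weight, sampled, ref):
        return [weight * s + (1 - weight) * r for s, r in zip(sampled, ref)]

    # First half: timbre, controlled by alpha.
    timbre = lerp(alpha, sampled_style[:style_dim], ref_style[:style_dim])
    # Second half: prosody, controlled by beta.
    prosody = lerp(beta, sampled_style[style_dim:], ref_style[style_dim:])
    return timbre + prosody


# Toy vectors: reference is all zeros, sampled is all ones.
mixed = mix_styles([0.0] * 256, [1.0] * 256)
print(mixed[0], mixed[STYLE_DIM])  # timbre weight, prosody weight
```

With the defaults, the output stays closer to the reference speaker's timbre while taking more of its prosody from the text-conditioned sample.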
## Reference speakers
Two reference audios are included in the repo under `Demo/`:
- `ref_antton.wav` — male speaker
- `ref_maider.wav` — female speaker
All credit goes to the authors of StyleTTS2.
## Citation
```bibtex
@inproceedings{li2023styletts2,
title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
booktitle = {Advances in Neural Information Processing Systems},
year = {2023},
}
```
## Additional Information
### Author
[Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU
### Contact
For further information, please send an email to .
### Copyright
Copyright (c) 2026 by Aholab, HiTZ.
### License
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.