| --- |
| language: eu |
| license: apache-2.0 |
| tags: |
| - text-to-speech |
| - basque |
| - styletts2 |
| - multispeaker |
| --- |
| |
| # StyleTTS2 — Basque Multispeaker TTS |
|
|
| This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque. It was trained from scratch on the multispeaker Basque [Sonora](https://zenodo.org/records/17952596) speech corpus. The audio samples below illustrate the synthesis quality. |
| |
| Examples (playable): |
|
|
| - **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da." |
|
|
| <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_antton.wav">Your browser does not support the audio element.</audio> |
|
|
| - **Sample 2** — "Herriko errekan bakarrik korrika." |
|
|
| <audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu/resolve/main/sample_maider.wav">Your browser does not support the audio element.</audio> |
|
|
| Main modifications with respect to the original StyleTTS2: |
| - [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): a PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text. |
| - ASR-eu: an ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2. |
| - Phonemizer: the IPA phonemes used for training were generated with code developed by [Aholab](https://aholab.ehu.eus/aholab/); this code can be found in the `phonemizer` directory. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Multi-character phonemes were collapsed into single-character phonemes for better grapheme–phoneme alignment. |
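
The phoneme-collapsing step mentioned in the last item can be illustrated with a minimal sketch. The mapping and the `collapse_phonemes` helper below are hypothetical and are not the table used by the Aholab phonemizer. They only show the idea: each multi-character IPA phoneme is replaced by one single-character symbol, so that every phoneme occupies exactly one position in the input sequence.

```python
# Hypothetical mapping from multi-character IPA phonemes to single characters.
# The symbols actually used by the Aholab phonemizer may differ.
COLLAPSE_MAP = {
    "tʃ": "ʧ",
    "ts": "ʦ",
    "dʒ": "ʤ",
}


def collapse_phonemes(ipa: str, mapping: dict = COLLAPSE_MAP) -> str:
    """Replace every multi-character phoneme in `ipa` with its single-character symbol."""
    # Replace longer keys first so that overlapping keys are not split incorrectly.
    for multi in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(multi, mapping[multi])
    return ipa


example = "etʃe"  # illustrative rough transcription of Basque "etxe" (house)
collapsed = collapse_phonemes(example)
print(example, len(example), "->", collapsed, len(collapsed))
```

After collapsing, the length of the string equals the number of phonemes, which is what makes a one-to-one alignment between symbols and phonemes possible.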
|
|
|
|
|
|
|
|
| ## Model details |
|
|
| Property | Value | |
| |---|---| |
| | Architecture | StyleTTS2 (from scratch) | |
| | Language | Basque (`eu`) | |
| | Speakers | Multispeaker (two speakers) | |
| | Text input | Basque IPA phonemes | |
| | Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) | |
| | Sample rate | 24 000 Hz | |
| | Decoder | HiFiGAN | |
|
|
| ## Training dataset |
|
|
| The model was trained on the [Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset: |
| - Number of speakers: 2 |
| - Audio: 13,500 utterances per speaker, totalling 34 hours and 18 minutes. |
| - Dataset split: 100 samples were held out for validation and 500 for testing. |
| - OOD dataset: a separate text dataset was used as the out-of-distribution (OOD) dataset. |
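
A held-out split of this size could be produced as in the sketch below. The seeded random selection, the placeholder utterance IDs, and the `split_utterances` helper are assumptions for illustration; the model card does not state how the validation and test utterances were chosen.

```python
import random


def split_utterances(utt_ids, n_val=100, n_test=500, seed=0):
    """Shuffle utterance IDs reproducibly and hold out validation and test sets."""
    ids = list(utt_ids)
    if len(ids) < n_val + n_test:
        raise ValueError("not enough utterances for the requested split")
    random.Random(seed).shuffle(ids)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


# 2 speakers x 13,500 utterances, using placeholder IDs (not real Sonora file names)
all_ids = [f"spk{s}_{i:05d}" for s in (1, 2) for i in range(13_500)]
train, val, test = split_utterances(all_ids)
print(len(train), len(val), len(test))  # 26400 100 500
```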
|
|
| ## Training |
|
|
| Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_800.yml`): |
|
|
| - **Device:** cuda |
| - **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30 |
| - **Batch:** batch_size = 2 |
| - **Max length:** max_len = 500 |
| - **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5 |
| - **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300) |
| - **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN |
| - **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2) |
| - **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0 |
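
The audio settings above fix the time resolution of the mel-spectrogram. The arithmetic below is a small worked example. Reading `max_len` as a number of mel frames follows the upstream StyleTTS2 configs and is an assumption here.

```python
# Values copied from the training summary above
sr = 24_000        # sample rate in Hz
hop_length = 300   # samples between consecutive frames
win_length = 1_200 # samples per analysis window
max_len = 500      # assumed to be a number of mel frames

frames_per_second = sr / hop_length             # mel frames per second of audio
hop_ms = 1000 * hop_length / sr                 # frame shift in milliseconds
win_ms = 1000 * win_length / sr                 # window length in milliseconds
max_clip_seconds = max_len * hop_length / sr    # longest training segment in seconds

print(f"{frames_per_second:.0f} frames/s, hop {hop_ms} ms, window {win_ms} ms")
print(f"max_len = {max_len} frames is about {max_clip_seconds} s of audio")
```

Under that assumption, each training segment is capped at 6.25 seconds, which helps explain how training fits in memory with `batch_size = 2`.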
| |
| |
| ## Files in this repository |
| |
| | File | Description | |
| |---|---| |
| | `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` | |
| | `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` | |
| | `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` | |
| | `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` | |
|
|
| > **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`. |
|
|
| ## Setup |
|
|
| ```bash |
| # 1. Clone the code repository |
| git clone https://github.com/AArriandiaga/StyleTTS2_basque |
| cd StyleTTS2_basque |
| |
| # 2. Install dependencies |
| pip install -r requirements.txt |
| |
| # 3. Create the directories where the weights will be placed |
| mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC |
| # 4. Download bst.t7 from the original StyleTTS2 repo (not Basque-specific) |
| wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7 |
| |
| # 5. Download the Basque model weights from this HF repo using huggingface_hub |
| python - <<'EOF' |
| from huggingface_hub import hf_hub_download |
| import shutil |
| |
| repo = "HiTZ/StyleTTS2-eu"  # repo ID as it appears in the sample audio URLs of this model card |
| files = { |
| "config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml", |
| "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth", |
| "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth", |
| "step_4000000.t7": "Utils/PLBERT_phoneme/step_4000000.t7", |
| } |
| # bst.t7 comes from the original StyleTTS2 repo — download separately: |
| # https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC |
| for hf_name, local_path in files.items(): |
| src = hf_hub_download(repo_id=repo, filename=hf_name) |
| shutil.copy(src, local_path) |
| print(f"✓ {local_path}") |
| EOF |
| ``` |
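
Before running inference, it can be useful to confirm that every file ended up where the code expects it. The following is a minimal sketch; the paths are copied from the file table and the JDC note above, while the `missing_files` helper is a hypothetical name introduced only for this example.

```python
from pathlib import Path

# Expected locations, relative to the root of the cloned code repository
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root=".", expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist as files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All model files are in place.")
```

Run it from the root of the cloned repository; any path it lists still needs to be downloaded or moved.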
|
|
| ## Inference |
|
|
| **CLI:** |
| ```bash |
| python inference.py \ |
| --config Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \ |
| --model Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \ |
| --ref Demo/ref_antton.wav \ |
| --text "Kaixo, zelan zaude?" \ |
| --output output/kaixo.wav |
| ``` |
|
|
| **Python API:** |
| ```python |
| from inference import Synthesizer |
| |
| synth = Synthesizer( |
| config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml', |
| checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth', |
| default_ref='Demo/ref_antton.wav', |
| ) |
| |
| wav = synth.run("Kaixo, zelan zaude?") |
| synth.save(wav, "output/kaixo.wav") |
| |
| # Different speaker |
| wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav') |
| synth.save(wav2, "output/arratsalde.wav") |
| ``` |
|
|
| Key parameters for `run()`: |
|
|
| | Parameter | Default | Description | |
| |---|---|---| |
| | `ref` | constructor default | Reference WAV for speaker style | |
| | `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) | |
| | `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) | |
| `diffusion_steps` | 5 | Number of diffusion sampling steps (more steps are slower; quality vs. speed trade-off) | |
| | `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) | |
|
|
| ## Reference speakers |
|
|
| Two reference audio files are included in the repository under `Demo/`: |
| - `ref_antton.wav` — male speaker |
| - `ref_maider.wav` — female speaker |
|
|
|
|
| All credit goes to the authors of StyleTTS2. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{li2023styletts2, |
| title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, |
| author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima}, |
| booktitle = {Advances in Neural Information Processing Systems}, |
| year = {2023}, |
| } |
| ``` |
|
|
| ## Additional Information |
|
|
|
|
| ### Author |
|
|
| Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU |
|
|
| ### Contact |
| For further information, please send an email to <inma.hernaez@ehu.eus>. |
|
|
| ### Copyright |
| Copyright (c) 2026 by Aholab, HiTZ. |
|
|
| ### License |
|
|
| [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
| ### Funding |
| This work has been funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA. |
|
|