---
language: eu
license: apache-2.0
tags:
- text-to-speech
- basque
- styletts2
- multispeaker
- emotional
---

# StyleTTS2: Basque Multispeaker Emotional TTS

This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture, adapted for **emotional Basque speech synthesis**. The model supports three emotional styles: neutral, happy (poza), and sad (tristura).

Examples (playable):

- **Sample 1, Antton (Neutral):** "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
- **Sample 1, Antton (Happy):** "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
- **Sample 1, Antton (Sad):** "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."
- **Sample 2, Maider (Neutral):** "Gure patua hau izatea litekeena da, baina okerra deritzot."
- **Sample 2, Maider (Happy):** "Gure patua hau izatea litekeena da, baina okerra deritzot."
- **Sample 2, Maider (Sad):** "Gure patua hau izatea litekeena da, baina okerra deritzot."

Main modifications:

- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
- ASR-eu: ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
- Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate the IPA phonemes used for training. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the phonemizer code itself is in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.

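
The phoneme-collapsing step described above can be sketched as follows. This is only an illustration under stated assumptions: the mapping table `COLLAPSE_MAP` and the function name `collapse_phonemes` are hypothetical, and the symbol pairs shown are not taken from the `phonemizer` directory, whose actual inventory may differ.

```python
from typing import Mapping

# Hypothetical mapping from multi-character IPA sequences to single
# characters. These pairs are illustrative assumptions, not the table
# that was used to train this model.
COLLAPSE_MAP = {
    "tʃ": "ʧ",
    "ts": "ʦ",
    "dʒ": "ʤ",
}


def collapse_phonemes(ipa: str, mapping: Mapping[str, str] = COLLAPSE_MAP) -> str:
    """Replace every multi-character phoneme with its single-character symbol."""
    # Process longer keys first so a long sequence is never split by a
    # shorter key that happens to be its prefix.
    for multi in sorted(mapping, key=len, reverse=True):
        ipa = ipa.replace(multi, mapping[multi])
    return ipa


if __name__ == "__main__":
    # Illustrative IPA string containing one two-character affricate.
    print(collapse_phonemes("etʃe"))
```

After collapsing, each phoneme occupies exactly one character, so the length of the string equals the number of phonemes, which is the property that simplifies alignment.
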
## Emotions

The original [dataset](https://zenodo.org/records/18804769) contains four emotion categories. This model was trained on a subset of three emotions (Neutral, Happy, and Sad), as listed below.

| Emotion | Basque Tag | Description |
|---------|------------|-------------|
| Neutral | `neu` | Neutral/calm delivery |
| Happy | `poz` | Happy/expressive delivery (Poza) |
| Sad | `tri` | Sad/contemplative delivery (Tristura) |

## Model details

| Property | Value |
|---|---|
| Architecture | StyleTTS2 (trained from scratch) |
| Language | Basque (`eu`) |
| Speakers | Multispeaker (two speakers: Antton, Maider) |
| Emotions | Neutral, Happy (Poza), Sad (Tristura) |
| Text input | Basque IPA phonemes |
| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
| Sample rate | 24 000 Hz |
| Decoder | HiFiGAN |

## Training dataset

[HiTZ-Aholab emotional speech synthesis dataset in Basque](https://zenodo.org/records/18804769): emotional speech corpus.

- **Number of speakers:** two (Antton, Maider)
- **Audio:** 16,000 utterances per speaker, totalling approximately 43 hours and 58 minutes
  - Maider: ~21h 22min
  - Antton: ~22h 36min
- **Emotions:** four categories (4,000 utterances per emotion per speaker): Poza (joy), Haserre (anger), Harridura (surprise), Tristura (sadness)
  - *Note: although the dataset contains four emotions, this model was trained on a balanced subset of three (Neutral, Happy (Poza), Sad (Tristura)), with the same number of samples per emotion.*
- **Dataset split:** 100 samples for validation, 600 for testing (300 per speaker)

## Training

Brief summary of the training parameters used (from `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml`):

- **Device:** cuda
- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
- **Batch:** batch_size = 1
- **Max length:** max_len = 500
- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0

## Files in this repository

| File | Description |
|---|---|
| `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml` |
| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/` |
| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
| `step_3580000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |

> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific. Download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.

## Setup

```bash
# 1. Clone the code repository
git clone https://github.com/AArriandiaga/StyleTTS2_basque
cd StyleTTS2_basque

# 2. Install dependencies
pip install -r requirements.txt

# 3. Download model weights from this HF repo and place them:
mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_emo Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC

# Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7

# using huggingface_hub:
python - <<'EOF'
from huggingface_hub import hf_hub_download
import shutil

repo = "HiTZ/StyleTTS2-eu_emo"

# Map each file name in the HF repo to its expected local path.
files = {
    "config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
    "step_3580000.t7": "Utils/PLBERT_phoneme/step_3580000.t7",
}

# bst.t7 is not in this HF repo; it is fetched by the wget command above from
# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
for hf_name, local_path in files.items():
    src = hf_hub_download(repo_id=repo, filename=hf_name)
    shutil.copy(src, local_path)
    print(f"✓ {local_path}")
EOF
```

## Inference

**CLI:**

```bash
python inference.py \
    --config Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml \
    --model Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth \
    --ref Demo/ref_antton_poz.wav \
    --text "Kaixo, zelan zaude?" \
    --output output/kaixo.wav
```

**Python API:**

```python
from inference import Synthesizer

synth = Synthesizer(
    config='Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml',
    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth',
    default_ref='Demo/ref_antton_neu.wav',
)

# Neutral emotion
wav = synth.run("Kaixo, zelan zaude?", ref='Demo/ref_antton_neu.wav')
synth.save(wav, "output/kaixo_neu.wav")

# Happy emotion (using poza reference)
wav2 = synth.run("Zorioneko gara!", ref='Demo/ref_antton_poz.wav')
synth.save(wav2, "output/kaixo_poz.wav")

# Sad emotion (using tristura reference)
wav3 = synth.run("Hau oso tristea da.", ref='Demo/ref_antton_tri.wav')
synth.save(wav3, "output/kaixo_tri.wav")
```

Key parameters for `run()`:

| Parameter | Default | Description |
|---|---|---|
| `ref` | constructor default | Reference WAV for speaker & emotion style |
| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
| `diffusion_steps` | 5 | Quality vs. speed trade-off |
| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |

## Reference speakers

Six reference audio files are included in the repo under `Demo/`, covering both speakers and all three emotions:

| Speaker | Neutral | Happy | Sad |
|---------|---------|-------|-----|
| Antton (male) | `ref_antton_neu.wav` | `ref_antton_poz.wav` | `ref_antton_tri.wav` |
| Maider (female) | `ref_maider_neu.wav` | `ref_maider_poz.wav` | `ref_maider_tri.wav` |

All credit goes to the authors of StyleTTS2.

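
Because the reference files follow the pattern `ref_<speaker>_<tag>.wav`, choosing a reference for a given speaker and emotion can be automated. The sketch below is a hypothetical helper, not part of the repository: the name `reference_path` and the English emotion keys are assumptions made for illustration, while the speaker names, Basque tags, and `Demo/` directory come from the tables in this card.

```python
from pathlib import Path

# Speakers and emotion tags as listed in this model card.
SPEAKERS = ("antton", "maider")
EMOTION_TAGS = {"neutral": "neu", "happy": "poz", "sad": "tri"}


def reference_path(speaker: str, emotion: str, demo_dir: str = "Demo") -> Path:
    """Build the path of the bundled reference WAV for a speaker and emotion.

    Hypothetical helper: it only assembles the file name and does not check
    that the file exists on disk.
    """
    speaker = speaker.lower()
    emotion = emotion.lower()
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker {speaker!r}; expected one of {SPEAKERS}")
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion {emotion!r}; expected one of {sorted(EMOTION_TAGS)}")
    return Path(demo_dir) / f"ref_{speaker}_{EMOTION_TAGS[emotion]}.wav"


if __name__ == "__main__":
    print(reference_path("Maider", "sad").as_posix())
```

The resulting path, converted with `str(...)`, could then be passed as the `ref` argument of `run()`, assuming `run()` accepts a plain string path as in the examples above.
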
## Citation

```bibtex
@inproceedings{li2023styletts2,
  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author    = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2023},
}
```

## Additional Information

### Authors

- [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU
- [Inmaculada Hernáez Rioja](mailto:inma.hernaez@ehu.eus), Aholab (HiTZ), EHU

### Contact

For further information, please send an email to .

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.