	---
	language: eu
	license: apache-2.0
	tags:
	- text-to-speech
	- basque
	- styletts2
	- multispeaker
	- emotional
	---

	# StyleTTS2 — Basque Multispeaker Emotional TTS

	This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture, adapted for emotional Basque speech synthesis. The model supports three emotional styles: neutral, happy (poza), and sad (tristura).

	Examples (playable):

	- Sample 1 — Antton (Neutral) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_neutral.wav">Your browser does not support the audio element.</audio>

	- Sample 1 — Antton (Happy) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_pozik.wav">Your browser does not support the audio element.</audio>

	- Sample 1 — Antton (Sad) — "Suhiltzaileek esan dutenez, hiru hildakoek jauzi egin zuten puzgarria su hartzen ari zela ikusita."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_antton_triste.wav">Your browser does not support the audio element.</audio>

	- Sample 2 — Maider (Neutral) — "Gure patua hau izatea litekeena da, baina okerra deritzot."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_neutral.wav">Your browser does not support the audio element.</audio>

	- Sample 2 — Maider (Happy) — "Gure patua hau izatea litekeena da, baina okerra deritzot."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_pozik.wav">Your browser does not support the audio element.</audio>

	- Sample 2 — Maider (Sad) — "Gure patua hau izatea litekeena da, baina okerra deritzot."

	<audio controls src="https://huggingface.co/HiTZ/StyleTTS2-eu_emo/resolve/main/sample_maider_triste.wav">Your browser does not support the audio element.</audio>

	Main modifications:
	- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): a PL-BERT model trained with a WordPiece tokenizer on phonemized Basque text.
	- ASR-eu: an ASR model trained on a subset of the multispeaker speech corpus. It uses the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) from StyleTTS2.
	- Phonemizer: we used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate the IPA phonemes used for training. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the code itself is in the `phonemizer` directory. Multi-character phonemes were collapsed into single characters for better grapheme–phoneme alignment.
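
The phoneme-collapsing step mentioned in the last item can be illustrated with a minimal sketch. The mapping table and the `collapse_phonemes` helper below are hypothetical and only show the idea; the actual symbols and rules are defined by the code in the `phonemizer` directory and may differ.

```python
# Hypothetical mapping, NOT the table used by the Aholab phonemizer:
# each multi-character IPA sequence is replaced by a single Unicode character.
COLLAPSE_TABLE = {
    "tʃ": "ʧ",
    "ts": "ʦ",
}


def collapse_phonemes(ipa, table=COLLAPSE_TABLE):
    """Replace multi-character phonemes with single-character stand-ins."""
    # Process longer sequences first so a short key never splits a longer one.
    for seq in sorted(table, key=len, reverse=True):
        ipa = ipa.replace(seq, table[seq])
    return ipa


if __name__ == "__main__":
    for word in ("etʃe", "atso"):
        print(word, "->", collapse_phonemes(word))
```

With one character per phoneme, the length of the input string equals the number of phonemes, which is what makes the alignment simpler.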

	## Emotions

	The original [dataset](https://zenodo.org/records/18804769) contains four emotion categories. This model was trained on a subset of three emotions — Neutral, Happy, and Sad — as listed below.

	| Emotion | Basque Tag | Description |
	|---------|------------|-------------|
	| Neutral | `neu` | Neutral/calm delivery |
	| Happy | `poz` | Happy/expressive delivery (Poza) |
	| Sad | `tri` | Sad/contemplative delivery (Tristura) |

	## Model details

	| Property | Value |
	|---|---|
	| Architecture | StyleTTS2 (trained from scratch) |
	| Language | Basque (`eu`) |
	| Speakers | Multispeaker (two speakers: Antton, Maider) |
	| Emotions | Neutral, Happy (Poza), Sad (Tristura) |
	| Text input | Basque IPA phonemes |
	| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
	| Sample rate | 24,000 Hz |
	| Decoder | HiFiGAN |

	## Training dataset

	[HiTZ-Aholab emotional speech synthesis dataset in Basque](https://zenodo.org/records/18804769) — emotional speech corpus.

	- Number of speakers: two (Antton, Maider)
	- Audio: 16,000 utterances per speaker, totalling approximately 43 hours and 58 minutes
	  - Maider: ~21h 22min
	  - Antton: ~22h 36min
	- Emotions: four categories (4,000 utterances per emotion per speaker) — Poza (joy), Haserre (anger), Harridura (surprise), Tristura (sadness)
	- Note: although the dataset contains four emotions, this model was trained on a balanced subset of three: Neutral, Happy (Poza), Sad (Tristura) — with the same number of samples per emotion.
	- Dataset split: 100 samples for validation, 600 for testing (300 per speaker)
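
As a quick sanity check on the figures above, the following sketch adds up the per-speaker durations and derives a rough mean utterance length. It assumes the stated durations cover all 16,000 utterances of each speaker, which the dataset summary implies but does not state explicitly.

```python
from datetime import timedelta

# Per-speaker durations as stated in the dataset summary.
durations = {
    "Maider": timedelta(hours=21, minutes=22),
    "Antton": timedelta(hours=22, minutes=36),
}
utterances_per_speaker = 16_000

total = sum(durations.values(), timedelta())
total_minutes = int(total.total_seconds() // 60)
print(f"total: {total_minutes // 60} h {total_minutes % 60} min")  # total: 43 h 58 min

# Rough mean utterance length (assumption: durations cover every utterance).
mean_seconds = total.total_seconds() / (utterances_per_speaker * len(durations))
print(f"mean utterance length: {mean_seconds:.2f} s")  # mean utterance length: 4.95 s
```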

	## Training

	Brief summary of training parameters used (from `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml`):

	- Device: cuda
	- Stages: 1st-stage epochs = 50; 2nd-stage epochs = 30
	- Batch: batch_size = 1
	- Max length: max_len = 500
	- Learning rates: lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
	- Audio / features: sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
	- Model: multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
	- Diffusion / schedule: diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma ≈ 0.2)
	- Loss highlights: lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0

	## Files in this repository

	| File | Description |
	|---|---|
	| `config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml` | Training and model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml` |
	| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_emo/` |
	| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
	| `step_3580000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |

	> Note: The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.

	## Setup

	```bash
	# 1. Clone the code repository
	git clone https://github.com/AArriandiaga/StyleTTS2_basque
	cd StyleTTS2_basque

	# 2. Install dependencies
	pip install -r requirements.txt

	# 3. Download model weights from this HF repo and place them:
	mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_emo Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
	# Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
	wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7

	# using huggingface_hub:
	python - <<'EOF'
	from huggingface_hub import hf_hub_download
	import shutil

	repo = "HiTZ/StyleTTS2-eu_emo"
	# Map each file in the HF repo to its local destination.
	files = {
	    "config_basque_multispeaker_phoneme_wavlm_emo_plbertemo_no_acc.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
	    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
	    "epoch_00200.pth": "Utils/ASR_basque/epoch_00200.pth",
	    "step_3580000.t7": "Utils/PLBERT_phoneme/step_3580000.t7",
	}
	# bst.t7 comes from the original StyleTTS2 repo, so it is downloaded separately:
	# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
	for hf_name, local_path in files.items():
	    src = hf_hub_download(repo_id=repo, filename=hf_name)
	    shutil.copy(src, local_path)
	    print(f"✓ {local_path}")
	EOF
	```
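
Once the setup steps have run, it can be useful to confirm that every file landed where inference expects it. The following is a minimal sketch, not part of the repository: `missing_files` is a hypothetical helper, and the paths are the destinations used in the setup script above.

```python
from pathlib import Path

# Destinations used by the setup steps, relative to the repository root.
EXPECTED_FILES = [
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml",
    "Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_3580000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root="."):
    """Return the expected files that do not exist under `root`."""
    return [p for p in EXPECTED_FILES if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected files are in place.")
```

Run it from the root of the cloned `StyleTTS2_basque` directory.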

	## Inference

	CLI:
	```bash
	python inference.py \
	  --config Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml \
	  --model Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth \
	  --ref Demo/ref_antton_poz.wav \
	  --text "Kaixo, zelan zaude?" \
	  --output output/kaixo.wav
	```

	Python API:
	```python
	from inference import Synthesizer

	synth = Synthesizer(
	    config='Models/Basque_Multispeaker_Phoneme_wavlm_emo/config.yml',
	    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_emo/epoch_2nd_00030.pth',
	    default_ref='Demo/ref_antton_neu.wav',
	)

	# Neutral emotion
	wav = synth.run("Kaixo, zelan zaude?", ref='Demo/ref_antton_neu.wav')
	synth.save(wav, "output/kaixo_neu.wav")

	# Happy emotion (using poza reference)
	wav2 = synth.run("Zorioneko gara!", ref='Demo/ref_antton_poz.wav')
	synth.save(wav2, "output/kaixo_poz.wav")

	# Sad emotion (using tristura reference)
	wav3 = synth.run("Hau oso tristea da.", ref='Demo/ref_antton_tri.wav')
	synth.save(wav3, "output/kaixo_tri.wav")
	```

	Key parameters for `run()`:

	| Parameter | Default | Description |
	|---|---|---|
	| `ref` | constructor default | Reference WAV for speaker & emotion style |
	| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
	| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
	| `diffusion_steps` | 5 | Quality vs. speed trade-off |
	| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
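
To make the `alpha` and `beta` rows more concrete, here is a minimal sketch of the linear blending used in the upstream StyleTTS2 inference code, where the style vector is split into a timbre half and a prosody half. The `mix_styles` function is hypothetical, and whether this repository's `Synthesizer` follows exactly the same layout is an assumption.

```python
STYLE_DIM = 128  # style_dim from the training config


def mix_styles(sampled, reference, alpha=0.3, beta=0.7, style_dim=STYLE_DIM):
    """Blend a diffusion-sampled style vector with a reference style vector.

    Both inputs are flat sequences of length 2 * style_dim. The first half is
    treated as timbre and the second half as prosody (assumed layout).
    """
    if len(sampled) != 2 * style_dim or len(reference) != 2 * style_dim:
        raise ValueError("expected vectors of length 2 * style_dim")
    # alpha = 0 keeps the reference timbre, alpha = 1 uses the sampled one.
    timbre = [alpha * s + (1 - alpha) * r
              for s, r in zip(sampled[:style_dim], reference[:style_dim])]
    # beta = 0 keeps the reference prosody, beta = 1 uses the sampled one.
    prosody = [beta * s + (1 - beta) * r
               for s, r in zip(sampled[style_dim:], reference[style_dim:])]
    return timbre + prosody
```

With the defaults, timbre stays close to the reference clip while prosody leans toward the style sampled from the text.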

	## Reference speakers

	Six reference audio files are included in the repo under `Demo/`, covering both speakers and all three emotions:

	| Speaker | Neutral | Happy | Sad |
	|---------|---------|-------|-----|
	| Antton (male) | `ref_antton_neu.wav` | `ref_antton_poz.wav` | `ref_antton_tri.wav` |
	| Maider (female) | `ref_maider_neu.wav` | `ref_maider_poz.wav` | `ref_maider_tri.wav` |
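
Because the reference files follow a regular naming pattern, the path for a given speaker and emotion can be built programmatically. The `reference_path` helper below is a hypothetical convenience and not part of the repository; it only encodes the file names listed in the table.

```python
from itertools import product

SPEAKERS = ("antton", "maider")
EMOTION_TAGS = {"neutral": "neu", "happy": "poz", "sad": "tri"}


def reference_path(speaker, emotion, demo_dir="Demo"):
    """Build the path of a bundled reference clip, e.g. Demo/ref_antton_poz.wav."""
    speaker = speaker.lower()
    emotion = emotion.lower()
    if speaker not in SPEAKERS:
        raise ValueError(f"unknown speaker: {speaker!r}")
    if emotion not in EMOTION_TAGS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return f"{demo_dir}/ref_{speaker}_{EMOTION_TAGS[emotion]}.wav"


# All six speaker/emotion combinations covered by the table.
all_refs = [reference_path(s, e) for s, e in product(SPEAKERS, EMOTION_TAGS)]
```

The returned string can be passed as the `ref` argument of `synth.run()`.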

	All credit goes to the authors of StyleTTS2.

	## Citation

	```bibtex
	@inproceedings{li2023styletts2,
	title = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
	author = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
	booktitle = {Advances in Neural Information Processing Systems},
	year = {2023},
	}
	```

	## Additional Information

	### Authors

	- [Ander Arriandiaga](https://huggingface.co/arrandi) — Aholab (HiTZ), EHU
	- [Inmaculada Hernáez Rioja](mailto:inma.hernaez@ehu.eus) — Aholab (HiTZ), EHU

	### Contact

	For further information, please send an email to <inma.hernaez@ehu.eus>.

	### Copyright

	Copyright (c) 2026 by Aholab, HiTZ.

	### License

	[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

	### Funding

	This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.