Initial upload: StyleTTS2 Basque multispeaker model

Files changed (8)

.gitattributes +4 -2
README.md +194 -3
config_basque_multispeaker_phoneme_wavlm.yml +123 -0
epoch_00200.pth +3 -0
epoch_2nd_00030.pth +3 -0
sample_antton.wav +3 -0
sample_maider.wav +3 -0
step_4000000.t7 +3 -0

.gitattributes CHANGED

@@ -23,13 +23,15 @@
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
+*.t7 filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
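
The `.gitattributes` change above adds `*.t7` and `*.wav` to the LFS-tracked patterns. As a rough sanity check, the sketch below tests which files from this commit would be matched. It assumes that Python's `fnmatch` is a close enough stand-in for gitattributes matching on simple basename patterns like these; it is not a full implementation of Git's pattern rules, and `tracked_by_lfs` is a hypothetical helper name.

```python
from fnmatch import fnmatch

# A subset of the LFS patterns relevant to this commit (taken from .gitattributes).
LFS_PATTERNS = ["*.pth", "*.t7", "*.wav", "*.tar.*"]

# Files listed under "Files changed" in this commit.
FILES = [
    ".gitattributes",
    "README.md",
    "config_basque_multispeaker_phoneme_wavlm.yml",
    "epoch_00200.pth",
    "epoch_2nd_00030.pth",
    "sample_antton.wav",
    "sample_maider.wav",
    "step_4000000.t7",
]


def tracked_by_lfs(name, patterns=LFS_PATTERNS):
    """Return True if the file name matches any of the simple LFS patterns."""
    return any(fnmatch(name, pattern) for pattern in patterns)


lfs_files = [name for name in FILES if tracked_by_lfs(name)]
print(lfs_files)
```

The five matched files are the same ones that appear later in the diff as three-line LFS pointers.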

README.md CHANGED

@@ -1,3 +1,194 @@
----
-license: apache-2.0
----

+---
+language: eu
+license: mit
+tags:
+  - text-to-speech
+  - basque
+  - styletts2
+  - multispeaker
+---
+# StyleTTS2 — Basque Multispeaker TTS
+This is a Basque text-to-speech (TTS) model based on the [StyleTTS2](https://github.com/yl4579/StyleTTS2) architecture and adapted for Basque speech synthesis. It produces good-quality Basque speech; the samples below illustrate this. The model was trained from scratch on the multispeaker Basque [Sonora](https://zenodo.org/records/17952596) speech corpus.
+Examples (playable):
+- **Sample 1** — "Cesare Pavese XXI. mendeko idazle italiar esanguratzuenetakoa da."
+  <audio controls src="sample_antton.wav">Your browser does not support the audio element.</audio>
+- **Sample 2** — "Herriko errekan bakarrik korrika."
+  <audio controls src="sample_maider.wav">Your browser does not support the audio element.</audio>
+Main modifications:
+- [PL-BERT-eu](https://huggingface.co/HiTZ/PL-BERT-wp-eu): PL-BERT model trained with WordPiece tokenizer for phonemized Basque text.
+- ASR-eu: ASR model trained on a subset of the multispeaker speech corpus, with the same architecture as the original [ASR](https://github.com/yl4579/AuxiliaryASR) used by StyleTTS2.
+- Phonemizer: We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Likewise, the code used to generate IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
+## Model details
+| | |
+|---|---|
+| Architecture | StyleTTS2 (from scratch) |
+| Language | Basque (`eu`) |
+| Speakers | Multispeaker (two speakers) |
+| Text input | Basque IPA phonemes |
+| Speech LM | [WavLM-Base-Plus](https://huggingface.co/microsoft/wavlm-base-plus) |
+| Sample rate | 24 000 Hz |
+| Decoder | HiFiGAN |
+## Training dataset
+[Sonora](https://zenodo.org/records/17952596) multispeaker Basque speech dataset.
+- Number of speakers: two
+- Audio available: 13,500 utterances per speaker, for a total of 34 hours and 18 minutes.
+- Dataset split: 100 samples for validation and 500 for testing.
+- OOD dataset: text from a different dataset is used as the out-of-distribution (OOD) set.
+## Training
+Summary of the main training parameters (from `config_basque_multispeaker_phoneme_wavlm.yml`):
+- **Device:** cuda
+- **Stages:** 1st-stage epochs = 50; 2nd-stage epochs = 30
+- **Batch:** batch_size = 2
+- **Max length:** max_len = 500
+- **Learning rates:** lr = 0.0001; bert_lr = 1e-5; ft_lr = 1e-5
+- **Audio / features:** sr = 24000; n_mels = 80; spectrogram (n_fft=2048, win_length=1200, hop_length=300)
+- **Model:** multispeaker = true; n_token = 178 (phonemes); style_dim = 128; decoder = HiFiGAN
+- **Diffusion / schedule:** diff_epoch = 10; joint_epoch = 15; estimate_sigma_data = true (sigma_data = 0.2 is only the placeholder used when estimation is disabled)
+- **Loss highlights:** lambda_mel = 5.0; lambda_ce = 20.0; lambda_diff = 1.0
+## Files in this repository
+| File | Description |
+|---|---|
+| `config_basque_multispeaker_phoneme_wavlm.yml` | Training & model config → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` (the commands below expect it to be named `config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml`) |
+| `epoch_2nd_00030.pth` | Main TTS checkpoint → place at `Models/Basque_Multispeaker_Phoneme_wavlm_normal/` |
+| `epoch_00200.pth` | Basque ASR / text aligner → place at `Utils/ASR_basque/` |
+| `step_4000000.t7` | Phoneme PLBERT → place at `Utils/PLBERT_phoneme/` |
+> **Note:** The JDC F0 extractor (`Utils/JDC/bst.t7`) is not Basque-specific — download it from the original [StyleTTS2 repository](https://github.com/yl4579/StyleTTS2) and place it at `Utils/JDC/bst.t7`.
+## Setup
+```bash
+# 1. Clone the code repository
+git clone https://github.com/AArriandiaga/StyleTTS2_basque
+cd StyleTTS2_basque
+# 2. Install dependencies
+pip install -r requirements.txt
+# 3. Download model weights from this HF repo and place them:
+mkdir -p Models/Basque_Multispeaker_Phoneme_wavlm_normal Utils/ASR_basque Utils/PLBERT_phoneme Utils/JDC
+# Download bst.t7 from the original StyleTTS2 repo (not Basque-specific):
+wget -P Utils/JDC https://github.com/yl4579/StyleTTS2/raw/main/Utils/JDC/bst.t7
+# using huggingface_hub:
+python - <<'EOF'
+from huggingface_hub import hf_hub_download
+import shutil
+repo = "HiTZ/styletts2-basque"
+files = {
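+    # The config is stored in this repo as config_basque_multispeaker_phoneme_wavlm.yml;
+    # it is copied under the name that the inference commands expect.
+    "config_basque_multispeaker_phoneme_wavlm.yml": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",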
+    "epoch_2nd_00030.pth": "Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth",
+    "epoch_00200.pth":     "Utils/ASR_basque/epoch_00200.pth",
+    "step_4000000.t7":     "Utils/PLBERT_phoneme/step_4000000.t7",
+}
+# bst.t7 comes from the original StyleTTS2 repo — download separately:
+# https://github.com/yl4579/StyleTTS2/tree/main/Utils/JDC
+for hf_name, local_path in files.items():
+    src = hf_hub_download(repo_id=repo, filename=hf_name)
+    shutil.copy(src, local_path)
+    print(f"✓ {local_path}")
+EOF
+```
+## Inference
+**CLI:**
+```bash
+python inference.py \
+    --config  Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml \
+    --model   Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth \
+    --ref     Demo/ref_antton.wav \
+    --text    "Kaixo, zelan zaude?" \
+    --output  output/kaixo.wav
+```
+**Python API:**
+```python
+from inference import Synthesizer
+synth = Synthesizer(
+    config='Models/Basque_Multispeaker_Phoneme_wavlm_normal/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml',
+    checkpoint='Models/Basque_Multispeaker_Phoneme_wavlm_normal/epoch_2nd_00030.pth',
+    default_ref='Demo/ref_antton.wav',
+)
+wav = synth.run("Kaixo, zelan zaude?")
+synth.save(wav, "output/kaixo.wav")
+# Different speaker
+wav2 = synth.run("Arratsalde on!", ref='Demo/ref_maider.wav')
+synth.save(wav2, "output/arratsalde.wav")
+```
+Key parameters for `run()`:
+| Parameter | Default | Description |
+|---|---|---|
+| `ref` | constructor default | Reference WAV for speaker style |
+| `alpha` | 0.3 | Timbre mixing (0 = reference, 1 = sampled) |
+| `beta` | 0.7 | Prosody mixing (0 = reference, 1 = sampled) |
+| `diffusion_steps` | 5 | Quality vs. speed trade-off |
+| `embedding_scale` | 1.0 | Expressiveness (>1 = more expressive) |
+## Reference speakers
+Two reference audio files are included in the repo under `Demo/`:
+- `ref_antton.wav` — male speaker
+- `ref_maider.wav` — female speaker
+All credit goes to the authors of StyleTTS2.
+## Citation
+```bibtex
+@inproceedings{li2023styletts2,
+  title     = {StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
+  author    = {Li, Yinghao Aaron and Han, Cong and Mesgarani, Nima},
+  booktitle = {Advances in Neural Information Processing Systems},
+  year      = {2023},
+}
+```
+## Additional Information
+### Author
+Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (HiTZ), EHU
+### Contact
+For further information, please send an email to <inma.hernaez@ehu.eus>.
+### Copyright
+Copyright (c) 2026 by Aholab, HiTZ.
+### License
+[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
+### Funding
+This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
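
The README's setup steps place four downloaded files plus `bst.t7` at fixed paths. A minimal sketch of a pre-flight check for that layout follows, assuming the paths given in the README's setup script. `missing_files` and `EXPECTED` are hypothetical names for illustration and are not part of the StyleTTS2_basque code.

```python
from pathlib import Path

MODEL_DIR = "Models/Basque_Multispeaker_Phoneme_wavlm_normal"

# Destination paths used by the README's setup script and inference commands.
EXPECTED = [
    f"{MODEL_DIR}/config_basque_multispeaker_phoneme_wavlm_800_2nd_normal.yml",
    f"{MODEL_DIR}/epoch_2nd_00030.pth",
    "Utils/ASR_basque/epoch_00200.pth",
    "Utils/PLBERT_phoneme/step_4000000.t7",
    "Utils/JDC/bst.t7",
]


def missing_files(root="."):
    """Return the expected paths (relative to root) that are not regular files."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

Calling `missing_files()` from the root of the cloned code repository returns an empty list once every file is in place. Any path it returns still needs to be downloaded or moved.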

config_basque_multispeaker_phoneme_wavlm.yml ADDED

	@@ -0,0 +1,123 @@

+log_dir: "Models/Basque_Multispeaker_Phoneme_wavlm_normal"
+first_stage_path: "first_stage.pth"
+save_freq: 1
+log_interval: 10
+device: "cuda"
+epochs_1st: 50 # Standard schedule like original config.yml
+epochs_2nd: 30 # Standard schedule like original config.yml
+batch_size: 2   # MEMORY OPTIMIZATION
+max_len: 500
+pretrained_model: ""
+second_stage_load_pretrained: false
+load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters
+F0_path: "Utils/JDC/bst.t7"
+ASR_config: "Utils/ASR_basque/config.yml"
+ASR_path: "Utils/ASR_basque/epoch_00200.pth"
+ASR_module: "ASR_basque"
+PLBERT_dir: 'Utils/PLBERT_phoneme/'
+# Wandb configuration
+wandb:
+  project: "StyleTTS2-Basque"
+  group: "basque_multispeaker_phoneme_albert_wavlm_800"
+  tags: ["basque", "multispeaker", "phoneme", "albert", "wavlm", "max_len_800"]
+  notes: "Multispeaker config: AlBERT-phoneme + WavLM + short + max_len=800 (10s)"
+data_params:
+  train_data: "Data/train_list_multispeaker.cleaned.txt"
+  val_data: "Data/val_list_multispeaker.cleaned.txt"  # use the multispeaker validation split (<=8s recommended)
+  test_data: "Data/test_list_multispeaker.cleaned.txt"
+  root_path: "/scratch/anderarrigandiaga/data/tts/eu/sonora/"
+  OOD_data: "Data/OOD_eu.cleaned.txt"  # optional OOD set (kept as example)
+  min_length: 50 # sample until texts with this size are obtained for OOD texts
+preprocess_params:
+  sr: 24000
+  spect_params:
+    n_fft: 2048
+    win_length: 1200
+    hop_length: 300
+model_params:
+  multispeaker: true
+  dim_in: 64
+  hidden_dim: 512
+  max_conv_dim: 512
+  n_layer: 3
+  n_mels: 80
+  n_token: 178 # number of phoneme tokens
+  max_dur: 50 # maximum duration of a single phoneme
+  style_dim: 128 # style vector size
+  dropout: 0.2
+  # config for decoder
+  decoder:
+      type: 'hifigan' # either hifigan or istftnet
+      resblock_kernel_sizes: [3,7,11]
+      upsample_rates: [10,5,3,2]
+      upsample_initial_channel: 512
+      resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
+      upsample_kernel_sizes: [20,10,6,4]
+  # speech language model config
+  slm:
+      model: 'microsoft/wavlm-base-plus'
+      sr: 16000 # sampling rate of SLM
+      hidden: 768 # hidden size of SLM
+      nlayers: 13 # number of layers of SLM
+      initial_channel: 64 # initial channels of SLM discriminator head
+  # style diffusion model config
+  diffusion:
+    embedding_mask_proba: 0.1
+    # transformer config
+    transformer:
+      num_layers: 3
+      num_heads: 8
+      head_features: 64
+      multiplier: 2
+    # diffusion distribution config
+    dist:
+      sigma_data: 0.2 # placeholder for estimate_sigma_data set to false
+      estimate_sigma_data: true # estimate sigma_data from the current batch if set to true
+      mean: -3.0
+      std: 1.0
+loss_params:
+    lambda_mel: 5. # mel reconstruction loss
+    lambda_gen: 1. # generator loss
+    lambda_slm: 1. # slm feature matching loss
+    lambda_mono: 1. # monotonic alignment loss (1st stage, TMA)
+    lambda_s2s: 1. # sequence-to-sequence loss (1st stage, TMA)
+    TMA_epoch: 5 # TMA starting epoch (1st stage)
+    lambda_F0: 1. # F0 reconstruction loss (2nd stage)
+    lambda_norm: 1. # norm reconstruction loss (2nd stage)
+    lambda_dur: 1. # duration loss (2nd stage)
+    lambda_ce: 20. # duration predictor probability output CE loss (2nd stage)
+    lambda_sty: 1. # style reconstruction loss (2nd stage)
+    lambda_diff: 1. # score matching loss (2nd stage)
+    diff_epoch: 10 # style diffusion starting epoch (2nd stage)
+    joint_epoch: 15 # joint training starting epoch (2nd stage)
+optimizer_params:
+  lr: 0.0001 # general learning rate
+  bert_lr: 0.00001 # learning rate for PLBERT
+  ft_lr: 0.00001 # learning rate for acoustic modules
+slmadv_params:
+  min_len: 400 # minimum length of samples
+  max_len: 500 # maximum length of samples
+  batch_percentage: 0.5 # to prevent out of memory, only use half of the original batch size
+  iter: 20 # update the discriminator every this iterations of generator update
+  thresh: 5 # gradient norm above which the gradient is scaled
+  scale: 0.01 # gradient scaling factor for predictors from SLM discriminators
+  sig: 1.5 # sigma for differentiable duration modeling
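
Some values in this config can be cross-checked with simple arithmetic. The sketch below hard-codes them from the diff rather than parsing the YAML. It assumes that `max_len` and the `slmadv_params` lengths are measured in mel frames, which matches the wandb note equating `max_len=800` with 10 s. It also assumes that the product of the decoder's `upsample_rates` is expected to equal `hop_length`.

```python
from math import prod

# Values copied from the config above.
sr = 24000
hop_length = 300
max_len = 500            # assumed to be in mel frames
slm_min_len = 400        # slmadv_params.min_len, assumed to be in mel frames
upsample_rates = [10, 5, 3, 2]

frames_per_second = sr / hop_length                   # 80.0
max_clip_seconds = max_len / frames_per_second        # 6.25
slm_min_seconds = slm_min_len / frames_per_second     # 5.0
total_upsampling = prod(upsample_rates)               # 300

print(f"{frames_per_second:.0f} frames/s, max clip {max_clip_seconds} s")
print(f"decoder upsampling {total_upsampling}x, hop_length {hop_length}")
```

Under these assumptions, training clips are capped at 6.25 s and the decoder's total upsampling factor of 300 matches the hop length.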

epoch_00200.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9df6632c8b1f7dd628696bf6326005422bfa5c4c49a74de5c59369fe7bf34056
+size 94573449

epoch_2nd_00030.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fb56c2bf275f9c60a052cc71412c1cc0752a0c2b744bbc9bae6a77e0a47c6f6c
+size 2135548572

sample_antton.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:909d09a2a8454ff0a065f544f5307904eb3d72b993cdb2c55a67da129f94f6af
+size 265144

sample_maider.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1e4686b404895f174052859f55b6a4184fc9442c469b594782d39a76b1ba48bf
+size 129544

step_4000000.t7 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cd5f5e669db09e598da990fe4e8897128bd8f7ffa15b877151b15b7521565d4a
+size 533867882
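
Each binary file above is stored as a Git LFS pointer that records a SHA-256 digest and a byte size. The sketch below is one possible way to check a downloaded file against its pointer. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the pointer text is copied from the `sample_maider.wav` entry in this diff.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file's size and digest both match the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte checkpoints are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1e4686b404895f174052859f55b6a4184fc9442c469b594782d39a76b1ba48bf
size 129544
"""

pointer = parse_lfs_pointer(POINTER)
```

With the file downloaded, `matches_pointer("sample_maider.wav", pointer)` should return `True` if the download is intact. The size check runs first, so a truncated download is rejected without hashing.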