StyleTTS2 Valenciano
Overview
Model Description
StyleTTS2 Valenciano is an end-to-end text-to-speech model for the Valencian variant of Catalan, adapted from StyleTTS2 and trained on parliamentary recordings from the Corts Valencianes corpus (gplsi/corts-valencianes-asr).
The model follows the original two-stage StyleTTS2 architecture:
- Acoustic backbone (Stage 1): text aligner, acoustic text encoder, pitch extractor, acoustic style encoder, and iSTFTNet decoder. Trained for 100 epochs.
- Prosodic and diffusion modules (Stage 2): prosodic text encoder (our Valencian PL-BERT, gplsi/PL-BERT-va), duration and prosody predictor, prosodic style encoder, style diffusion denoiser, and SLM adversarial training using WavLM. Trained for 100 epochs in three progressive phases.
Features of this model:
- Trained on 89,330 utterances from the Corts Valencianes corpus.
- Input is a string of IPA phonemes, tokenized at the character level over a 178-symbol vocabulary.
- Uses our gplsi/PL-BERT-va as the prosodic text encoder, which provides semantic context to the prosody predictor.
- Audio is generated at 24 kHz.
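As a rough illustration of character-level input, the sketch below maps each character of an IPA string to an integer index. The symbol table is a small hypothetical subset chosen for the example, not the model's actual 178-symbol vocabulary, and `encode_phonemes` and the sample string are illustrative only.

```python
# Hypothetical, truncated symbol table for illustration only; the real
# vocabulary has 178 symbols and is defined in the training code.
PAD = "$"
SYMBOLS = [PAD] + list(" ,.!?") + list("abdefgiklmnoprstuvz") + list("ɛɔəɲʎʃʒˈ")

SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode_phonemes(ipa_text):
    """Map each character to its index, dropping characters not in the table."""
    return [SYMBOL_TO_ID[ch] for ch in ipa_text if ch in SYMBOL_TO_ID]


# Illustrative IPA-like string; one token is produced per character.
token_ids = encode_phonemes("bˈɔn dˈia")
print(len(token_ids), token_ids)
```

Because tokenization is per character, the stress mark and the space each occupy their own position in the sequence.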
Modifications and bug fixes
Several bugs in the original StyleTTS2 codebase were found and fixed during the adaptation. These fixes are essential for reproducible training:
- Checkpoint key mismatch when loading Stage 1 into Stage 2 (GitHub issue yl4579/StyleTTS2#254): the original `load_checkpoint` used `strict=False` and could silently fail to load weights, producing NaN loss from step 0. Fixed by trying `strict=True` first and remapping the `module.` prefix on failure.
- JDC pitch extractor squeeze bug: `squeeze()` without an argument removed the batch dimension when `batch_size == 1` per GPU. Replaced with `squeeze(-1)`.
- PL-BERT 512 position limit: parliamentary sentences with more than 512 phoneme characters crashed the BERT encoder. Affected batches are now skipped.
- Style encoder minimum input size: the style encoder (4× downsample + 5×5 conv) requires at least 80 mel frames. Shorter samples crashed `predictor_encoder`. Affected batches are now skipped; `min_length` raised from 50 to 100 in the config.
- Misuse of `pretrained_model`: pointing `pretrained_model` to `first_stage.pth` causes double-loading and optimizer corruption. Documented and protected against in the training guide.
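The checkpoint-loading fix can be sketched as follows. This is a minimal illustration, not the repository's actual `load_checkpoint`: the function names are hypothetical, and the code only assumes an object with a PyTorch-style `load_state_dict(state_dict, strict=...)` that raises `RuntimeError` on a key mismatch. The `module.` prefix is the one PyTorch's `DataParallel` and `DistributedDataParallel` wrappers add to parameter names.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `module.` removed from keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def load_strict_with_fallback(module, state_dict):
    """Try a strict load; on a key mismatch, retry with remapped keys.

    `module` is any object exposing a PyTorch-style
    `load_state_dict(state_dict, strict=...)`.
    Returns True if the prefix remapping was needed.
    """
    try:
        module.load_state_dict(state_dict, strict=True)
        return False
    except RuntimeError:
        # The retry is also strict, so a genuine mismatch still raises
        # instead of leaving the module partially initialised.
        module.load_state_dict(strip_module_prefix(state_dict), strict=True)
        return True
```

Keeping both attempts strict is the point of the design: a load that would previously have been skipped silently now either succeeds completely or fails loudly.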
Intended Uses and Limitations
Intended uses
- Research and educational use for Catalan/Valencian text-to-speech synthesis.
- As a baseline for further fine-tuning on better-curated multi-speaker Valencian corpora.
- For inference, use the LibriTTS-style notebook that accepts a reference audio (see How to Get Started with the Model).
Limitations
The most important limitation of this model is that it was trained in single-speaker mode on multi-speaker data. The Corts Valencianes corpus contains many different politicians' voices, but all training samples were labelled with speaker_id = 0. As a consequence:
- The style encoder learned an "averaged" style across all speakers.
- The style diffusion denoiser learned a high-variance, inconsistent style distribution.
- Pure text-to-speech inference (without reference audio, using `Inference_LJSpeech.ipynb`) produces noisy and inconsistent voices.
- Reference-based inference (using `StyleTTS2_Demo_LibriTTS.ipynb` with a chosen reference audio) produces noticeably cleaner audio and is the recommended way to use this model.
Other limitations:
- The model is specific to Valencian / Catalan phonology. Using it for other languages requires re-training.
- The training corpus is parliamentary speech, which means the model is biased towards a formal speaking style; conversational or expressive synthesis may be lower quality.
How to Get Started with the Model
Installation
Clone the training repository and install the dependencies:
git clone https://github.com/javimosa/styletts2-valenciano.git
cd styletts2-valenciano
python -m venv .venv && source .venv/bin/activate
pip install -r StyleTTS2/requirements.txt
You also need espeak-ng with support for the ca-va variant. See the project's t3/README.md for instructions.
Download the model and PL-BERT
# StyleTTS2 Valencian checkpoint
huggingface-cli download gplsi/StyleTTS2-va \
Models/LJSpeech/epoch_2nd_00100.pth \
Models/LJSpeech/config.yml \
--local-dir StyleTTS2/
# PL-BERT (used as the prosodic text encoder)
huggingface-cli download gplsi/PL-BERT-va \
step_50000.t7 config.yml util.py \
--local-dir StyleTTS2/Utils/PLBERT/
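A small path check can confirm that both downloads landed under their `--local-dir` targets. The sketch below is a convenience under stated assumptions, not part of the repository: `missing_files` is a hypothetical helper, and the file list is derived directly from the two commands above.

```python
from pathlib import Path

# Relative paths implied by the two download commands above.
EXPECTED_FILES = [
    "Models/LJSpeech/epoch_2nd_00100.pth",
    "Models/LJSpeech/config.yml",
    "Utils/PLBERT/step_50000.t7",
    "Utils/PLBERT/config.yml",
    "Utils/PLBERT/util.py",
]


def missing_files(root="StyleTTS2"):
    """Return the expected files that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    print("All files present." if not missing else f"Missing: {missing}")
```

Run it from the directory that contains `StyleTTS2/`, or pass a different root to `missing_files`.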
Inference
We recommend the reference-audio notebook:
jupyter notebook StyleTTS2/Colab/StyleTTS2_Demo_LibriTTS.ipynb
Provide a reference audio of the target voice and the Valencian text to synthesize. See the project repository for full inference details.
The pure text-to-speech notebook (StyleTTS2/Demo/Inference_LJSpeech.ipynb) is also available.
Training Details
Training data
The model was trained on the gplsi/corts-valencianes-asr dataset (a parquet conversion of the public Corts Valencianes ASR corpus, originally released by Projecte Aina at the BSC).
| Split | Samples used |
|---|---|
| Train | 89,330 (combined clean_train_short + other_train_short, length-filtered to 1–15 s) |
| Validation | 2,791 |
Audio was resampled from the original 16 kHz to 24 kHz to match the StyleTTS2 pipeline. Texts were phonemized at the sentence level (preserving inter-word phonetic effects) using phonemizer with espeak-ng and the ca-va variant.
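The length filter and the single speaker label can be sketched as below. This is an illustration under assumptions rather than the project's preprocessing script: the helper names are hypothetical, the filter bounds are treated as inclusive, and the `path|phonemes|speaker` line layout follows the data-list format documented by upstream StyleTTS2. Phonemization and resampling are assumed to have happened already.

```python
MIN_SECONDS = 1.0
MAX_SECONDS = 15.0
SPEAKER_ID = 0  # every utterance shares one label in this model


def keep_utterance(duration_seconds):
    """Apply the 1-15 s length filter (bounds assumed inclusive)."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


def to_list_line(wav_path, phonemes, speaker_id=SPEAKER_ID):
    """Format one entry as `path|phonemes|speaker`."""
    if "|" in phonemes:
        raise ValueError("phoneme string must not contain the field separator")
    return f"{wav_path}|{phonemes}|{speaker_id}"


def build_list(records):
    """records: iterable of (wav_path, phonemes, duration_seconds) tuples."""
    return [
        to_list_line(path, phonemes)
        for path, phonemes, duration in records
        if keep_utterance(duration)
    ]
```

For example, `build_list([("a.wav", "bˈɔn dˈia", 2.5), ("b.wav", "x", 0.4)])` keeps only the first record, since the second is shorter than one second.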
Training configuration
Both stages were trained using accelerate with mixed precision (fp16) on 4 × NVIDIA A100 64 GB GPUs.
Stage 1 — acoustic backbone (train_first.py):
- Epochs: 100
- Batch size: 8
- Max mel length: 512 frames
- TMA epoch (alignment refinement): 25
- Optimizer: AdamW, learning rate 1e-4
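To relate frame counts to audio duration, the following back-of-the-envelope sketch converts mel frames to seconds. The 24 kHz sample rate comes from this card; the hop length of 300 samples is an assumption (it is the value used in upstream StyleTTS2's mel extraction) and should be checked against the actual preprocessing code.

```python
SAMPLE_RATE = 24_000  # Hz, as stated in this card
HOP_LENGTH = 300      # samples per mel frame: an assumption, not stated here


def frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate duration covered by `n_frames` mel frames."""
    return n_frames * hop_length / sample_rate


print(frames_to_seconds(512))  # max mel length used in Stage 1
print(frames_to_seconds(80))   # minimum style-encoder input
```

Under that assumption, 512 frames correspond to about 6.4 s of audio and the 80-frame style-encoder minimum to about 1 s.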
Stage 2 — prosodic and diffusion modules (train_second.py):
- Epochs: 100, in three phases:
  - Phase 1 (epochs 0 → 20): predictor, BERT, bert_encoder, predictor_encoder
  - Phase 2 (epochs 20 → 50): + diffusion
  - Phase 3 (epochs 50 → 100): + decoder and style_encoder fine-tuning + SLM adversarial (WavLM)
- BERT learning rate: 1e-5 (already pre-trained)
- Other learning rates: 1e-4
- Single-speaker mode (`multispeaker: false`)
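The progressive Stage 2 schedule can be summarised as a function of the epoch. This is only a sketch: `active_components` and the constant names are hypothetical, and assigning the boundary epochs 20 and 50 to the later phase is an assumption about how the ranges above are meant.

```python
DIFF_EPOCH = 20   # diffusion training starts (Phase 2)
JOINT_EPOCH = 50  # decoder/style fine-tuning and SLM adversarial start (Phase 3)
TOTAL_EPOCHS = 100


def active_components(epoch):
    """Return the phase number and the components trained at `epoch`."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS})")
    components = ["predictor", "bert", "bert_encoder", "predictor_encoder"]
    phase = 1
    if epoch >= DIFF_EPOCH:
        components.append("diffusion")
        phase = 2
    if epoch >= JOINT_EPOCH:
        components += ["decoder", "style_encoder", "slm_adversarial"]
        phase = 3
    return phase, components
```

Each phase adds to the previous one rather than replacing it, so the Phase 1 components keep training through epoch 99.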
The full configuration is in Models/LJSpeech/config.yml inside this repository.
Model parameters
| Parameter | Value |
|---|---|
| Phoneme vocabulary (`n_token`) | 178 |
| Hidden dimension | 512 |
| Style dimension | 128 |
| Mel channels | 80 |
| Sample rate | 24 kHz |
| Decoder | iSTFTNet |
| Diffusion transformer | 3 layers, 8 heads |
| SLM | microsoft/wavlm-base-plus |
Evaluation
The model was evaluated qualitatively by synthesizing a set of test sentences in Valencian and having native Valencian speakers assess the naturalness of the generated audio. Inference was performed with the reference-audio notebook (StyleTTS2_Demo_LibriTTS.ipynb), which is the recommended way to use this model (see Limitations).
The native-speaker assessment confirms that:
- The pronunciation is recognizably Valencian, with the expected vowel realizations and word-level stress patterns.
- The main quality issue reported is background noise in the generated audio, particularly at the beginning and end of utterances. This noise is more pronounced when generating without a reference audio, and is attributable to the single-speaker / multi-speaker training mismatch described above. Pronunciation and naturalness themselves are not affected.
The model has not yet been benchmarked with formal subjective (MOS) or objective (WER, CER, ScoreQ) metrics. A future iteration trained with proper per-speaker labelling of the corpus is expected to improve quality and stability.
Citation
If this model contributes to your research, please cite the work:
@misc{gplsi2026styletts2valenciano,
title={StyleTTS2 Valenciano},
author={GPLSI Group, Universidad de Alicante},
url={https://huggingface.co/gplsi/StyleTTS2-va},
year={2026}
}
We also recommend citing the original StyleTTS2 and PL-BERT papers:
@inproceedings{li2023styletts2,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}
@misc{li2023plbert,
title={Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions},
author={Yinghao Aaron Li and Cong Han and Xilin Jiang and Nima Mesgarani},
year={2023},
eprint={2301.08810},
archivePrefix={arXiv}
}
Additional Information
Author
The GPLSI research group at the University of Alicante.
Contact
For further information, please open an issue in the project repository.
Copyright
Copyright (c) 2026 GPLSI, Universidad de Alicante.