---
license: apache-2.0
language:
- ca
datasets:
- gplsi/corts-valencianes-asr
tags:
- TTS
- StyleTTS2
- valenciano
- catalan
- styletts2
- gplsi
---
# StyleTTS2 Valenciano
## Overview
<details>
<summary>Click to expand</summary>
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Additional Information](#additional-information)
</details>
---
## Model Description
**StyleTTS2 Valenciano** is an end-to-end text-to-speech model for the **Valencian variant of Catalan**, adapted from [StyleTTS2](https://github.com/yl4579/StyleTTS2) and trained on parliamentary recordings from the Corts Valencianes corpus ([gplsi/corts-valencianes-asr](https://huggingface.co/datasets/gplsi/corts-valencianes-asr)).
The model follows the original two-stage StyleTTS2 architecture:
1. **Acoustic backbone** (Stage 1): text aligner, acoustic text encoder, pitch extractor, acoustic style encoder, and iSTFTNet decoder. Trained for 100 epochs.
2. **Prosodic and diffusion modules** (Stage 2): prosodic text encoder (our Valencian PL-BERT, [gplsi/PL-BERT-va](https://huggingface.co/gplsi/PL-BERT-va)), duration and prosody predictor, prosodic style encoder, style diffusion denoiser, and SLM adversarial training using WavLM. Trained for 100 epochs in three progressive phases.
Features of this model:
- Trained on **89,330 utterances** from the Corts Valencianes corpus.
- Input is an IPA **phoneme string**, tokenized at the character level over a 178-symbol vocabulary.
- Uses our [gplsi/PL-BERT-va](https://huggingface.co/gplsi/PL-BERT-va) as the prosodic text encoder, which provides semantic context to the prosody predictor.
- Audio is generated at **24 kHz**.
### Modifications and bug fixes
Several bugs in the original StyleTTS2 codebase were found and fixed during the adaptation. These fixes are essential for reproducible training:
1. **Checkpoint key mismatch when loading Stage 1 into Stage 2** (GitHub issue [yl4579/StyleTTS2#254](https://github.com/yl4579/StyleTTS2/issues/254)): the original `load_checkpoint` used `strict=False` and could silently fail to load weights, producing NaN loss from step 0. Fixed by trying `strict=True` first and remapping the `module.` prefix on failure.
2. **JDC pitch extractor squeeze bug**: `squeeze()` without an argument removed the batch dimension when `batch_size == 1` per GPU. Replaced with `squeeze(-1)`.
3. **PL-BERT 512 position limit**: parliamentary sentences longer than 512 phoneme characters crashed the BERT encoder. Affected batches are now skipped.
4. **Style encoder minimum input size**: the style encoder (4× downsample + 5×5 conv) requires at least 80 mel frames. Shorter samples crashed `predictor_encoder`. Affected batches are now skipped; `min_length` raised from 50 to 100 in the config.
5. **Misuse of `pretrained_model`**: pointing `pretrained_model` to `first_stage.pth` causes double-loading and optimizer corruption. Documented and protected against in the training guide.
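Fixes 1, 3, and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the helper names (`strip_module_prefix`, `load_strict_with_remap`, `should_skip_batch`) are hypothetical, `model` is assumed to expose a PyTorch-style `load_state_dict` that raises `RuntimeError` on a key mismatch, and the thresholds are the 512-position and 80-frame limits stated above.

```python
from collections import OrderedDict

MAX_BERT_POSITIONS = 512  # PL-BERT position limit (fix 3)
MIN_MEL_FRAMES = 80       # style encoder minimum input (fix 4)


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading 'module.' removed from each key."""
    prefix = "module."
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


def load_strict_with_remap(model, state_dict):
    """Try a strict load first; on a key mismatch, retry without the 'module.' prefix.

    `model` is any object with load_state_dict(state_dict, strict=...), assumed
    to raise RuntimeError on mismatched keys, as torch.nn.Module does.
    """
    try:
        model.load_state_dict(state_dict, strict=True)
    except RuntimeError:
        # A second failure propagates, so a bad checkpoint cannot load silently.
        model.load_state_dict(strip_module_prefix(state_dict), strict=True)


def should_skip_batch(phoneme_lengths, mel_lengths):
    """True if any sample overflows PL-BERT or is too short for the style encoder."""
    too_long = any(n > MAX_BERT_POSITIONS for n in phoneme_lengths)
    too_short = any(n < MIN_MEL_FRAMES for n in mel_lengths)
    return too_long or too_short
```

Keeping both load attempts strict means a genuinely incompatible checkpoint still fails loudly instead of leaving modules at their random initialization.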
---
## Intended Uses and Limitations
### Intended uses
- Research and educational use for Catalan/Valencian text-to-speech synthesis.
- As a baseline for further fine-tuning on better-curated multi-speaker Valencian corpora.
- For inference, use the LibriTTS-style notebook that accepts a reference audio (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).
### Limitations
The most important limitation of this model is that it was **trained as a single-speaker model on multi-speaker data**. The Corts Valencianes corpus contains many different politicians' voices, but all training samples were labelled with `speaker_id = 0`. As a consequence:
- The style encoder learned an "averaged" style across all speakers.
- The style diffusion denoiser learned a high-variance, inconsistent style distribution.
- **Pure text-to-speech inference** (without reference audio, using `Inference_LJSpeech.ipynb`) produces noisy and inconsistent voices.
- **Reference-based inference** (using `StyleTTS2_Demo_LibriTTS.ipynb` with a chosen reference audio) produces noticeably cleaner audio and is the **recommended way to use this model**.
Other limitations:
- The model is specific to Valencian / Catalan phonology. Using it for other languages requires re-training.
- The training corpus is parliamentary speech, which means the model is biased towards a formal speaking style; conversational or expressive synthesis may be lower quality.
---
## How to Get Started with the Model
### Installation
Clone the training repository and install the dependencies:
```bash
git clone https://github.com/javimosa/styletts2-valenciano.git
cd styletts2-valenciano
python -m venv .venv && source .venv/bin/activate
pip install -r StyleTTS2/requirements.txt
```
You also need `espeak-ng` with support for the `ca-va` variant. See the project's [t3/README.md](https://github.com/javimosa/styletts2-valenciano/blob/main/t3/README.md) for instructions.
### Download the model and PL-BERT
```bash
# StyleTTS2 Valenciano checkpoint
huggingface-cli download gplsi/StyleTTS2-va \
  Models/LJSpeech/epoch_2nd_00100.pth \
  Models/LJSpeech/config.yml \
  --local-dir StyleTTS2/

# PL-BERT (used as the prosodic text encoder)
huggingface-cli download gplsi/PL-BERT-va \
  step_50000.t7 config.yml util.py \
  --local-dir StyleTTS2/Utils/PLBERT/
```
```
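After downloading, a quick sanity check can confirm that everything landed where the notebooks expect it. The sketch below is an optional helper, not part of the project: `missing_files` is a hypothetical name, and the paths assume that `--local-dir` keeps each file at its repository-relative path, as implied by the commands above.

```python
from pathlib import Path

# Relative paths implied by the two download commands above.
EXPECTED_FILES = (
    "Models/LJSpeech/epoch_2nd_00100.pth",
    "Models/LJSpeech/config.yml",
    "Utils/PLBERT/step_50000.t7",
    "Utils/PLBERT/config.yml",
    "Utils/PLBERT/util.py",
)


def missing_files(root="StyleTTS2"):
    """Return the expected files that are absent (or not regular files) under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All checkpoint and PL-BERT files are in place.")
```

Run it from the root of the cloned repository, so that the default `StyleTTS2` directory resolves correctly.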
### Inference
We recommend the reference-audio notebook:
```bash
jupyter notebook StyleTTS2/Colab/StyleTTS2_Demo_LibriTTS.ipynb
```
Provide a reference audio of the target voice and the Valencian text to synthesize. See the project repository for full inference details.
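The pure text-to-speech notebook ([`StyleTTS2/Demo/Inference_LJSpeech.ipynb`](https://github.com/javimosa/styletts2-valenciano/blob/main/StyleTTS2/Demo/Inference_LJSpeech.ipynb)) is also available, but without a reference audio it produces noisier, less consistent voices (see [Limitations](#limitations)).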
---
## Training Details
### Training data
The model was trained on the [`gplsi/corts-valencianes-asr`](https://huggingface.co/datasets/gplsi/corts-valencianes-asr) dataset (a parquet conversion of the public Corts Valencianes ASR corpus, originally released by Projecte Aina at the BSC).
| Split | Samples used |
|-------|--------------|
| Train | 89,330 (combined `clean_train_short` + `other_train_short`, length-filtered to 1–15 s) |
| Validation | 2,791 |
Audio was resampled from the original 16 kHz to 24 kHz to match the StyleTTS2 pipeline. Texts were phonemized at the sentence level (preserving inter-word phonetic effects) using `phonemizer` with `espeak-ng` and the `ca-va` variant.
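The length filter and the sample-rate change amount to simple arithmetic, sketched below. This is an illustration only, not the project's preprocessing code: the function names are hypothetical, the 1–15 s bounds are assumed to be inclusive, and the exact output length of a real resampler may differ by a sample or so.

```python
from fractions import Fraction

SOURCE_SR = 16_000  # original corpus sample rate (Hz)
TARGET_SR = 24_000  # StyleTTS2 pipeline sample rate (Hz)
MIN_SECONDS, MAX_SECONDS = 1.0, 15.0  # length filter used for the train split


def keep_utterance(num_samples, sample_rate=SOURCE_SR):
    """True if the clip duration lies in the 1-15 s window (bounds assumed inclusive)."""
    duration = num_samples / sample_rate
    return MIN_SECONDS <= duration <= MAX_SECONDS


def resampled_length(num_samples, source_sr=SOURCE_SR, target_sr=TARGET_SR):
    """Approximate sample count after resampling; 16 kHz to 24 kHz is an exact 3/2 ratio."""
    ratio = Fraction(target_sr, source_sr)
    return int(round(num_samples * ratio))
```

For example, a 2-second clip of 32,000 samples at 16 kHz passes the filter and becomes roughly 48,000 samples at 24 kHz.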
### Training configuration
Both stages were trained using `accelerate` with mixed precision (fp16) on 4 × NVIDIA A100 64 GB GPUs.
**Stage 1 — acoustic backbone** (`train_first.py`):
- Epochs: 100
- Batch size: 8
- Max mel length: 512 frames
- TMA epoch (alignment refinement): 25
- Optimizer: AdamW, learning rate 1e-4
**Stage 2 — prosodic and diffusion modules** (`train_second.py`):
- Epochs: 100 (three phases)
  - Phase 1 (epochs 0 → 20): predictor, BERT, bert_encoder, predictor_encoder
  - Phase 2 (epochs 20 → 50): + diffusion
  - Phase 3 (epochs 50 → 100): + decoder and style_encoder fine-tuning + SLM adversarial (WavLM)
- BERT learning rate: 1e-5 (already pre-trained)
- Other learning rates: 1e-4
- Single-speaker mode (`multispeaker: false`)
The full configuration is in `Models/LJSpeech/config.yml` inside this repository.
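The three-phase Stage 2 schedule can be summarized in a few lines. The sketch below is illustrative rather than taken from `train_second.py`: the constant and function names are hypothetical, epochs are assumed to be zero-based, and each phase boundary is assumed to be inclusive at its start (so diffusion is active from epoch 20 and joint training from epoch 50).

```python
DIFF_EPOCH = 20    # diffusion training starts here (Phase 2)
JOINT_EPOCH = 50   # decoder/style_encoder fine-tuning and SLM adversarial start here (Phase 3)
TOTAL_EPOCHS = 100

PHASE_1_COMPONENTS = ("predictor", "bert", "bert_encoder", "predictor_encoder")


def active_components(epoch):
    """Return the components being trained at a given zero-based epoch."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS}), got {epoch}")
    components = list(PHASE_1_COMPONENTS)
    if epoch >= DIFF_EPOCH:
        components.append("diffusion")
    if epoch >= JOINT_EPOCH:
        components += ["decoder", "style_encoder", "slm_adversarial"]
    return components


def learning_rate(component):
    """BERT uses 1e-5 because it is already pre-trained; everything else uses 1e-4."""
    return 1e-5 if component == "bert" else 1e-4
```

Each phase only adds components to the previous one, which is why the schedule reduces to two thresholds.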
### Model parameters
| Parameter | Value |
|---|---|
| Phoneme vocabulary (`n_token`) | 178 |
| Hidden dimension | 512 |
| Style dimension | 128 |
| Mel channels | 80 |
| Sample rate | 24 kHz |
| Decoder | iSTFTNet |
| Diffusion transformer | 3 layers, 8 heads |
| SLM | `microsoft/wavlm-base-plus` |
---
## Evaluation
The model was evaluated qualitatively: a set of test sentences in Valencian was synthesized, and native Valencian speakers assessed the naturalness of the generated audio. Inference was performed with the reference-audio notebook (`StyleTTS2_Demo_LibriTTS.ipynb`), which is the recommended way to use this model (see [Limitations](#limitations)).
The native-speaker assessment confirms that:
- The pronunciation is recognizably Valencian, with the expected vowel realizations and word-level stress patterns.
- The main quality issue reported is background noise in the generated audio, particularly at the beginning and end of utterances. This noise is more pronounced when generating without a reference audio, and is attributable to the single-speaker / multi-speaker training mismatch described above. Pronunciation and naturalness themselves are not affected.
The model has not yet been benchmarked with formal subjective (MOS) or objective (WER, CER, ScoreQ) metrics. A future iteration with proper per-speaker labelling of the corpus could substantially improve quality and stability.
---
## Citation
If this model contributes to your research, please cite the work:
```
@misc{gplsi2026styletts2valenciano,
title={StyleTTS2 Valenciano},
author={{GPLSI Group, Universidad de Alicante}},
url={https://huggingface.co/gplsi/StyleTTS2-va},
year={2026}
}
```
We also recommend citing the original StyleTTS2 and PL-BERT papers:
```
@inproceedings{li2023styletts2,
title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S and Mischler, Gavin and Mesgarani, Nima},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}
@misc{li2023plbert,
title={Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions},
author={Yinghao Aaron Li and Cong Han and Xilin Jiang and Nima Mesgarani},
year={2023},
eprint={2301.08810},
archivePrefix={arXiv}
}
```
---
## Additional Information
### Author
The [GPLSI](https://gplsi.dlsi.ua.es/) research group at the University of Alicante.
### Contact
For further information, please open an issue in the [project repository](https://github.com/javimosa/styletts2-valenciano).
### Copyright
Copyright (c) 2026 GPLSI, Universidad de Alicante.
### License
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)