---
license: apache-2.0
language:
  - eu
tags:
  - TTS
  - PL-BERT
  - WordPiece
  - hitz-aholab
---

# PL-BERT-eu

## Overview


### Model Description

PL-BERT-eu is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the PL-BERT architecture and learns phoneme representations via a masked language modeling (MLM) objective.

This model supports phoneme-based text-to-speech (TTS) systems such as StyleTTS2, using a Basque-specific phoneme vocabulary and contextual embeddings.

Features of our PL-BERT:

- It is trained exclusively on phonemized Basque Wikipedia text.
- It uses a reduced phoneme vocabulary of 178 tokens.
- It uses a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py` (see the sketch below).
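
The token map can be inspected directly. A minimal sketch, assuming `token_maps_eu.pkl` mirrors the dictionary structure of `token_maps.pkl` in the reference PL-BERT repository; verify against the shipped file:

```python
import pickle

# Hypothetical sketch: inspect the custom token map. The structure is
# assumed to mirror token_maps.pkl from the reference PL-BERT repository
# (a dict keyed by tokenizer id); verify against the shipped file.
with open("token_maps_eu.pkl", "rb") as f:
    token_maps = pickle.load(f)

print(len(token_maps), "entries")
print(next(iter(token_maps.items())))  # peek at one (id, mapping) pair
```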

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.

### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2

2. Inside the `Utils` directory, create a new folder, for example `PLBERT_eu`.

3. Copy the following files into that folder:

   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)

4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   ```yaml
   PLBERT_dir: Utils/PLBERT_eu
   ```

5. Update the import statement in your code (see the loading sketch further below):

   ```python
   from Utils.PLBERT_eu.util import load_plbert
   ```

6. We used code developed by Aholab to generate IPA phonemes for training the model. A demo of the Basque phonemizer is available at arrandi/phonemizer-eus-esp, and the code used to generate the IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment (sketched immediately below).
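
The phoneme collapsing mentioned in step 6 can be pictured as a longest-first string substitution. A hypothetical sketch; the `COLLAPSE` entries are invented placeholders, and the real mapping lives in the `phonemizer` directory:

```python
# Illustrative placeholders only; these are NOT the actual Basque mapping.
COLLAPSE = {
    "tʃ": "ʧ",  # two codepoints -> one single-codepoint affricate symbol
    "ts": "ʦ",
}

def collapse_phonemes(ipa: str) -> str:
    # Substitute longer keys first so shorter ones cannot shadow them.
    for multi in sorted(COLLAPSE, key=len, reverse=True):
        ipa = ipa.replace(multi, COLLAPSE[multi])
    return ipa

print(collapse_phonemes("tʃakur"))  # -> "ʧakur" (txakur, "dog")
```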

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.
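
Once steps 1–5 are done, the model can be loaded and queried for contextual phoneme embeddings. A minimal sketch, run from the StyleTTS2 root; the phoneme ids below are invented placeholders, and the return convention (last hidden state) follows the reference PL-BERT `util.py`:

```python
import torch
from Utils.PLBERT_eu.util import load_plbert

# Load config.yml plus the step_4000000.t7 checkpoint from the folder
# created in steps 2-3.
plbert = load_plbert("Utils/PLBERT_eu")
plbert.eval()

# Placeholder phoneme ids; real ids come from the WordPiece tokenizer
# applied to phonemized Basque text.
phoneme_ids = torch.tensor([[5, 17, 42, 8]])
mask = torch.ones_like(phoneme_ids)

with torch.no_grad():
    hidden = plbert(phoneme_ids, attention_mask=mask)
print(hidden.shape)  # expected: (1, 4, 768), one vector per phoneme
```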


## Training Details

### Training data

The model was trained on a Basque corpus phonemized using Modelo1y2. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement (illustrated below)
- Training steps: 4,000,000
- Precision: mixed precision (fp16)
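
As a rough illustration of the masking strategy, the sketch below corrupts one word's phonemes using the mask token and probabilities listed under the training configuration. The exact branching in the training code may differ; this is illustrative, not the actual data loader:

```python
import random

MASK_TOKEN = "M"  # token_mask from the training configuration

def corrupt_word(phonemes, vocab, word_mask_prob=0.15,
                 phoneme_mask_prob=0.1, replace_prob=0.2):
    """Possibly corrupt one word's phoneme list for the MLM objective."""
    if random.random() >= word_mask_prob:
        return list(phonemes)              # leave the word untouched
    r = random.random()
    if r < replace_prob:
        # substitute random phonemes drawn from the vocabulary
        return [random.choice(vocab) for _ in phonemes]
    if r < replace_prob + phoneme_mask_prob:
        # mask individual phonemes inside the word
        return [MASK_TOKEN if random.random() < 0.5 else p for p in phonemes]
    return [MASK_TOKEN] * len(phonemes)    # mask the whole word

vocab = list("abdefgiklmnoprstuxz")        # toy phoneme inventory
print(corrupt_word(list("ʧakur"), vocab))
```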

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1
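
The parameter names above (embedding size, hidden groups, inner group number) point to an ALBERT-style backbone, as in the reference PL-BERT. As a sketch, they map onto a `transformers` `AlbertConfig` roughly as follows; values mirror the list above rather than being copied from the shipped `config.yml`:

```python
from transformers import AlbertConfig, AlbertModel

# Sketch: the hyperparameters above expressed as an AlbertConfig.
config = AlbertConfig(
    vocab_size=178,
    embedding_size=128,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_hidden_groups=1,
    inner_group_num=1,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = AlbertModel(config)  # random init; real weights come from the checkpoint
print(sum(p.numel() for p in model.parameters()), "parameters")
```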

Other parameters:

- Batch size: 32
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
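
The scheduler settings map directly onto PyTorch's `OneCycleLR`. A minimal sketch; the model and the choice of AdamW here are placeholders, not the actual training setup:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Warm up over the first 10% of the 4,000,000 steps, then cosine-anneal.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-4,             # learning rate
    total_steps=4_000_000,   # training steps
    pct_start=0.1,
    anneal_strategy="cos",   # cosine annealing
    div_factor=25,           # initial lr = max_lr / 25
    final_div_factor=10000,  # final lr = initial lr / 10000
)
```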

## Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.


## Citation

If this code contributes to your research, please cite the work:

```bibtex
@misc{aarriandiagaplberteu,
  title={PL-BERT-eu},
  author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  organization={Hitz (Aholab) - EHU},
  url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
  year={2026}
}
```

## Additional Information

### Author

Ander Arriandiaga, Aholab (HiTZ), EHU

### Contact

For further information, please send an email to inma.hernaez@ehu.eus.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

Apache-2.0

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU – NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.