---
license: apache-2.0
language:
  - eu
tags:
  - TTS
  - PL-BERT
  - WordPiece
  - hitz-aholab
---

# PL-BERT-eu

## Overview


### Model Description

PL-BERT-eu is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the PL-BERT architecture and learns phoneme representations via a masked language modeling (MLM) objective.

This model supports phoneme-based text-to-speech (TTS) systems such as StyleTTS2, using a Basque-specific phoneme vocabulary and contextual embeddings.

Features of our PL-BERT:

- It is trained exclusively on phonemized Basque Wikipedia text.
- It uses a reduced phoneme vocabulary of 178 tokens.
- It uses a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py` (see the sketch below).
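
The token map can be inspected directly. A minimal sketch, assuming `token_maps_eu.pkl` mirrors the dictionary structure of `token_maps.pkl` in the reference PL-BERT repository; verify against the shipped file:

```python
import pickle

# Hypothetical sketch: inspect the custom token map. The structure is
# assumed to mirror token_maps.pkl from the reference PL-BERT repository
# (a dict keyed by tokenizer id); verify against the shipped file.
with open("token_maps_eu.pkl", "rb") as f:
    token_maps = pickle.load(f)

print(len(token_maps), "entries")
print(next(iter(token_maps.items())))  # peek at one (id, mapping) pair
```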

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.

### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2

2. Inside the `Utils` directory, create a new folder, for example `PLBERT_eu`.

3. Copy the following files into that folder:

   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)

4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   ```yaml
   PLBERT_dir: Utils/PLBERT_eu
   ```

5. Update the import statement in your code (see the loading sketch further below):

   ```python
   from Utils.PLBERT_eu.util import load_plbert
   ```

6. We used code developed by Aholab to generate IPA phonemes for training the model. A demo of the Basque phonemizer is available at arrandi/phonemizer-eus-esp, and the code used to generate the IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment (sketched immediately below).
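
The phoneme collapsing mentioned in step 6 can be pictured as a longest-first string substitution. A hypothetical sketch; the `COLLAPSE` entries are invented placeholders, and the real mapping lives in the `phonemizer` directory:

```python
# Illustrative placeholders only; these are NOT the actual Basque mapping.
COLLAPSE = {
    "tʃ": "ʧ",  # two codepoints -> one single-codepoint affricate symbol
    "ts": "ʦ",
}

def collapse_phonemes(ipa: str) -> str:
    # Substitute longer keys first so shorter ones cannot shadow them.
    for multi in sorted(COLLAPSE, key=len, reverse=True):
        ipa = ipa.replace(multi, COLLAPSE[multi])
    return ipa

print(collapse_phonemes("tʃakur"))  # -> "ʧakur" (txakur, "dog")
```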

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.
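
Once steps 1–5 are done, the model can be loaded and queried for contextual phoneme embeddings. A minimal sketch, run from the StyleTTS2 root; the phoneme ids below are invented placeholders, and the return convention (last hidden state) follows the reference PL-BERT `util.py`:

```python
import torch
from Utils.PLBERT_eu.util import load_plbert

# Load config.yml plus the step_4000000.t7 checkpoint from the folder
# created in steps 2-3.
plbert = load_plbert("Utils/PLBERT_eu")
plbert.eval()

# Placeholder phoneme ids; real ids come from the WordPiece tokenizer
# applied to phonemized Basque text.
phoneme_ids = torch.tensor([[5, 17, 42, 8]])
mask = torch.ones_like(phoneme_ids)

with torch.no_grad():
    hidden = plbert(phoneme_ids, attention_mask=mask)
print(hidden.shape)  # expected: (1, 4, 768), one vector per phoneme
```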


## Training Details

### Training data

The model was trained on a Basque corpus phonemized using Modelo1y2. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement (illustrated below)
- Training steps: 4,000,000
- Precision: mixed precision (fp16)
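
As a rough illustration of the masking strategy, the sketch below corrupts one word's phonemes using the mask token and probabilities listed under the training configuration. The exact branching in the training code may differ; this is illustrative, not the actual data loader:

```python
import random

MASK_TOKEN = "M"  # token_mask from the training configuration

def corrupt_word(phonemes, vocab, word_mask_prob=0.15,
                 phoneme_mask_prob=0.1, replace_prob=0.2):
    """Possibly corrupt one word's phoneme list for the MLM objective."""
    if random.random() >= word_mask_prob:
        return list(phonemes)              # leave the word untouched
    r = random.random()
    if r < replace_prob:
        # substitute random phonemes drawn from the vocabulary
        return [random.choice(vocab) for _ in phonemes]
    if r < replace_prob + phoneme_mask_prob:
        # mask individual phonemes inside the word
        return [MASK_TOKEN if random.random() < 0.5 else p for p in phonemes]
    return [MASK_TOKEN] * len(phonemes)    # mask the whole word

vocab = list("abdefgiklmnoprstuxz")        # toy phoneme inventory
print(corrupt_word(list("ʧakur"), vocab))
```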

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1
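
The parameter names above (embedding size, hidden groups, inner group number) point to an ALBERT-style backbone, as in the reference PL-BERT. As a sketch, they map onto a `transformers` `AlbertConfig` roughly as follows; values mirror the list above rather than being copied from the shipped `config.yml`:

```python
from transformers import AlbertConfig, AlbertModel

# Sketch: the hyperparameters above expressed as an AlbertConfig.
config = AlbertConfig(
    vocab_size=178,
    embedding_size=128,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_hidden_groups=1,
    inner_group_num=1,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)
model = AlbertModel(config)  # random init; real weights come from the checkpoint
print(sum(p.numel() for p in model.parameters()), "parameters")
```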

Other parameters:

- Batch size: 32
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
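
The scheduler settings map directly onto PyTorch's `OneCycleLR`. A minimal sketch; the model and the choice of AdamW here are placeholders, not the actual training setup:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Warm up over the first 10% of the 4,000,000 steps, then cosine-anneal.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-4,             # learning rate
    total_steps=4_000_000,   # training steps
    pct_start=0.1,
    anneal_strategy="cos",   # cosine annealing
    div_factor=25,           # initial lr = max_lr / 25
    final_div_factor=10000,  # final lr = initial lr / 10000
)
```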

## Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.


## Citation

If this code contributes to your research, please cite the work:

```bibtex
@misc{aarriandiagaplberteu,
  title={PL-BERT-eu},
  author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  organization={Hitz (Aholab) - EHU},
  url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
  year={2026}
}
```

## Additional Information

### Author

Ander Arriandiaga, Aholab (HiTZ), EHU

### Contact

For further information, please send an email to inma.hernaez@ehu.eus.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

Apache-2.0

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU – NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.