---
license: apache-2.0
language:
- eu
tags:
- TTS
- PL-BERT
- WordPiece
- hitz-aholab
---

# PL-BERT-eu

## Overview

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>


---

## Model Description

**PL-BERT-eu** is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the [PL-BERT architecture](https://github.com/yl4579/PL-BERT) and learns phoneme representations through a masked language modeling objective.

This model supports **phoneme-based text-to-speech (TTS) systems** such as [StyleTTS2](https://github.com/yl4579/StyleTTS2) by providing a Basque-specific phoneme vocabulary and contextual phoneme embeddings.

Features of our PL-BERT:
- It is trained **exclusively on Basque** phonemized Wikipedia text.
- It uses a reduced **phoneme vocabulary of 178 tokens**.
- It utilizes a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py` (the token map can be inspected as shown below).
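
This is a minimal sketch of how the token map could be inspected, assuming `token_maps_eu.pkl` has been downloaded to the working directory; the exact structure of each entry follows the PL-BERT codebase, so the printed entry is only illustrative.

```python
# Minimal sketch: loading the reduced token map shipped with this repository.
# Assumes token_maps_eu.pkl has been downloaded to the working directory.
import pickle

with open("token_maps_eu.pkl", "rb") as f:
    token_maps = pickle.load(f)

print(len(token_maps))                 # number of mapped tokenizer tokens
print(next(iter(token_maps.items())))  # one example entry (structure follows PL-BERT)
```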

---

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.


### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

---

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2
2. Inside the `Utils` directory, create a new folder, for example: `PLBERT_eu`.
3. Copy the following files into that folder:
   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)

4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   `PLBERT_dir: Utils/PLBERT_eu`

5. Update the import statement in your code to the following (a minimal loading sketch follows this list):

   `from Utils.PLBERT_eu.util import load_plbert`

6. We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Likewise, the code used to generate IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.
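
Putting steps 1-5 together, here is a minimal loading sketch. It assumes `load_plbert` behaves as in the original PL-BERT `util.py` (reading `config.yml` and the `step_*.t7` checkpoint from the given directory), and the phoneme-token IDs below are placeholders for IDs produced with this repository's tokenizer and `token_maps_eu.pkl`.

```python
# Minimal loading sketch (see steps 1-5 above); the IDs shown are placeholders.
import torch
from Utils.PLBERT_eu.util import load_plbert

plbert = load_plbert("Utils/PLBERT_eu")  # reads config.yml and the step_*.t7 checkpoint
plbert.eval()

phoneme_ids = torch.tensor([[2, 17, 45, 9, 2]])  # placeholder IDs; 2 is the word separator
attention_mask = torch.ones_like(phoneme_ids)

with torch.no_grad():
    embeddings = plbert(phoneme_ids, attention_mask=attention_mask)

print(embeddings.shape)  # expected: (batch, sequence length, 768)
```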

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.

---

## Training Details

### Training data

The model was trained on a Basque corpus phonemized using **Modelo1y2**. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement (illustrated in the sketch below)
- Training steps: 4,000,000
- Precision: mixed precision (fp16)
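
As an illustration of the masking strategy (not the exact training code), the sketch below selects words and then masks or replaces individual phonemes inside them, using the probabilities listed in the training configuration further down; the phoneme inventory and helper names are placeholders.

```python
# Illustrative sketch of phoneme-level masking and replacement (not the exact
# PL-BERT training code). Words are selected with p = 0.15; inside a selected
# word each phoneme is masked with p = 0.1 or replaced with p = 0.2.
import random

MASK_TOKEN = "M"                                # token mask from the configuration
PHONEME_INVENTORY = list("abdefgiklmnoprstuz")  # placeholder inventory

def corrupt_word(phonemes, mask_p=0.1, replace_p=0.2):
    corrupted = []
    for ph in phonemes:
        r = random.random()
        if r < mask_p:
            corrupted.append(MASK_TOKEN)
        elif r < mask_p + replace_p:
            corrupted.append(random.choice(PHONEME_INVENTORY))
        else:
            corrupted.append(ph)
    return corrupted

def corrupt_sentence(words, word_mask_p=0.15):
    return [corrupt_word(w) if random.random() < word_mask_p else list(w) for w in words]

print(corrupt_sentence([list("kaixo"), list("mundua")]))
```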

### Training configuration

Model parameters (expressed as an `AlbertConfig` in the sketch after this list):

- Vocabulary size: 178  
- Hidden size: 768  
- Attention heads: 12  
- Intermediate size: 2048  
- Number of layers: 12  
- Max position embeddings: 512  
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1  
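
For reference, these values correspond to an ALBERT-style backbone and could be expressed as a Hugging Face `AlbertConfig` roughly as sketched below. This mapping is an assumption for illustration only: in practice the adapted `util.py` builds the model from `config.yml`, and the downscale factor has no `AlbertConfig` equivalent.

```python
# Sketch only: the model parameters above expressed as a Hugging Face AlbertConfig.
# In practice the adapted util.py builds the model from config.yml and loads the
# trained weights from step_4000000.t7.
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig(
    vocab_size=178,
    embedding_size=128,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,        # 1 hidden group x 12 layers per group
    num_hidden_groups=1,
    inner_group_num=1,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
)
model = AlbertModel(config)  # randomly initialised; the trained weights live in the checkpoint
```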

Other parameters (the scheduler settings are sketched after this list):

- Batch size: 32  
- Max mel length: 512  
- Word mask probability: 0.15  
- Phoneme mask probability: 0.1  
- Replacement probability: 0.2  
- Token separator: space  
- Token mask: M  
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
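
For reference, these scheduler settings map onto PyTorch's `OneCycleLR` roughly as in the sketch below; the optimizer choice and the placeholder model are assumptions, and the actual training loop is the one in the PL-BERT repository.

```python
# Sketch of the scheduler settings above using torch.optim.lr_scheduler.OneCycleLR.
# The optimizer (AdamW) and the placeholder model are assumptions for illustration.
import torch

model = torch.nn.Linear(128, 768)  # placeholder; stands in for the PL-BERT parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-4,              # learning rate
    total_steps=4_000_000,    # training steps
    pct_start=0.1,
    anneal_strategy="cos",    # cosine annealing
    div_factor=25,
    final_div_factor=10_000,
)

# During training, scheduler.step() is called after each optimizer.step().
```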


### Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.

---

## Citation

If this code contributes to your research, please cite the work:

```
@misc{aarriandiagaplberteu,
   title={PL-BERT-eu}, 
   author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
   organization={Hitz (Aholab) - EHU},
   url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
   year={2026}
}
```

## Additional Information


### Author

Author: [Ander Arriandiaga](https://huggingface.co/arrandi) — Aholab (Hitz), EHU

### Contact
For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright
Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)


### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.