Ander Arriandiaga committed
Commit 2395c1f · 1 Parent(s): c6b7fda

Init repo: configure LFS and ignore phonemizer/

Files changed (5)
  1. README.md +170 -0
  2. config.yml +59 -0
  3. step_4000000.t7 +3 -0
  4. token_maps_eu.pkl +3 -0
  5. util.py +47 -0
README.md ADDED
@@ -0,0 +1,170 @@
---
license: apache-2.0
language:
- eu
tags:
- TTS
- PL-BERT
- WordPiece
- hitz-aholab
---

# PL-BERT-eu

## Overview

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

---

## Model Description

**PL-BERT-eu** is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the [PL-BERT architecture](https://github.com/yl4579/PL-BERT) and learns phoneme representations via a masked language modeling objective.

This model supports **phoneme-based text-to-speech (TTS) systems**, such as [StyleTTS2](https://github.com/yl4579/StyleTTS2), using a Basque-specific phoneme vocabulary and contextual embeddings.

Features of our PL-BERT:
- It is trained **exclusively on Basque** phonemized Wikipedia text.
- It uses a reduced **phoneme vocabulary of 178 tokens**.
- It uses a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py`.

---

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.

### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

---

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2
2. Inside the `Utils` directory, create a new folder, for example `PLBERT_eu`.
3. Copy the following files into that folder:
   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)
4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   `PLBERT_dir: Utils/PLBERT_eu`

5. Update the import statement in your code to:

   `from Utils.PLBERT_eu.util import load_plbert`

6. We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Likewise, the code used to generate IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.
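
After copying the files, a quick sanity check can confirm that the folder contains everything `load_plbert` expects. This is a hypothetical helper sketch (not part of StyleTTS2 or this repository), assuming the layout described in the steps above:

```python
import os

# Files that load_plbert reads from the PL-BERT directory (per the steps above)
REQUIRED = ("config.yml", "step_4000000.t7", "util.py")

def check_plbert_dir(plbert_dir):
    """Return the list of required files missing from the PL-BERT folder."""
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(plbert_dir, f))]
```

If the returned list is non-empty, copy the missing files before pointing `PLBERT_dir` at the folder.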

---

## Training Details

### Training data

The model was trained on a Basque corpus phonemized using **Modelo1y2**. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement
- Training steps: 4,000,000
- Precision: mixed precision (fp16)
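
The word-level masking and replacement scheme can be illustrated with a rough, self-contained sketch. This is a hypothetical simplification (it ignores the separate per-phoneme masking probability); the actual training code follows the PL-BERT implementation, and the probabilities below come from the configuration:

```python
import random

def mask_words(words, vocab, word_mask_prob=0.15, replace_prob=0.2,
               mask_token="M", rng=None):
    """Sketch of PL-BERT-style masking: select ~15% of words; a selected
    word's phonemes are either swapped for random vocabulary phonemes
    (with probability replace_prob) or overwritten with the mask token."""
    rng = rng or random.Random(0)
    out = []
    for word in words:
        if rng.random() < word_mask_prob:
            if rng.random() < replace_prob:
                out.append([rng.choice(vocab) for _ in word])  # random replacement
            else:
                out.append([mask_token] * len(word))           # masked word
        else:
            out.append(list(word))                             # left intact
    return out
```

The model is then trained to recover the original phonemes (and graphemes) at the corrupted positions.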

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1

Other parameters:

- Batch size: 32
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
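
The scheduler settings above imply a cosine warm-up to the peak learning rate over the first 10% of steps, then a cosine decay. A minimal stdlib sketch of that schedule, assuming PyTorch's OneCycleLR conventions (`initial_lr = max_lr / div_factor`, `final_lr = initial_lr / final_div_factor`):

```python
import math

def onecycle_lr(step, total_steps, max_lr=2e-4, pct_start=0.1,
                div_factor=25, final_div_factor=1e4):
    """Cosine-annealed one-cycle learning-rate schedule (sketch)."""
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        t = step / warmup_steps  # cosine ramp from initial_lr to max_lr
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # cosine decay
    return final_lr + (max_lr - final_lr) * (1 + math.cos(math.pi * t)) / 2
```

With these values, training starts at 8e-6, peaks at 2e-4 around step 400,000, and decays toward 8e-10 by step 4,000,000.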

### Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.

---

## Citation

If this code contributes to your research, please cite the work:

```
@misc{aarriandiagaplberteu,
  title={PL-BERT-eu},
  author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  organization={Hitz (Aholab) - EHU},
  url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
  year={2026}
}
```

## Additional Information

### Author

Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (Hitz), EHU

### Contact

For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU – NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.
config.yml ADDED
@@ -0,0 +1,59 @@
# Training configuration for Phoneme Tokenizer - based on WB run ofnglulb
model_type: "albert"

log_dir: "Checkpoint_Phoneme_Albert_correct_0002"
mixed_precision: "fp16"
data_folder: "wiki_phoneme/eu/dataset_v2_fixed_clean"
batch_size: 32
# Align save/log intervals with production
save_interval: 10000
log_interval: 1000
num_process: 1
# Full training steps from production
num_steps: 4000000
# Learning rate and scheduler to match production onecycle
learning_rate: 0.0002
alignment_approach: "phoneme"

# Scheduler configuration
scheduler_type: onecycle
warmup_ratio: 0.1
anneal_strategy: cos
div_factor: 25
final_div_factor: 10000

# Wandb configuration
wandb:
  project: "basque-pl-bert"
  experiment_name: "Phoneme_Albert_correct_phoneme_0002"
  entity: null
  tags: ["basque", "phoneme", "albert", "correct"]

# Dataset parameters
dataset_params:
  tokenizer_type: "phoneme"
  phoneme_tokenizer_path: "tokenizer/token_maps_eu.pkl"
  tokenizer: "ixa-ehu/berteus-base-cased"
  token_maps: "token_maps.pkl"
  token_separator: " "
  token_mask: "M"
  word_separator: 2
  max_mel_length: 512
  word_mask_prob: 0.15
  phoneme_mask_prob: 0.1
  replace_prob: 0.2

# Model parameters (ALBERT configuration)
model_params:
  vocab_size: 178
  hidden_size: 768
  num_attention_heads: 12
  intermediate_size: 2048
  max_position_embeddings: 512
  num_hidden_layers: 12
  dropout: 0.1
  embedding_size: 128
  num_hidden_groups: 1
  num_hidden_layers_per_group: 12
  inner_group_num: 1
  down_scale_factor: 1
step_4000000.t7 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd5f5e669db09e598da990fe4e8897128bd8f7ffa15b877151b15b7521565d4a
size 533867882
token_maps_eu.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5dbdd445a7d13f965801266cc35655223444ca7519121ef3def67f65dd9ebc34
size 713702
util.py ADDED
@@ -0,0 +1,47 @@
import os
import yaml
import torch
from collections import OrderedDict
from transformers import AlbertConfig, AlbertModel


class CustomAlbert(AlbertModel):
    def forward(self, *args, **kwargs):
        # Call the original forward method and return only the last hidden state
        outputs = super().forward(*args, **kwargs)
        return outputs.last_hidden_state


def load_plbert(log_dir):
    config_path = os.path.join(log_dir, "config.yml")
    with open(config_path) as f:
        plbert_config = yaml.safe_load(f)

    albert_base_configuration = AlbertConfig(**plbert_config['model_params'])
    bert = CustomAlbert(albert_base_configuration)

    # Pick the checkpoint with the highest step count
    ckpts = [f for f in os.listdir(log_dir)
             if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
    latest = sorted(int(f.split('_')[-1].split('.')[0]) for f in ckpts)[-1]
    ckpt_path = os.path.join(log_dir, "step_" + str(latest) + ".t7")

    try:
        # `weights_only` only exists in newer torch versions
        checkpoint = torch.load(ckpt_path, map_location='cpu', weights_only=False)
    except TypeError:
        checkpoint = torch.load(ckpt_path, map_location='cpu')

    state_dict = checkpoint['net']
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = k[7:]  # remove `module.` prefix added by DataParallel
        if name.startswith('encoder.'):
            name = name[8:]  # remove `encoder.`
        new_state_dict[name] = v

    # Remove optional keys that may not exist across different checkpoint formats
    new_state_dict.pop("embeddings.position_ids", None)
    new_state_dict.pop("position_ids", None)
    bert.load_state_dict(new_state_dict, strict=False)

    return bert
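
The checkpoint-selection step of `load_plbert` can be exercised on its own; a stdlib-only restatement of that logic (a sketch, independent of torch and transformers):

```python
import os

def latest_checkpoint(log_dir):
    """Return the path of the `step_*.t7` file with the highest step count,
    mirroring the selection logic in util.py."""
    ckpts = [f for f in os.listdir(log_dir)
             if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
    steps = sorted(int(f.split("_")[-1].split(".")[0]) for f in ckpts)
    return os.path.join(log_dir, "step_" + str(steps[-1]) + ".t7")
```

In this repository the directory contains a single checkpoint, so `latest_checkpoint` simply resolves to `step_4000000.t7`.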