Ander Arriandiaga committed · Commit 2395c1f
Parent(s): c6b7fda

Init repo: configure LFS and ignore phonemizer/

Files changed:
- README.md (+170 -0)
- config.yml (+59 -0)
- step_4000000.t7 (+3 -0)
- token_maps_eu.pkl (+3 -0)
- util.py (+47 -0)
README.md
ADDED
@@ -0,0 +1,170 @@
---
license: apache-2.0
language:
- eu
tags:
- TTS
- PL-BERT
- WordPiece
- hitz-aholab
---

# PL-BERT-eu

## Overview

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

---

## Model Description

**PL-BERT-eu** is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the [PL-BERT architecture](https://github.com/yl4579/PL-BERT) and learns phoneme representations via a masked language modeling objective.

This model supports **phoneme-based text-to-speech (TTS) systems** such as [StyleTTS2](https://github.com/yl4579/StyleTTS2) by providing a Basque-specific phoneme vocabulary and contextual phoneme embeddings.

Features of our PL-BERT:
- It is trained **exclusively on Basque** phonemized Wikipedia text.
- It uses a reduced **phoneme vocabulary of 178 tokens**.
- It uses a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py`.

---

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.

### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

---

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2
2. Inside the `Utils` directory, create a new folder, for example `PLBERT_eu`.
3. Copy the following files into that folder:
   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)
4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   `PLBERT_dir: Utils/PLBERT_eu`

5. Update the import statement in your code to:

   `from Utils.PLBERT_eu.util import load_plbert`

6. We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. A demo of the Basque phonemizer is available at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp), and the code used to generate the IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.
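
Once the files are in place, the checkpoint can be loaded and probed on its own. The sketch below is a minimal smoke test, not the StyleTTS2 pipeline itself; it assumes it is run from the StyleTTS2 root, and the random token IDs are hypothetical stand-ins for real phoneme IDs produced by the phonemizer and `token_maps_eu.pkl`.

```python
import torch
from Utils.PLBERT_eu.util import load_plbert

# Load the latest step_*.t7 checkpoint found in the folder from step 2.
plbert = load_plbert("Utils/PLBERT_eu")
plbert.eval()

# Hypothetical batch of phoneme token IDs (the vocabulary has 178 tokens).
phoneme_ids = torch.randint(low=3, high=178, size=(1, 24))

with torch.no_grad():
    # CustomAlbert returns last_hidden_state directly.
    hidden = plbert(phoneme_ids)

print(hidden.shape)  # torch.Size([1, 24, 768]): one contextual embedding per phoneme
```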

---

## Training Details

### Training data

The model was trained on a Basque corpus phonemized using **Modelo1y2**. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement
- Training steps: 4,000,000
- Precision: mixed-precision (fp16)
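
To make the masking strategy concrete, here is a toy sketch of how phoneme-level masking and replacement could work with the probabilities from `config.yml` (`word_mask_prob: 0.15`, `phoneme_mask_prob: 0.1`, `replace_prob: 0.2`, mask symbol `M`). This illustrates the scheme rather than reproducing the training code, and the phoneme inventory below is a hypothetical placeholder.

```python
import random

PHONEMES = list("abdefgiklmnoprstuz")  # hypothetical single-character phoneme set

def mask_words(words, word_mask_prob=0.15, phoneme_mask_prob=0.1, replace_prob=0.2):
    """Mask whole words or single phonemes, occasionally replacing a phoneme."""
    out = []
    for word in words:
        if random.random() < word_mask_prob:
            out.append("M" * len(word))  # mask the entire word, phoneme by phoneme
            continue
        chars = []
        for ph in word:
            r = random.random()
            if r < phoneme_mask_prob:
                chars.append("M")  # mask a single phoneme
            elif r < phoneme_mask_prob + replace_prob:
                chars.append(random.choice(PHONEMES))  # replace with a random phoneme
            else:
                chars.append(ph)  # keep the original phoneme
        out.append("".join(chars))
    return " ".join(out)  # words are joined with the space token separator

print(mask_words("kaixo mundu ederra".split()))
```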

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1

Other parameters:

- Batch size: 32
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
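
The model parameters above map directly onto a Hugging Face `AlbertConfig`, which is also how the bundled `util.py` builds the network (`AlbertConfig(**plbert_config['model_params'])`). A minimal sketch of the equivalent explicit construction:

```python
from transformers import AlbertConfig, AlbertModel

# Mirrors the model_params block in config.yml. Extra keys there, such as
# dropout and down_scale_factor, are PL-BERT additions; AlbertConfig would
# store them as custom attributes if passed through.
config = AlbertConfig(
    vocab_size=178,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=2048,
    num_hidden_layers=12,
    max_position_embeddings=512,
    embedding_size=128,
    num_hidden_groups=1,
    inner_group_num=1,
)

model = AlbertModel(config)
print(sum(p.numel() for p in model.parameters()))  # parameter count sanity check
```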

### Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.

---

## Citation

If this code contributes to your research, please cite it as follows:

```bibtex
@misc{aarriandiagaplberteu,
  title={PL-BERT-eu},
  author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  organization={Hitz (Aholab) - EHU},
  url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
  year={2026}
}
```

## Additional Information

### Author

Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (Hitz), EHU

### Contact

For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.

config.yml
ADDED
@@ -0,0 +1,59 @@
# Training configuration for Phoneme Tokenizer - based on WB run ofnglulb
model_type: "albert"

log_dir: "Checkpoint_Phoneme_Albert_correct_0002"
mixed_precision: "fp16"
data_folder: "wiki_phoneme/eu/dataset_v2_fixed_clean"
batch_size: 32
# Align save/log intervals with production
save_interval: 10000
log_interval: 1000
num_process: 1
# Full training steps from production
num_steps: 4000000
# Learning rate and scheduler to match production onecycle
learning_rate: 0.0002
alignment_approach: "phoneme"

# Scheduler configuration
scheduler_type: onecycle
warmup_ratio: 0.1
anneal_strategy: cos
div_factor: 25
final_div_factor: 10000

# Wandb configuration
wandb:
  project: "basque-pl-bert"
  experiment_name: "Phoneme_Albert_correct_phoneme_0002"
  entity: null
  tags: ["basque", "phoneme", "albert", "correct"]

# Dataset parameters
dataset_params:
  tokenizer_type: "phoneme"
  phoneme_tokenizer_path: "tokenizer/token_maps_eu.pkl"
  tokenizer: "ixa-ehu/berteus-base-cased"
  token_maps: "token_maps.pkl"
  token_separator: " "
  token_mask: "M"
  word_separator: 2
  max_mel_length: 512
  word_mask_prob: 0.15
  phoneme_mask_prob: 0.1
  replace_prob: 0.2

# Model parameters (ALBERT configuration)
model_params:
  vocab_size: 178
  hidden_size: 768
  num_attention_heads: 12
  intermediate_size: 2048
  max_position_embeddings: 512
  num_hidden_layers: 12
  dropout: 0.1
  embedding_size: 128
  num_hidden_groups: 1
  num_hidden_layers_per_group: 12
  inner_group_num: 1
  down_scale_factor: 1
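
The scheduler block above corresponds to PyTorch's built-in one-cycle policy. Below is a minimal sketch of how these values could be wired up; the model and optimizer are placeholders, not the actual training script.

```python
import torch

model = torch.nn.Linear(768, 768)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Maps the config keys onto torch.optim.lr_scheduler.OneCycleLR:
# learning_rate -> max_lr, num_steps -> total_steps, warmup_ratio -> pct_start;
# anneal_strategy, div_factor and final_div_factor pass through unchanged.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=2e-4,
    total_steps=4_000_000,
    pct_start=0.1,
    anneal_strategy="cos",
    div_factor=25,
    final_div_factor=10_000,
)

optimizer.step()
scheduler.step()  # called once per training step
print(scheduler.get_last_lr())
```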
step_4000000.t7
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd5f5e669db09e598da990fe4e8897128bd8f7ffa15b877151b15b7521565d4a
size 533867882
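
The three lines above are a Git LFS pointer, not the checkpoint itself; the actual ~534 MB file is resolved by LFS. Outside a full `git clone`, the checkpoint can also be fetched programmatically. A sketch using `huggingface_hub` (the `repo_id` below is a placeholder for this repository's actual id):

```python
from huggingface_hub import hf_hub_download

# repo_id is a placeholder; substitute this model repository's actual id.
ckpt_path = hf_hub_download(repo_id="<user>/PL-BERT-eu", filename="step_4000000.t7")
print(ckpt_path)  # local path to the resolved (non-pointer) checkpoint file
```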
token_maps_eu.pkl
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5dbdd445a7d13f965801266cc35655223444ca7519121ef3def67f65dd9ebc34
size 713702
util.py
ADDED
@@ -0,0 +1,47 @@
import os
from collections import OrderedDict

import yaml
import torch
from transformers import AlbertConfig, AlbertModel


class CustomAlbert(AlbertModel):
    def forward(self, *args, **kwargs):
        # Call the original forward method
        outputs = super().forward(*args, **kwargs)
        # Only return the last_hidden_state
        return outputs.last_hidden_state


def load_plbert(log_dir):
    config_path = os.path.join(log_dir, "config.yml")
    with open(config_path) as f:
        plbert_config = yaml.safe_load(f)

    albert_base_configuration = AlbertConfig(**plbert_config['model_params'])
    bert = CustomAlbert(albert_base_configuration)

    # Find the latest step_*.t7 checkpoint in the log directory.
    ckpts = [f for f in os.listdir(log_dir)
             if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
    iters = sorted(int(f.split('_')[-1].split('.')[0]) for f in ckpts)[-1]

    ckpt_path = os.path.join(log_dir, "step_" + str(iters) + ".t7")
    try:
        # Newer torch versions default to weights_only=True, which rejects
        # this checkpoint format, so request a full load explicitly.
        checkpoint = torch.load(ckpt_path, map_location='cpu', weights_only=False)
    except TypeError:
        # Older torch versions do not accept the weights_only argument.
        checkpoint = torch.load(ckpt_path, map_location='cpu')

    state_dict = checkpoint['net']
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = k[7:]  # remove `module.` (DataParallel prefix)
        if name.startswith('encoder.'):
            name = name[8:]  # remove `encoder.`
        new_state_dict[name] = v

    # Remove optional keys that may not exist across different checkpoint formats.
    new_state_dict.pop("embeddings.position_ids", None)
    new_state_dict.pop("position_ids", None)
    bert.load_state_dict(new_state_dict, strict=False)

    return bert