Ander Arriandiaga committed
Commit 2395c1f · 1 Parent(s): c6b7fda

Init repo: configure LFS and ignore phonemizer/

Files changed (5)
  1. README.md +170 -0
  2. config.yml +59 -0
  3. step_4000000.t7 +3 -0
  4. token_maps_eu.pkl +3 -0
  5. util.py +47 -0
README.md ADDED
@@ -0,0 +1,170 @@
---
license: apache-2.0
language:
- eu
tags:
- TTS
- PL-BERT
- WordPiece
- hitz-aholab
---

# PL-BERT-eu

## Overview

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

---

## Model Description

**PL-BERT-eu** is a phoneme-level masked language model trained on Basque Wikipedia text. It is based on the [PL-BERT architecture](https://github.com/yl4579/PL-BERT) and learns phoneme representations via a masked language modeling objective.

This model supports **phoneme-based text-to-speech (TTS) systems**, such as [StyleTTS2](https://github.com/yl4579/StyleTTS2), using a Basque-specific phoneme vocabulary and contextual embeddings.

Features of our PL-BERT:
- It is trained **exclusively on Basque** phonemized Wikipedia text.
- It uses a reduced **phoneme vocabulary of 178 tokens**.
- It uses a WordPiece tokenizer for phonemized Basque text.
- It includes a custom `token_maps_eu.pkl` and an adapted `util.py`.

---

## Intended Uses and Limitations

### Intended uses

- Integration into phoneme-based TTS pipelines such as StyleTTS2.
- Speech synthesis and phoneme embedding extraction for Basque.

### Limitations

- Not designed for general NLP tasks.
- Only supports Basque phoneme tokens.

---

## How to Get Started with the Model

Here is an example of how to use this model within the StyleTTS2 framework:

1. Clone the StyleTTS2 repository: https://github.com/yl4579/StyleTTS2
2. Inside the `Utils` directory, create a new folder, for example `PLBERT_eu`.
3. Copy the following files into that folder:
   - `config.yml` (training configuration)
   - `step_4000000.t7` (trained checkpoint)
   - `util.py` (modified to fix position ID loading)
4. In your StyleTTS2 configuration file, update the `PLBERT_dir` entry to:

   `PLBERT_dir: Utils/PLBERT_eu`

5. Update the import statement in your code to:

   `from Utils.PLBERT_eu.util import load_plbert`

6. We used code developed by [Aholab](https://aholab.ehu.eus/aholab/) to generate IPA phonemes for training the model. You can see a demo of the Basque phonemizer at [arrandi/phonemizer-eus-esp](https://huggingface.co/spaces/arrandi/phonemizer-eus-esp). Likewise, the code used to generate IPA phonemes can be found in the `phonemizer` directory. We collapsed multi-character phonemes into single-character phonemes for better grapheme–phoneme alignment.

**Note:** If second-stage StyleTTS2 training produces a NaN loss when using a single GPU, see [issue #254](https://github.com/yl4579/StyleTTS2/issues/254) in the original StyleTTS2 repository.
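
After copying the files, a quick sanity check can confirm that the folder contains everything `load_plbert` expects. This is a hypothetical helper sketch (not part of StyleTTS2 or this repository), assuming the layout described in the steps above:

```python
import os

# Files that load_plbert reads from the PL-BERT directory (per the steps above)
REQUIRED = ("config.yml", "step_4000000.t7", "util.py")

def check_plbert_dir(plbert_dir):
    """Return the list of required files missing from the PL-BERT folder."""
    return [f for f in REQUIRED if not os.path.isfile(os.path.join(plbert_dir, f))]
```

If the returned list is non-empty, copy the missing files before pointing `PLBERT_dir` at the folder.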

---

## Training Details

### Training data

The model was trained on a Basque corpus phonemized using **Modelo1y2**. It uses a consistent phoneme token set with boundary markers and masking tokens.

- Tokenizer: custom (splits on whitespace)
- Phoneme masking strategy: phoneme-level masking and replacement
- Training steps: 4,000,000
- Precision: mixed precision (fp16)
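
The word-level masking and replacement scheme can be illustrated with a rough, self-contained sketch. This is a hypothetical simplification (it ignores the separate per-phoneme masking probability); the actual training code follows the PL-BERT implementation, and the probabilities below come from the configuration:

```python
import random

def mask_words(words, vocab, word_mask_prob=0.15, replace_prob=0.2,
               mask_token="M", rng=None):
    """Sketch of PL-BERT-style masking: select ~15% of words; a selected
    word's phonemes are either swapped for random vocabulary phonemes
    (with probability replace_prob) or overwritten with the mask token."""
    rng = rng or random.Random(0)
    out = []
    for word in words:
        if rng.random() < word_mask_prob:
            if rng.random() < replace_prob:
                out.append([rng.choice(vocab) for _ in word])  # random replacement
            else:
                out.append([mask_token] * len(word))           # masked word
        else:
            out.append(list(word))                             # left intact
    return out
```

The model is then trained to recover the original phonemes (and graphemes) at the corrupted positions.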

### Training configuration

Model parameters:

- Vocabulary size: 178
- Hidden size: 768
- Attention heads: 12
- Intermediate size: 2048
- Number of layers: 12
- Max position embeddings: 512
- Dropout: 0.1
- Embedding size: 128
- Number of hidden groups: 1
- Number of hidden layers per group: 12
- Inner group number: 1
- Downscale factor: 1

Other parameters:

- Batch size: 32
- Max mel length: 512
- Word mask probability: 0.15
- Phoneme mask probability: 0.1
- Replacement probability: 0.2
- Token separator: space
- Token mask: M
- Word separator ID: 2
- Scheduler type: OneCycleLR
- Learning rate: 0.0002
- pct_start: 0.1
- Annealing strategy: cosine annealing
- div_factor: 25
- final_div_factor: 10000
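
The scheduler settings above imply a cosine warm-up to the peak learning rate over the first 10% of steps, then a cosine decay. A minimal stdlib sketch of that schedule, assuming PyTorch's OneCycleLR conventions (`initial_lr = max_lr / div_factor`, `final_lr = initial_lr / final_div_factor`):

```python
import math

def onecycle_lr(step, total_steps, max_lr=2e-4, pct_start=0.1,
                div_factor=25, final_div_factor=1e4):
    """Cosine-annealed one-cycle learning-rate schedule (sketch)."""
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor
    warmup_steps = pct_start * total_steps
    if step < warmup_steps:
        t = step / warmup_steps  # cosine ramp from initial_lr to max_lr
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    t = (step - warmup_steps) / (total_steps - warmup_steps)  # cosine decay
    return final_lr + (max_lr - final_lr) * (1 + math.cos(math.pi * t)) / 2
```

With these values, training starts at 8e-6, peaks at 2e-4 around step 400,000, and decays toward 8e-10 by step 4,000,000.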

### Evaluation

The model has been successfully integrated into StyleTTS2, where it enables Basque speech synthesis.

---

## Citation

If this code contributes to your research, please cite the work:

```
@misc{aarriandiagaplberteu,
  title={PL-BERT-eu},
  author={Ander Arriandiaga and Ibon Saratxaga and Eva Navas and Inma Hernaez},
  organization={Hitz (Aholab) - EHU},
  url={https://huggingface.co/langtech-veu/PL-BERT-wp_es},
  year={2026}
}
```

## Additional Information

### Author

Author: [Ander Arriandiaga](https://huggingface.co/arrandi), Aholab (Hitz), EHU

### Contact

For further information, please send an email to <inma.hernaez@ehu.eus>.

### Copyright

Copyright (c) 2026 by Aholab, HiTZ.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU – NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.
config.yml ADDED
@@ -0,0 +1,59 @@
# Training configuration for Phoneme Tokenizer - based on WB run ofnglulb
model_type: "albert"

log_dir: "Checkpoint_Phoneme_Albert_correct_0002"
mixed_precision: "fp16"
data_folder: "wiki_phoneme/eu/dataset_v2_fixed_clean"
batch_size: 32
# Align save/log intervals with production
save_interval: 10000
log_interval: 1000
num_process: 1
# Full training steps from production
num_steps: 4000000
# Learning rate and scheduler to match production onecycle
learning_rate: 0.0002
alignment_approach: "phoneme"

# Scheduler configuration
scheduler_type: onecycle
warmup_ratio: 0.1
anneal_strategy: cos
div_factor: 25
final_div_factor: 10000

# Wandb configuration
wandb:
  project: "basque-pl-bert"
  experiment_name: "Phoneme_Albert_correct_phoneme_0002"
  entity: null
  tags: ["basque", "phoneme", "albert", "correct"]

# Dataset parameters
dataset_params:
  tokenizer_type: "phoneme"
  phoneme_tokenizer_path: "tokenizer/token_maps_eu.pkl"
  tokenizer: "ixa-ehu/berteus-base-cased"
  token_maps: "token_maps.pkl"
  token_separator: " "
  token_mask: "M"
  word_separator: 2
  max_mel_length: 512
  word_mask_prob: 0.15
  phoneme_mask_prob: 0.1
  replace_prob: 0.2

# Model parameters (ALBERT configuration)
model_params:
  vocab_size: 178
  hidden_size: 768
  num_attention_heads: 12
  intermediate_size: 2048
  max_position_embeddings: 512
  num_hidden_layers: 12
  dropout: 0.1
  embedding_size: 128
  num_hidden_groups: 1
  num_hidden_layers_per_group: 12
  inner_group_num: 1
  down_scale_factor: 1
step_4000000.t7 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cd5f5e669db09e598da990fe4e8897128bd8f7ffa15b877151b15b7521565d4a
size 533867882
token_maps_eu.pkl ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5dbdd445a7d13f965801266cc35655223444ca7519121ef3def67f65dd9ebc34
size 713702
util.py ADDED
@@ -0,0 +1,47 @@
import os
import yaml
import torch
from collections import OrderedDict
from transformers import AlbertConfig, AlbertModel


class CustomAlbert(AlbertModel):
    def forward(self, *args, **kwargs):
        # Call the original forward method and return only the last hidden state
        outputs = super().forward(*args, **kwargs)
        return outputs.last_hidden_state


def load_plbert(log_dir):
    config_path = os.path.join(log_dir, "config.yml")
    with open(config_path) as f:
        plbert_config = yaml.safe_load(f)

    albert_base_configuration = AlbertConfig(**plbert_config['model_params'])
    bert = CustomAlbert(albert_base_configuration)

    # Pick the checkpoint with the highest step count
    ckpts = [f for f in os.listdir(log_dir)
             if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
    latest = sorted(int(f.split('_')[-1].split('.')[0]) for f in ckpts)[-1]
    ckpt_path = os.path.join(log_dir, "step_" + str(latest) + ".t7")

    try:
        # `weights_only` only exists in newer torch versions
        checkpoint = torch.load(ckpt_path, map_location='cpu', weights_only=False)
    except TypeError:
        checkpoint = torch.load(ckpt_path, map_location='cpu')

    state_dict = checkpoint['net']
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        name = k[7:]  # remove `module.` prefix added by DataParallel
        if name.startswith('encoder.'):
            name = name[8:]  # remove `encoder.`
        new_state_dict[name] = v

    # Remove optional keys that may not exist across different checkpoint formats
    new_state_dict.pop("embeddings.position_ids", None)
    new_state_dict.pop("position_ids", None)
    bert.load_state_dict(new_state_dict, strict=False)

    return bert
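
The checkpoint-selection step of `load_plbert` can be exercised on its own; a stdlib-only restatement of that logic (a sketch, independent of torch and transformers):

```python
import os

def latest_checkpoint(log_dir):
    """Return the path of the `step_*.t7` file with the highest step count,
    mirroring the selection logic in util.py."""
    ckpts = [f for f in os.listdir(log_dir)
             if f.startswith("step_") and os.path.isfile(os.path.join(log_dir, f))]
    steps = sorted(int(f.split("_")[-1].split(".")[0]) for f in ckpts)
    return os.path.join(log_dir, "step_" + str(steps[-1]) + ".t7")
```

In this repository the directory contains a single checkpoint, so `latest_checkpoint` simply resolves to `step_4000000.t7`.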