BSC-LT/PL-BERT-wp-ca

barcelona-supercomputing-center

cristinae committed on Mar 6

Commit 569f00e (verified) · 1 Parent(s): a0f0166

Update README.md

Files changed (1)

README.md +1 -1

@@ -88,7 +88,7 @@ Note: Although this example uses StyleTTS2, the model is compatible with other T
 ### Training data
-The model was trained on a phonemized Catalan corpus (any phonemizer can be used). The dataset includes sentences from speakers across Catalonia, Balearic Islands, and Valencia. It uses a consistent phoneme token set with boundary markers and masking tokens.
+The model was trained on a phonemized Catalan corpus (any phonemizer can be used) extracted from the [CATalog](https://huggingface.co/datasets/projecte-aina/CATalog) corpus. The dataset includes sentences from speakers across Catalonia, Balearic Islands, and Valencia. It uses a consistent phoneme token set with boundary markers and masking tokens.
 Tokenizer: custom (split using whitespaces)
 Phoneme masking strategy: word-level and phoneme-level masking and replacement
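
The README names a whitespace tokenizer and a combined word-level and phoneme-level masking strategy without showing either. The following is a minimal sketch of what such a scheme could look like, not the repository's actual code. The `<mask>` token, the probabilities, the treatment of each character as one phoneme, and the function names `tokenize` and `mask_words` are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the README does not give the real token set


def tokenize(text):
    """Split phonemized text on whitespace, one token per word."""
    return text.split()


def mask_words(words, vocab, rng, word_prob=0.15, phoneme_prob=0.1):
    """Apply word-level, then phoneme-level, masking and replacement.

    Each character of a word is treated as one phoneme (an assumption).
    The probabilities are illustrative, not the values used in training.
    """
    masked = []
    for word in words:
        if rng.random() < word_prob:
            # Word-level masking: hide every phoneme of the word.
            masked.append([MASK] * len(word))
            continue
        phonemes = []
        for phoneme in word:
            r = rng.random()
            if r < phoneme_prob / 2:
                phonemes.append(MASK)               # phoneme-level masking
            elif r < phoneme_prob:
                phonemes.append(rng.choice(vocab))  # replacement with a random phoneme
            else:
                phonemes.append(phoneme)            # keep the original phoneme
        masked.append(phonemes)
    return masked


if __name__ == "__main__":
    rng = random.Random(0)
    words = tokenize("bon ˈdi.ə")
    print(mask_words(words, vocab=list("abdino"), rng=rng))
```

The output keeps one list of phoneme tokens per input word, each the same length as the original word, so word boundaries stay aligned with the unmasked input.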