Persian Phoneme-Level BERT (training)
This model is based on the Vaguye phonemizer for Persian.
Training was done on a single RTX A6000 GPU, using a dataset of 1.3 million sentences chunked and normalized from the Persian (Farsi) Wikipedia dataset.
```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.fa", split="train", streaming=True)
```
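With `streaming=True` the corpus is consumed lazily rather than downloaded in full. A minimal sketch of splitting each streamed article into sentence-level chunks (the `chunk_text` helper and the 400-character limit are illustrative assumptions, not the repo's actual preprocessing):

```python
from itertools import islice

def chunk_text(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars characters.
    Sentence splitting on '.' is a naive stand-in; the real pipeline
    handles Persian punctuation more carefully."""
    chunks, current = [], ""
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = (current + " " + sentence).strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# With the streamed dataset above, rows can be consumed lazily, e.g.:
# for row in islice(ds, 10):
#     chunks = chunk_text(row["text"])
```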
Training ran for 305K of a planned 2M steps, with losses settling around the following values:
Step [305000/2000000], Loss: 1.38343, Vocab Loss: 0.34593, Token Loss: 1.25434
The dataset and model are stored on the Hugging Face Hub at SadeghK/FaPLBERT.
For dataset preparation from Wikipedia, refer to wikipedia-dataset-styletts2-preparation. The preprocessing pipeline is:
wikipedia entry -> [normalize] -> [chunk] -> [hamnevise] -> [Zirneshane]
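The stages above can be sketched as a composed pipeline. Note that `normalize`, `chunk`, `hamnevise` (homograph handling), and `zirneshane` (diacritic/phoneme annotation) below are hypothetical stand-ins for illustration, not the actual functions from the repo:

```python
def normalize(text: str) -> str:
    # Stand-in: collapse whitespace and map Arabic 'ي'/'ك' to the
    # Persian forms 'ی'/'ک' (a common Farsi normalization step).
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    return " ".join(text.split())

def chunk(text: str) -> list[str]:
    # Stand-in: split into sentence-like chunks on '.'.
    return [s.strip() for s in text.split(".") if s.strip()]

def hamnevise(piece: str) -> str:
    # Stand-in for homograph (hamnevise) disambiguation; identity here.
    return piece

def zirneshane(piece: str) -> str:
    # Stand-in for diacritization (zirneshane) / phoneme annotation; identity here.
    return piece

def preprocess(entry: str) -> list[str]:
    # wikipedia entry -> normalize -> chunk -> hamnevise -> zirneshane
    return [zirneshane(hamnevise(c)) for c in chunk(normalize(entry))]
```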
For further details of preprocessing and training, refer to preprocess_fa.ipynb and train_fa.ipynb in SadeghKrmi/PL-BERT.