Persian Phoneme-Level BERT (training)
This model is based on the Vaguye phonemizer for Persian.
Training was done on a single RTX A6000 GPU, using a dataset of 1.3 million sentences chunked and normalized from the Persian (Farsi) Wikipedia dataset.
```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.fa", split="train", streaming=True)
```
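With `streaming=True` the corpus is consumed lazily rather than downloaded in full. A minimal sketch of splitting each streamed article into sentence-level chunks (the `chunk_text` helper and the 400-character limit are illustrative assumptions, not the repo's actual preprocessing):

```python
from itertools import islice

def chunk_text(text, max_chars=400):
    """Group whole sentences into chunks of at most max_chars characters.
    Sentence splitting on '.' is a naive stand-in; the real pipeline
    handles Persian punctuation more carefully."""
    chunks, current = [], ""
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = (current + " " + sentence).strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# With the streamed dataset above, rows can be consumed lazily, e.g.:
# for row in islice(ds, 10):
#     chunks = chunk_text(row["text"])
```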
Training ran for 305K of a planned 2M steps, with losses settling around the following values:
Step [305000/2000000], Loss: 1.38343, Vocab Loss: 0.34593, Token Loss: 1.25434
The dataset and model are stored on the Hugging Face Hub at SadeghK/FaPLBERT.
For dataset preparation from Wikipedia, refer to wikipedia-dataset-styletts2-preparation. The preprocessing pipeline is:
wikipedia entry -> [normalize] -> [chunk] -> [hamnevise] -> [Zirneshane]
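The stages above can be sketched as a composed pipeline. Note that `normalize`, `chunk`, `hamnevise` (homograph handling), and `zirneshane` (diacritic/phoneme annotation) below are hypothetical stand-ins for illustration, not the actual functions from the repo:

```python
def normalize(text: str) -> str:
    # Stand-in: collapse whitespace and map Arabic 'ي'/'ك' to the
    # Persian forms 'ی'/'ک' (a common Farsi normalization step).
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    return " ".join(text.split())

def chunk(text: str) -> list[str]:
    # Stand-in: split into sentence-like chunks on '.'.
    return [s.strip() for s in text.split(".") if s.strip()]

def hamnevise(piece: str) -> str:
    # Stand-in for homograph (hamnevise) disambiguation; identity here.
    return piece

def zirneshane(piece: str) -> str:
    # Stand-in for diacritization (zirneshane) / phoneme annotation; identity here.
    return piece

def preprocess(entry: str) -> list[str]:
    # wikipedia entry -> normalize -> chunk -> hamnevise -> zirneshane
    return [zirneshane(hamnevise(c)) for c in chunk(normalize(entry))]
```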
For further details of preprocessing and training, refer to preprocess_fa.ipynb and train_fa.ipynb in SadeghKrmi/PL-BERT.