---
|
|
license: apache-2.0
|
|
language:
- fa
|
|
---
|
|
|
|
|
## Persian Phoneme-Level BERT (training)
|
|
> This model is based on the [Vaguye](https://github.com/SadeghKrmi/vaguye) phonemizer for Persian.
|
|
|
|
|
A single RTX A6000 GPU was used for training, on a dataset of 1.3 million sentences chunked and normalized from the Farsi Wikipedia dump.
|
|
```python
from datasets import load_dataset

# Stream the Farsi Wikipedia dump instead of downloading it in full.
ds = load_dataset("wikimedia/wikipedia", "20231101.fa", split="train", streaming=True)
```
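Before phonemization, each Wikipedia entry is normalized and chunked. A minimal sketch of sentence-based chunking, assuming a simple character-length cap; the `chunk_sentences` helper, its splitting rules, and the `max_chars` limit are illustrative only, not the actual preprocessing used for training:

```python
import re

def chunk_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into chunks no longer than max_chars.

    Hypothetical helper: the real normalization and chunking rules are
    defined in the preprocessing pipeline, not reproduced here.
    """
    # Split on Latin and Persian sentence terminators, dropping empty parts.
    sentences = [s.strip() for s in re.split(r"[.!?\u061F]", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = sent  # start a new chunk with the overflowing sentence
    if current:
        chunks.append(current)
    return chunks
```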
|
|
|
|
|
Training ran for 305K steps (of a planned 2M), with losses converging around the following values:
|
|
```text
Step [305000/2000000], Loss: 1.38343, Vocab Loss: 0.34593, Token Loss: 1.25434
```
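Log lines in this format can be parsed into structured records, e.g. for plotting training curves. A small sketch assuming exactly the log format shown above (`parse_log_line` is a hypothetical helper, not part of the training code):

```python
import re

# Matches the training log format: step counter plus three loss values.
LOG_RE = re.compile(
    r"Step \[(\d+)/(\d+)\], Loss: ([\d.]+), "
    r"Vocab Loss: ([\d.]+), Token Loss: ([\d.]+)"
)

def parse_log_line(line: str) -> dict:
    """Extract step numbers and losses from one training log line."""
    m = LOG_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    step, total, loss, vocab, token = m.groups()
    return {
        "step": int(step),
        "total": int(total),
        "loss": float(loss),
        "vocab_loss": float(vocab),
        "token_loss": float(token),
    }
```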
|
|
|
|
|
The dataset and model are stored on the Hugging Face Hub: `SadeghK/FaPLBERT`
|
|
|
|
|
|
|
|
For dataset preparation from Wikipedia, refer to `wikipedia-dataset-styletts2-preparation`:
|
|
```text
wikipedia entry -> [normalize] -> [chunk] -> [hamnevise] -> [Zirneshane]
```
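The pipeline above can be sketched as a simple composition of stages. The stage bodies here are placeholder stubs, since the actual normalization, chunking, hamnevise, and Zirneshane logic lives in the preparation repository; only the order of composition is illustrated:

```python
def normalize(text: str) -> str:
    # Stub: the real stage applies Persian text normalization.
    return text.strip()

def chunk(text: str) -> list[str]:
    # Stub: the real stage splits long entries into training-sized chunks.
    return [text]

def hamnevise(chunks: list[str]) -> list[str]:
    # Stub for the hamnevise stage (implementation not shown in this card).
    return chunks

def zirneshane(chunks: list[str]) -> list[str]:
    # Stub for the Zirneshane stage (implementation not shown in this card).
    return chunks

def prepare(entry: str) -> list[str]:
    """Run one Wikipedia entry through the four stages, in pipeline order."""
    return zirneshane(hamnevise(chunk(normalize(entry))))
```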
|
|
|
|
|
For further details on preprocessing and training, refer to `preprocess_fa.ipynb` and `train_fa.ipynb` in [SadeghKrmi/FaPLBERT](https://github.com/SadeghKrmi/FaPLBERT.git)