| A monolingual T5 model for Persian trained on OSCAR 21.09 (https://oscar-corpus.com/) corpus with self-supervised method. 35 Gig deduplicated version of Persian data was used for pre-training the model. | |
| It's similar to the English T5 model but just for Persian. You may need to fine-tune it on your specific task. | |
| Example code: | |
| ``` | |
| from transformers import T5ForConditionalGeneration,AutoTokenizer | |
| import torch | |
| model_name = "Ahmad/parsT5-base" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = T5ForConditionalGeneration.from_pretrained(model_name) | |
| input_ids = tokenizer.encode('دانش آموزان به <extra_id_0> میروند و <extra_id_1> میخوانند.', return_tensors='pt') | |
| with torch.no_grad(): | |
| hypotheses = model.generate(input_ids) | |
| for h in hypotheses: | |
| print(tokenizer.decode(h)) | |
| ``` | |
| Steps: 725000 | |
| Accuracy: 0.66 | |
| Training More? | |
| ======== | |
| To train the model further please refer to its github repository at: | |
| https://github.com/puraminy/parsT5 | |