    example_title: "Formal 1"
  - text: "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
    example_title: "Formal 2"
---

# FaBERT: Pre-training BERT on Persian Blogs

## Model Details

FaBERT is a Persian BERT-base model pre-trained on the diverse HmBlogs corpus, which encompasses both casual and formal Persian texts. In evaluations across various Natural Language Understanding (NLU) tasks, FaBERT consistently demonstrates notable improvements while keeping a compact model size. The model is available on Hugging Face, so integrating it into your projects is hassle-free.

## Features

- Pre-trained on the diverse HmBlogs corpus, consisting of more than 50 GB of text from Persian blogs
- Remarkable performance across various downstream NLP tasks
- BERT architecture with 124 million parameters

## Useful Links

- **Repository:** [FaBERT on GitHub](https://github.com/SBU-NLP-LAB/FaBERT)
- **Paper:** [arXiv preprint](https://arxiv.org/abs/2402.06617)

## Usage

### Loading the Model with the MLM Head

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Make sure to use the default fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")
model = AutoModelForMaskedLM.from_pretrained("sbunlp/fabert")
```
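
To sanity-check the MLM head, you can run the model through the standard `transformers` fill-mask pipeline. This is generic pipeline usage, not a FaBERT-specific API; the sentence is the second widget example from this card.

```python
from transformers import pipeline

# Standard fill-mask pipeline; "sbunlp/fabert" is this model's Hub id
fill_mask = pipeline("fill-mask", model="sbunlp/fabert")

# Widget example from this card, roughly: "In the initial ruling the company
# was permitted to operate, but after re-review its license was [MASK]."
for prediction in fill_mask(
    "در حکم اولیه این شرکت مجاز به فعالیت شد ولی پس از بررسی مجدد، مجوز این شرکت [MASK] شد."
):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction contains the filled-in token and its score; `[MASK]` matches the tokenizer's default mask token.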

### Downstream Tasks

Like the original English BERT, FaBERT can be [fine-tuned](https://huggingface.co/docs/transformers/en/training) on many downstream tasks; a minimal sketch follows at the end of this section.

Examples on Persian datasets are available in our [GitHub repository](#useful-links).

**Make sure to use the default fast tokenizer.**
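
As a rough illustration, here is a minimal fine-tuning sketch for a binary text-classification task using the standard `Trainer` API. The toy dataset, label count, and hyperparameters below are illustrative placeholders of ours, not values from the paper or repository; substitute a real Persian dataset in practice.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# FaBERT body plus a freshly initialized classification head (2 labels as an example)
tokenizer = AutoTokenizer.from_pretrained("sbunlp/fabert")  # default fast tokenizer
model = AutoModelForSequenceClassification.from_pretrained("sbunlp/fabert", num_labels=2)

# Toy in-memory dataset so the sketch runs end to end;
# replace with a real Persian dataset (e.g. via `datasets.load_dataset`)
train_dataset = Dataset.from_dict(
    {"text": ["نمونه متن اول", "نمونه متن دوم"], "label": [1, 0]}
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_dataset = train_dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fabert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

# With a tokenizer supplied, Trainer pads each batch dynamically
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```

The same pattern applies to the other downstream tasks evaluated below, swapping the head (e.g. `AutoModelForQuestionAnswering`, `AutoModelForTokenClassification`) to match the task.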

## Training Details

FaBERT was pre-trained with the masked language modeling (MLM) objective using whole-word masking (WWM), and the resulting perplexity on the validation set was 7.76.
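
For context, perplexity here is just the exponential of the average masked-token cross-entropy loss, so 7.76 corresponds to a validation loss of about 2.05:

```python
import math

# perplexity = exp(loss)  =>  loss = ln(perplexity)
val_loss = math.log(7.76)
print(f"validation loss ≈ {val_loss:.2f}")  # ≈ 2.05
```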

| Hyperparameter | Value |
|-------------------|:--------------:|
| Batch Size | 32 |
| Optimizer | Adam |
| Learning Rate | 6e-5 |
| Weight Decay | 0.01 |
| Total Steps | 18 Million |
| Warmup Steps | 1.8 Million |
| Precision Format | TF32 |

## Evaluation

Here are some key performance results for the FaBERT model:

**Sentiment Analysis**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| MirasOpinion | **87.51** | 86.73 | 84.92 |
| MirasIrony | 74.82 | 71.08 | **75.51** |
| DeepSentiPers | **79.85** | 74.94 | 79.00 |

**Named Entity Recognition**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| PEYMA | **91.39** | 91.24 | 90.91 |
| ParsTwiner | **82.22** | 81.13 | 79.50 |
| MultiCoNER v2 | 57.92 | **58.09** | 51.47 |

**Question Answering**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| ParsiNLU | **55.87** | 44.89 | 42.55 |
| PQuAD | 87.34 | 86.89 | **87.60** |
| PCoQA | **53.51** | 50.96 | 51.12 |

**Natural Language Inference & QQP**

| Task | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| FarsTail | **84.45** | 82.52 | 83.50 |
| SBU-NLI | **66.65** | 58.41 | 58.85 |
| ParsiNLU QQP | **82.62** | 77.60 | 79.74 |

**Number of Parameters**

| | FaBERT | ParsBERT | XLM-R |
|:-------------|:------:|:--------:|:-----:|
| Parameter Count (M) | 124 | 162 | 278 |
| Vocabulary Size (K) | 50 | 100 | 250 |

For a more detailed performance analysis, refer to the paper.

## How to Cite

If you use FaBERT in your research or projects, please cite it using the following BibTeX:

```bibtex
@article{masumi2024fabert,
  title={FaBERT: Pre-training BERT on Persian Blogs},
  author={Masumi, Mostafa and Majd, Seyed Soroush and Shamsfard, Mehrnoush and Beigy, Hamid},
  journal={arXiv preprint arXiv:2402.06617},
  year={2024}
}
```