---
language: fa
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
  - roberta
  - masked-lm
  - persian
  - farsi
  - ner
  - relation-extraction
model-index:
  - name: persian_roberta_opt_tokenizer
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition (NER)
        dataset:
          name: ARMAN + PEYMA (merged)
          type: ner
          config: fa
        metrics:
          - type: precision
            value: 93.4
          - type: recall
            value: 94.8
          - type: f1
            value: 94.08
      - task:
          type: relation-classification
          name: Relation Extraction
        dataset:
          name: PERLEX
          type: relation-extraction
          config: fa
        metrics:
          - type: f1
            value: 90.0
---

# persian_roberta_opt_tokenizer

A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi).
We trained a Persian BPE tokenizer on a mixed corpus combining formal text with social-media and chat data.
The model is pretrained with this tokenizer, optimized for Persian script, and evaluated on two downstream tasks:

- **NER** on a **merged ARMAN + PEYMA** corpus  
- **Relation Extraction** on **PERLEX**

Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.

---

## 1) Model Description

- **Architecture:** RoBERTa-style Transformer for Masked LM  
- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning  
- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)  
- **Max sequence length:** 256

> Hub repository: `selfms/persian_roberta_opt_tokenizer`.

---

## 2) Architecture and Training Setup

**Backbone (example config):**
- hidden size: 256  
- layers: 6  
- attention heads: 4  
- intermediate size: 1024  
- activation: GELU  
- dropout: 0.1  
- positional embeddings: 514

> Adjust the numbers above to match your final `config.json` if they differ (see also the Model Config Summary at the end of this card). All baselines used **the same parameter budget**.

**Pretraining objective:** Masked Language Modeling

**Fine-tuning hyperparameters (shared across all compared models):**
```text
epochs = 3
batch_size = 8
learning_rate = 3e-5
weight_decay = 0.01
max_tokens = 128
optimizer = AdamW
scheduler = linear with warmup (recommended 10% warmup)
seed = 42
```
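
As a point of reference, here is a minimal sketch of the optimizer/scheduler setup these hyperparameters imply. The `num_training_steps` value is an assumption for illustration; compute it from your dataset size and batch size.

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")

num_training_steps = 3 * 1000  # illustrative: epochs * steps_per_epoch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # the recommended 10% warmup
    num_training_steps=num_training_steps,
)
```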

---

## 3) Data and Tasks

### NER
- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set (as used in the evaluation notes below)
- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, and label alignment with subword tokens (a minimal alignment sketch follows below)
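
A minimal alignment sketch using the fast tokenizer's `word_ids()`; the example sentence ("Tehran is the capital of Iran") and its labels are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")

# Hypothetical pre-tokenized sentence with BIO labels (entity types illustrative).
words  = ["تهران", "پایتخت", "ایران", "است"]
labels = ["B-LOC", "O", "B-LOC", "O"]
label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}

enc = tok(words, is_split_into_words=True, truncation=True, max_length=128)
aligned, prev_word = [], None
for word_id in enc.word_ids():
    if word_id is None:            # special tokens (<s>, </s>)
        aligned.append(-100)       # -100 is ignored by the loss
    elif word_id != prev_word:     # first subword of a word keeps its label
        aligned.append(label2id[labels[word_id]])
    else:                          # continuation subwords are ignored too
        aligned.append(-100)
    prev_word = word_id
print(aligned)
```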

### Relation Extraction
- **Dataset:** **PERLEX** (Persian Relation Extraction)
- **Entity marking:** special entity markers in the text (recommended) or span pooling; a simple `[CLS]`-pooling baseline is sketched in the code example below
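
A minimal sketch of that `[CLS]`-pooling baseline; the `<e1>`/`<e2>` marker strings and `NUM_RELATIONS` are illustrative assumptions, not the exact training code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
NUM_RELATIONS = 8  # hypothetical: set to the number of PERLEX relation types

tok = AutoTokenizer.from_pretrained(path)
# Loads the MLM backbone with a fresh classification head; RoBERTa's head
# pools the first (<s>/[CLS]) token, which is what "[CLS] pooling" means here.
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=NUM_RELATIONS)

# Entity spans marked inline: "<e1> Tehran </e1> capital of <e2> Iran </e2> is"
text = "<e1> تهران </e1> پایتخت <e2> ایران </e2> است"
enc = tok(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.argmax(-1).item())  # relation id; meaningless until fine-tuned
```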

---

## 4) Quantitative Results

### 4.1 NER (ARMAN + PEYMA, merged)

|                     Model | Precision | Recall | F1-Score |
|--------------------------:|----------:|-------:|---------:|
| **Proposed (this model)** | **93.4**  | **94.8** | **94.08** |
|            TooKaBERT-base | 94.9      | 96.2   | 95.5     |
|                    FABERT | 94.1      | 95.3   | 94.7     |

### 4.2 Relation Extraction (PERLEX)

|                     Model | F1-score (%) |
|--------------------------:|-------------:|
| **Proposed (this model)** | **90**       |
|            TooKaBERT-base | 91           |
|                    FABERT | 88           |

> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.

---

## 5) Usage

### 5.1 Fill-Mask Inference (simple)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

path = "selfms/persian_roberta_opt_tokenizer"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
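# The prompt is informal Persian; roughly: "Hi, does anyone have a precise
# analysis of this <mask>? When is it going to move?"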
print(fill(" سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
```

### 5.2 Text-Embedding Inference (simple)

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()

def embed(text):
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state
        a = x["attention_mask"].unsqueeze(-1)
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # 1D vector

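# The sample sentence reads, roughly: "Persian text gets turned into a 768-dimensional vector."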
text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))
```
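
Because the returned vectors are L2-normalized, cosine similarity between two embeddings reduces to a plain dot product, e.g. `(embed(a) @ embed(b)).item()`.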


### 5.3 Tokenizer Inference (simple)

```python
from transformers import AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

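# The sample sentence reads, roughly: "Semantic preprocessing over various
# news and social-media datasets was used for the tokenizer."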
text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"

enc = tok(text, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])

print("Tokens:", tokens)
print("IDs   :", enc["input_ids"][0].tolist())

```

---

## 6) Comparison with Other Models

Under identical parameter budgets and training settings:

- **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive (94.08), close to FABERT (94.7) but slightly lower on F1.
- **Relation Extraction (PERLEX):** Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).

These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.

---

## 7) Limitations, Bias, and Ethical Considerations

- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.  
- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.  
- **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.  
- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.

---

## 8) How to Reproduce

1) Pretrain or load the MLM checkpoint:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
```

2) Fine-tune for NER/RE with the shared hyperparameters:
```
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
```
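
A hedged `Trainer`-based skeleton for the NER fine-tune; `NUM_LABELS` is a placeholder, and `train_ds`/`eval_ds` stand in for tokenized ARMAN+PEYMA splits you prepare yourself:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

NUM_LABELS = 7  # hypothetical: size of your unified BIO tag set

model = AutoModelForTokenClassification.from_pretrained(
    "selfms/persian_roberta_opt_tokenizer", num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="ner-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,  # the recommended 10% warmup
    seed=42,
)
# Uncomment once `train_ds` / `eval_ds` are built:
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```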

3) Evaluate:
- NER: token-level Precision/Recall/F1 (micro or macro; report your choice consistently)  
- RE: relation-level micro-F1 on PERLEX

---

## 9) Files in the Repository

- `config.json`  
- `model.safetensors` or `pytorch_model.bin`  
- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`  
- `vocab.json`, `merges.txt` (BPE)  
- `README.md`, `LICENSE`, `.gitattributes`

> Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box.
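
A quick sanity check that the mask token is wired up as expected:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
assert tok.mask_token == "<mask>", f"unexpected mask token: {tok.mask_token}"
print(tok.mask_token_id)
```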

---

## 10) Citation

If you use this model, please cite:

```bibtex
@misc{persian_roberta_opt_tokenizer_2025,
  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
  author       = {selfms},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}
```

---

## 11) License

Apache-2.0 (as declared in the metadata above). Please verify the dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.


---

## 12) Metrics & Evaluation Notes
- **NER:** entity-level micro-F1 under the **BIO** tagging scheme (a `seqeval` sketch follows after this list).  
- **Relation Extraction (RE):** micro-F1 at relation level.  
- **Sequence length:** the model supports inputs up to **512** tokens (RoBERTa reserves **514** position embeddings because position ids are offset by the padding index). Evaluations in this report used **256** for efficiency.  
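
For reference, entity-level micro-F1 under BIO can be computed with the `seqeval` library (`pip install seqeval`; the label sequences below are illustrative):

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold vs. predicted BIO sequences (entity types hypothetical).
y_true = [["B-LOC", "I-LOC", "O", "B-ORG"]]
y_pred = [["B-LOC", "I-LOC", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level micro-F1
print(classification_report(y_true, y_pred))  # per-type breakdown
```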


---

## 13) Model Config Summary
- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**).
- **Max positions:** 514 (effective input up to 512 tokens).
- **Dropout:** hidden 0.1, attention 0.1.
- **Vocab size:** 48,000 (BPE).
- **Special tokens:** `<s>=0`, `<pad>=1`, `</s>=2`, `<mask>` as mask token.
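
In `transformers` terms this corresponds roughly to the following `RobertaConfig`; the repo's shipped `config.json` is authoritative:

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=48_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,  # effective input up to 512 tokens
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    bos_token_id=0,   # <s>
    pad_token_id=1,   # <pad>
    eos_token_id=2,   # </s>
)
print(config)
```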