---
language: fa
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
  - roberta
  - masked-lm
  - persian
  - farsi
  - ner
  - relation-extraction
model-index:
  - name: persian_roberta_opt_tokenizer
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition (NER)
        dataset:
          name: ARMAN + PEYMA (merged)
          type: ner
          config: fa
        metrics:
          - type: precision
            value: 93.4
          - type: recall
            value: 94.8
          - type: f1
            value: 94.08
      - task:
          type: relation-classification
          name: Relation Extraction
        dataset:
          name: PERLEX
          type: relation-extraction
          config: fa
        metrics:
          - type: f1
            value: 90.0
---

# persian_roberta_opt_tokenizer

A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi).
We trained a Persian BPE tokenizer on a mixed corpus combining formal text with social-media and chat data.
The model is pretrained with this tokenizer, optimized for Persian script, and evaluated on two downstream tasks:

- **NER** on a **merged ARMAN + PEYMA** corpus  
- **Relation Extraction** on **PERLEX**

Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.

---

## 1) Model Description

- **Architecture:** RoBERTa-style Transformer for Masked LM  
- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning  
- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)  
- **Max sequence length:** 256

> Hub repository: `selfms/persian_roberta_opt_tokenizer`.

---

## 2) Architecture and Training Setup

**Backbone (example config):**
- hidden size: 256  
- layers: 6  
- attention heads: 4  
- intermediate size: 1024  
- activation: GELU  
- dropout: 0.1  
- positional embeddings: 514

> Adjust the numbers above to match your final `config.json` if they differ (see also the Model Config Summary at the end of this card). All baselines used **the same parameter budget**.

**Pretraining objective:** Masked Language Modeling

**Fine-tuning hyperparameters (shared across all compared models):**
```text
epochs = 3
batch_size = 8
learning_rate = 3e-5
weight_decay = 0.01
max_tokens = 128
optimizer = AdamW
scheduler = linear with warmup (recommended 10% warmup)
seed = 42
```
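
As a point of reference, here is a minimal sketch of the optimizer/scheduler setup these hyperparameters imply. The `num_training_steps` value is an assumption for illustration; compute it from your dataset size and batch size.

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")

num_training_steps = 3 * 1000  # illustrative: epochs * steps_per_epoch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # the recommended 10% warmup
    num_training_steps=num_training_steps,
)
```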

---

## 3) Data and Tasks

### NER
- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set (as used in the evaluation notes below)
- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, and label alignment with subword tokens (a minimal alignment sketch follows below)
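
A minimal alignment sketch using the fast tokenizer's `word_ids()`; the example sentence ("Tehran is the capital of Iran") and its labels are illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")

# Hypothetical pre-tokenized sentence with BIO labels (entity types illustrative).
words  = ["تهران", "پایتخت", "ایران", "است"]
labels = ["B-LOC", "O", "B-LOC", "O"]
label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}

enc = tok(words, is_split_into_words=True, truncation=True, max_length=128)
aligned, prev_word = [], None
for word_id in enc.word_ids():
    if word_id is None:            # special tokens (<s>, </s>)
        aligned.append(-100)       # -100 is ignored by the loss
    elif word_id != prev_word:     # first subword of a word keeps its label
        aligned.append(label2id[labels[word_id]])
    else:                          # continuation subwords are ignored too
        aligned.append(-100)
    prev_word = word_id
print(aligned)
```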

### Relation Extraction
- **Dataset:** **PERLEX** (Persian Relation Extraction)
- **Entity marking:** special entity markers in the text (recommended) or span pooling; a simple `[CLS]`-pooling baseline is sketched in the code example below
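
A minimal sketch of that `[CLS]`-pooling baseline; the `<e1>`/`<e2>` marker strings and `NUM_RELATIONS` are illustrative assumptions, not the exact training code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
NUM_RELATIONS = 8  # hypothetical: set to the number of PERLEX relation types

tok = AutoTokenizer.from_pretrained(path)
# Loads the MLM backbone with a fresh classification head; RoBERTa's head
# pools the first (<s>/[CLS]) token, which is what "[CLS] pooling" means here.
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=NUM_RELATIONS)

# Entity spans marked inline: "<e1> Tehran </e1> capital of <e2> Iran </e2> is"
text = "<e1> تهران </e1> پایتخت <e2> ایران </e2> است"
enc = tok(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.argmax(-1).item())  # relation id; meaningless until fine-tuned
```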

---

## 4) Quantitative Results

### 4.1 NER (ARMAN + PEYMA, merged)

|                     Model | Precision | Recall | F1-Score |
|--------------------------:|----------:|-------:|---------:|
| **Proposed (this model)** | **93.4**  | **94.8** | **94.08** |
|            TooKaBERT-base | 94.9      | 96.2   | 95.5     |
|                    FABERT | 94.1      | 95.3   | 94.7     |

### 4.2 Relation Extraction (PERLEX)

|                     Model | F1-score (%) |
|--------------------------:|-------------:|
| **Proposed (this model)** | **90**       |
|            TooKaBERT-base | 91           |
|                    FABERT | 88           |

> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.

---

## 5) Usage

### 5.1 Fill-Mask Inference (simple)

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

path = "selfms/persian_roberta_opt_tokenizer"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
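# The prompt is informal Persian; roughly: "Hi, does anyone have a precise
# analysis of this <mask>? When is it going to move?"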
print(fill(" سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
```

### 5.2 Text-Embedding Inference (simple)

```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()

def embed(text):
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state
        a = x["attention_mask"].unsqueeze(-1)
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # 1D vector

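# The sample sentence reads, roughly: "Persian text gets turned into a 768-dimensional vector."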
text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))
```
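
Because the returned vectors are L2-normalized, cosine similarity between two embeddings reduces to a plain dot product, e.g. `(embed(a) @ embed(b)).item()`.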


### 5.3 Tokenizer Inference (simple)

```python
from transformers import AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

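# The sample sentence reads, roughly: "Semantic preprocessing over various
# news and social-media datasets was used for the tokenizer."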
text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"

enc = tok(text, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])

print("Tokens:", tokens)
print("IDs   :", enc["input_ids"][0].tolist())

```

---

## 6) Comparison with Other Models

Under identical parameter budgets and training settings:

- **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive (94.08), close to FABERT (94.7) but slightly lower on F1.
- **Relation Extraction (PERLEX):** Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).

These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.

---

## 7) Limitations, Bias, and Ethical Considerations

- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.  
- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.  
- **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.  
- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.

---

## 8) How to Reproduce

1) Pretrain or load the MLM checkpoint:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
```

2) Fine-tune for NER/RE with the shared hyperparameters:
```
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
```
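
A hedged `Trainer`-based skeleton for the NER fine-tune; `NUM_LABELS` is a placeholder, and `train_ds`/`eval_ds` stand in for tokenized ARMAN+PEYMA splits you prepare yourself:

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

NUM_LABELS = 7  # hypothetical: size of your unified BIO tag set

model = AutoModelForTokenClassification.from_pretrained(
    "selfms/persian_roberta_opt_tokenizer", num_labels=NUM_LABELS)

args = TrainingArguments(
    output_dir="ner-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,  # the recommended 10% warmup
    seed=42,
)
# Uncomment once `train_ds` / `eval_ds` are built:
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```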

3) Evaluate:
- NER: token-level Precision/Recall/F1 (micro or macro; report your choice consistently)  
- RE: relation-level micro-F1 on PERLEX

---

## 9) Files in the Repository

- `config.json`  
- `model.safetensors` or `pytorch_model.bin`  
- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`  
- `vocab.json`, `merges.txt` (BPE)  
- `README.md`, `LICENSE`, `.gitattributes`

> Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box.
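
A quick sanity check that the mask token is wired up as expected:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
assert tok.mask_token == "<mask>", f"unexpected mask token: {tok.mask_token}"
print(tok.mask_token_id)
```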

---

## 10) Citation

If you use this model, please cite:

```bibtex
@misc{persian_roberta_opt_tokenizer_2025,
  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
  author       = {selfms},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}
```

---

## 11) License

Apache-2.0 (as declared in the metadata above). Please verify the dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.


---

## 12) Metrics & Evaluation Notes
- **NER:** entity-level micro-F1 under the **BIO** tagging scheme (a `seqeval` sketch follows after this list).  
- **Relation Extraction (RE):** micro-F1 at relation level.  
- **Sequence length:** the model supports inputs up to **512** tokens (RoBERTa reserves **514** position embeddings because position ids are offset by the padding index). Evaluations in this report used **256** for efficiency.  
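
For reference, entity-level micro-F1 under BIO can be computed with the `seqeval` library (`pip install seqeval`; the label sequences below are illustrative):

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold vs. predicted BIO sequences (entity types hypothetical).
y_true = [["B-LOC", "I-LOC", "O", "B-ORG"]]
y_pred = [["B-LOC", "I-LOC", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level micro-F1
print(classification_report(y_true, y_pred))  # per-type breakdown
```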


---

## 13) Model Config Summary
- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**).
- **Max positions:** 514 (effective input up to 512 tokens).
- **Dropout:** hidden 0.1, attention 0.1.
- **Vocab size:** 48,000 (BPE).
- **Special tokens:** `<s>=0`, `<pad>=1`, `</s>=2`, `<mask>` as mask token.
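
In `transformers` terms this corresponds roughly to the following `RobertaConfig`; the repo's shipped `config.json` is authoritative:

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=48_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,  # effective input up to 512 tokens
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    bos_token_id=0,   # <s>
    pad_token_id=1,   # <pad>
    eos_token_id=2,   # </s>
)
print(config)
```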