|
|
--- |
|
|
language: fa |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
pipeline_tag: fill-mask |
|
|
tags: |
|
|
- roberta |
|
|
- masked-lm |
|
|
- persian |
|
|
- farsi |
|
|
- ner |
|
|
- relation-extraction |
|
|
model-index: |
|
|
- name: persian_roberta_opt_tokenizer |
|
|
results: |
|
|
- task: |
|
|
type: token-classification |
|
|
name: Named Entity Recognition (NER) |
|
|
dataset: |
|
|
name: ARMAN + PEYMA (merged) |
|
|
type: ner |
|
|
config: fa |
|
|
metrics: |
|
|
- type: precision |
|
|
value: 93.4 |
|
|
- type: recall |
|
|
value: 94.8 |
|
|
- type: f1 |
|
|
value: 94.08 |
|
|
- task: |
|
|
type: relation-classification |
|
|
name: Relation Extraction |
|
|
dataset: |
|
|
name: PERLEX |
|
|
type: relation-extraction |
|
|
config: fa |
|
|
metrics: |
|
|
- type: f1 |
|
|
value: 90.0 |
|
|
--- |
|
|
|
|
|
# persian_roberta_opt_tokenizer |
|
|
|
|
|
A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi). |
|
|
We trained a Persian BPE tokenizer on a mixed corpus combining formal text with social-media and chat data. |
|
|
The model is pretrained with this tokenizer, which is optimized for Persian script, and evaluated on two downstream tasks: |
|
|
|
|
|
- **NER** on a **merged ARMAN + PEYMA** corpus |
|
|
- **Relation Extraction** on **PERLEX** |
|
|
|
|
|
Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons. |
|
|
|
|
|
--- |
|
|
|
|
|
## 1) Model Description |
|
|
|
|
|
- **Architecture:** RoBERTa-style Transformer for Masked LM |
|
|
- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning |
|
|
- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation) |
|
|
- **Max sequence length:** up to 512 tokens; the examples in this card use 256 (see the Metrics & Evaluation Notes) |
|
|
|
|
|
> The repository on the Hub is `selfms/persian_roberta_opt_tokenizer`. |
|
|
|
|
|
--- |
|
|
|
|
|
## 2) Architecture and Training Setup |
|
|
|
|
|
**Backbone (example config):** |
|
|
- hidden size: 256 |
|
|
- layers: 6 |
|
|
- attention heads: 4 |
|
|
- intermediate size: 1024 |
|
|
- activation: GELU |
|
|
- dropout: 0.1 |
|
|
- positional embeddings: 514 |
|
|
|
|
|
> These are example values; the released configuration is listed in the Model Config Summary at the end of this card, and the repo's `config.json` is authoritative. All baselines used **the same parameter budget**. |
|
|
|
|
|
**Pretraining objective:** Masked Language Modeling |
|
|
|
|
|
**Fine-tuning hyperparameters (shared across all compared models):** |
|
|
```text
epochs        = 3
batch_size    = 8
learning_rate = 3e-5
weight_decay  = 0.01
max_tokens    = 128
optimizer     = AdamW
scheduler     = linear with warmup (recommended 10% warmup)
seed          = 42
```
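
For reference, a minimal sketch of the optimizer and scheduler named above; the step count here is a placeholder, not a value from our runs:

```python
import torch
from transformers import AutoModelForMaskedLM, get_linear_schedule_with_warmup

torch.manual_seed(42)  # fixed seed from the shared setup

model = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")

# Illustrative: in practice num_training_steps = len(dataloader) * epochs.
num_training_steps = 1000
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # recommended 10% warmup
    num_training_steps=num_training_steps,
)
```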
|
|
|
|
|
--- |
|
|
|
|
|
## 3) Data and Tasks |
|
|
|
|
|
### NER |
|
|
- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set (see the Metrics & Evaluation Notes) |
|
|
- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, and label alignment with wordpieces (see the sketch below) |
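
The label-alignment step can be done with the fast tokenizer's `word_ids()`; a minimal sketch, assuming a fast tokenizer and a hypothetical tagged sentence (`IGN` stands in for the `-100` ignore index used during training):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")

# Hypothetical pre-tokenized sentence ("Tehran is the capital of Iran") with BIO tags.
words  = ["تهران", "پایتخت", "ایران", "است"]
labels = ["B-LOC", "O", "B-LOC", "O"]

enc = tok(words, is_split_into_words=True, truncation=True, max_length=128)

aligned, prev = [], None
for wid in enc.word_ids():
    if wid is None:
        aligned.append("IGN")        # special tokens (<s>, </s>) -> -100 in training
    elif wid != prev:
        aligned.append(labels[wid])  # first subword carries the word's tag
    else:
        aligned.append("IGN")        # continuation subwords are ignored
    prev = wid

print(list(zip(tok.convert_ids_to_tokens(enc["input_ids"]), aligned)))
```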
|
|
|
|
|
### Relation Extraction |
|
|
- **Dataset:** **PERLEX** (Persian Relation Extraction) |
|
|
- **Entity marking:** special entity markers in the text (recommended) or span pooling; we used a simple first-token (`<s>`, i.e. [CLS]) pooling baseline, sketched below |
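
For illustration, a minimal sketch of that first-token pooling baseline; the marker tokens, `num_relations`, and the freshly initialized classification head are placeholders rather than the exact PERLEX setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
num_relations = 8  # placeholder; set to the PERLEX label count you use

tok = AutoTokenizer.from_pretrained(path)
# RoBERTa's sequence-classification head pools the first token
# (<s>, the RoBERTa analogue of [CLS]) internally.
mdl = AutoModelForSequenceClassification.from_pretrained(path, num_labels=num_relations)

# Hypothetical example with inline entity markers; in practice, register
# [E1]/[/E1]/[E2]/[/E2] as additional special tokens before fine-tuning.
text = "[E1] تهران [/E1] پایتخت [E2] ایران [/E2] است"
x = tok(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = mdl(**x).logits
print(logits.argmax(-1).item())  # predicted relation id (random until fine-tuned)
```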
|
|
|
|
|
--- |
|
|
|
|
|
## 4) Quantitative Results |
|
|
|
|
|
### 4.1 NER (ARMAN + PEYMA, merged) |
|
|
|
|
|
| Model | Precision | Recall | F1-Score | |
|
|
|--------------------------:|----------:|-------:|---------:| |
|
|
| **Proposed (this model)** | **93.4** | **94.8** | **94.08** | |
|
|
| TooKaBERT-base | 94.9 | 96.2 | 95.5 | |
|
|
| FABERT | 94.1 | 95.3 | 94.7 | |
|
|
|
|
|
### 4.2 Relation Extraction (PERLEX) |
|
|
|
|
|
| Model | F1-score (%) | |
|
|
|--------------------------:|-------------:| |
|
|
| **Proposed (this model)** | **90** | |
|
|
| TooKaBERT-base | 91 | |
|
|
| FABERT | 88 | |
|
|
|
|
|
> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects. |
|
|
|
|
|
--- |
|
|
|
|
|
## 5) Usage |
|
|
|
|
|
### 5.1 Fill-Mask Inference (simple) |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

path = "selfms/persian_roberta_opt_tokenizer"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
# "Hi, does anyone have a precise analysis of this <mask>? When is it going to move?"
print(fill(" سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
```
|
|
|
|
|
### 5.2 Text-Embedding Inference (simple) |
|
|
|
|
|
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()

def embed(text):
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state              # (1, seq_len, hidden)
        a = x["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)  # masked mean pooling
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # L2-normalized 1D vector

# "Persian text is converted to a 768-dimensional vector"
text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))  # hidden size (768 for the released config)
```
|
|
|
|
|
|
|
|
### 5.3 Tokenizer Inference (simple) |
|
|
|
|
|
```python
from transformers import AutoTokenizer

path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)

# "Semantic preprocessing over various news and social-media datasets was used for the tokenizer"
text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"

enc = tok(text, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])

print("Tokens:", tokens)
print("IDs   :", enc["input_ids"][0].tolist())
```
|
|
|
|
|
--- |
|
|
|
|
|
## 6) Comparison with Other Models |
|
|
|
|
|
Under identical parameter budgets and training settings: |
|
|
|
|
|
- **NER (ARMAN + PEYMA):** TooKaBERT-base achieves the highest F1 (95.5); our model is competitive (94.08), close to FABERT (94.7) but slightly below it. |
|
|
- **Relation Extraction (PERLEX):** Our model (F1 = 90) surpasses FABERT (88) and is slightly below TooKaBERT-base (91). |
|
|
|
|
|
These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone. |
|
|
|
|
|
--- |
|
|
|
|
|
## 7) Limitations, Bias, and Ethical Considerations |
|
|
|
|
|
- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon. |
|
|
- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality. |
|
|
- **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory. |
|
|
- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions. |
|
|
|
|
|
--- |
|
|
|
|
|
## 8) How to Reproduce |
|
|
|
|
|
1) Pretrain or load the MLM checkpoint: |
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
```
|
|
|
|
|
2) Fine-tune for NER/RE with the shared hyperparameters: |
|
|
```
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
```
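
Putting the shared hyperparameters together, a minimal `Trainer` sketch for the NER task; the toy one-sentence dataset and `num_labels` are placeholders, not the actual ARMAN + PEYMA pipeline:

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

path = "selfms/persian_roberta_opt_tokenizer"
num_labels = 7  # placeholder; size of your BIO tag set

tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModelForTokenClassification.from_pretrained(path, num_labels=num_labels)

# Toy stand-in for the merged ARMAN + PEYMA data: one sentence, all tags "O" (id 0).
enc = dict(tok(["تهران پایتخت ایران است"], truncation=True, max_length=128))
enc["labels"] = [[0] * len(ids) for ids in enc["input_ids"]]
train_ds = Dataset.from_dict(enc)

args = TrainingArguments(
    output_dir="ner-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,  # recommended 10% warmup
    seed=42,
    report_to="none",
)
trainer = Trainer(model=mdl, args=args, train_dataset=train_ds)
trainer.train()
```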
|
|
|
|
|
3) Evaluate: |
|
|
- NER: entity-level micro Precision/Recall/F1 under the **BIO** scheme (see Metrics & Evaluation Notes; a `seqeval` sketch follows below) |
|
|
- RE: relation-level micro-F1 on PERLEX |
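
For NER, the entity-level micro scores can be computed with `seqeval`; a minimal sketch on hypothetical BIO sequences:

```python
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score

# Hypothetical gold and predicted BIO sequences for two sentences.
y_true = [["B-LOC", "O", "B-LOC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["B-LOC", "O", "B-LOC", "O"], ["B-PER", "O", "O"]]

print("P :", precision_score(y_true, y_pred))  # entity-level micro precision
print("R :", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```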
|
|
|
|
|
--- |
|
|
|
|
|
## 9) Files in the Repository |
|
|
|
|
|
- `config.json` |
|
|
- `model.safetensors` or `pytorch_model.bin` |
|
|
- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json` |
|
|
- `vocab.json`, `merges.txt` (BPE) |
|
|
- `README.md`, `LICENSE`, `.gitattributes` |
|
|
|
|
|
> Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box. |
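
A quick sanity check of the widget requirement (assumes the files above are present):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
assert tok.mask_token == "<mask>", tok.mask_token  # the fill-mask widget relies on this
print(tok.mask_token_id)
```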
|
|
|
|
|
--- |
|
|
|
|
|
## 10) Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{persian_roberta_opt_tokenizer_2025,
  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
  author       = {selfms},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}
```
|
|
|
|
|
--- |
|
|
|
|
|
## 11) License |
|
|
|
|
|
Apache-2.0 (as declared in the metadata above). Please verify dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution. |
|
|
|
|
|
|
|
|
---

## 12) Metrics & Evaluation Notes |
|
|
- **NER:** entity-level micro-F1 under the **BIO** tagging scheme. |
|
|
- **Relation Extraction (RE):** micro-F1 at relation level. |
|
|
- **Sequence length:** the model supports up to **512** tokens (514 positions including special tokens). Fine-tuning used `max_tokens=128`; the inference examples in this card truncate at **256** for efficiency. |
|
|
|
|
|
|
|
|
---

## 13) Model Config Summary |
|
|
- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**). |
|
|
- **Max positions:** 514 (effective input up to 512 tokens). |
|
|
- **Dropout:** hidden 0.1, attention 0.1. |
|
|
- **Vocab size:** 48,000 (BPE). |
|
|
- **Special tokens:** `<s>=0`, `<pad>=1`, `</s>=2`, `<mask>` as mask token. |
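
For reference, a `RobertaConfig` mirroring this summary (a sketch; the repo's `config.json` is authoritative):

```python
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=48000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,  # effective input up to 512 tokens
    bos_token_id=0,   # <s>
    pad_token_id=1,   # <pad>
    eos_token_id=2,   # </s>
)
```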
|
|
|