---
language: fa
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- roberta
- masked-lm
- persian
- farsi
- ner
- relation-extraction
model-index:
- name: persian_roberta_opt_tokenizer
results:
- task:
type: token-classification
name: Named Entity Recognition (NER)
dataset:
name: ARMAN + PEYMA (merged)
type: ner
config: fa
metrics:
- type: precision
value: 93.4
- type: recall
value: 94.8
- type: f1
value: 94.08
- task:
type: relation-classification
name: Relation Extraction
dataset:
name: PERLEX
type: relation-extraction
config: fa
metrics:
- type: f1
value: 90.0
---
# persian_roberta_opt_tokenizer
A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi).
We trained a Persian BPE tokenizer on a mixed corpus combining formal text with social-media and chat data.
The model is pre-trained with this tokenizer, which is optimized for Persian script, and evaluated on two downstream tasks:
- **NER** on a **merged ARMAN + PEYMA** corpus
- **Relation Extraction** on **PERLEX**
Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.
---
## 1) Model Description
- **Architecture:** RoBERTa-style Transformer for Masked LM
- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning
- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)
- **Max sequence length:** up to 512 tokens (evaluations in this card used 256)
> The repository on the Hub is `selfms/persian_roberta_opt_tokenizer`.
---
## 2) Architecture and Training Setup
**Backbone (example config):**
- hidden size: 256
- layers: 6
- attention heads: 4
- intermediate size: 1024
- activation: GELU
- dropout: 0.1
- positional embeddings: 514
> The figures above are an example configuration; the released checkpoint's values are listed in the **Model Config Summary** at the end of this card and in `config.json`. All baselines used **the same parameter budget**.
**Pretraining objective:** Masked Language Modeling
**Fine-tuning hyperparameters (shared across all compared models):**
```text
epochs = 3
batch_size = 8
learning_rate = 3e-5
weight_decay = 0.01
max_tokens = 128
optimizer = AdamW
scheduler = linear with warmup (recommended 10% warmup)
seed = 42
```
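As a minimal sketch, the shared settings above map onto `transformers.TrainingArguments` roughly as follows; the output directory name and the exact 10% `warmup_ratio` are illustrative assumptions rather than values taken from the original runs.

```python
from transformers import TrainingArguments

# Sketch of the shared fine-tuning setup; "persian-roberta-finetune" is a placeholder output dir.
training_args = TrainingArguments(
    output_dir="persian-roberta-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,   # "linear with warmup"; 10% warmup as recommended above
    seed=42,
)
```

The default optimizer in `Trainer` is AdamW, matching the table above.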
---
## 3) Data and Tasks
### NER
- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set
- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, and alignment of word-level labels to subword tokens (see the sketch below)
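The exact alignment code is not part of this card; below is a minimal sketch, assuming a fast tokenizer and BIO tags, of how word-level NER labels can be propagated to subword tokens, with continuation subwords and special tokens masked out via `-100`.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")

def align_labels(words, word_labels, label2id, max_length=128):
    """Map word-level BIO labels to subword tokens; -100 marks positions ignored by the loss."""
    enc = tok(words, is_split_into_words=True, truncation=True, max_length=max_length)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None:            # special tokens (<s>, </s>) and padding
            labels.append(-100)
        elif wid != prev:          # first subword of a word keeps its label
            labels.append(label2id[word_labels[wid]])
        else:                      # later subwords are ignored by the loss
            labels.append(-100)
        prev = wid
    enc["labels"] = labels
    return enc
```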
### Relation Extraction
- **Dataset:** **PERLEX** (Persian Relation Extraction)
- **Entity marking:** special entity markers in the text (recommended) or span pooling; the reported baseline uses simple [CLS]-style pooling of the first token (`<s>` in RoBERTa), sketched below
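A minimal sketch of that first-token pooling baseline for relation classification, assuming the sentence carries entity markers; the marker strings, the example sentence, and `num_relations` are illustrative assumptions, and the classification head is untrained until fine-tuned on PERLEX.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

path = "selfms/persian_roberta_opt_tokenizer"
num_relations = 8  # placeholder: set to the number of PERLEX relation labels you use

tok = AutoTokenizer.from_pretrained(path)
# The sequence-classification head pools the first token (<s>), which matches
# the simple [CLS]-style baseline described above. The head is freshly initialized.
model = AutoModelForSequenceClassification.from_pretrained(path, num_labels=num_relations)

# Illustrative entity markers; register them so they stay single tokens.
tok.add_special_tokens({"additional_special_tokens": ["<e1>", "</e1>", "<e2>", "</e2>"]})
model.resize_token_embeddings(len(tok))

text = "<e1>تهران</e1> پایتخت <e2>ایران</e2> است"  # example sentence with markers
enc = tok(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.argmax(dim=-1).item())  # relation id (meaningless before fine-tuning)
```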
---
## 4) Quantitative Results
### 4.1 NER (ARMAN + PEYMA, merged)
| Model | Precision | Recall | F1-Score |
|--------------------------:|----------:|-------:|---------:|
| **Proposed (this model)** | **93.4** | **94.8** | **94.08** |
| TooKaBERT-base | 94.9 | 96.2 | 95.5 |
| FABERT | 94.1 | 95.3 | 94.7 |
### 4.2 Relation Extraction (PERLEX)
| Model | F1-score (%) |
|--------------------------:|-------------:|
| **Proposed (this model)** | **90** |
| TooKaBERT-base | 91 |
| FABERT | 88 |
> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.
---
## 5) Usage
### 5.1 Fill-Mask Inference (simple)
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
path = "selfms/persian_roberta_opt_tokenizer"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForMaskedLM.from_pretrained(path)
model.eval()
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
print(fill(" سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
```
### 5.2 Text-Embedding Inference (simple)
```python
import torch
from transformers import AutoTokenizer, AutoModel
path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
mdl = AutoModel.from_pretrained(path).eval()
def embed(text):
    # Mean-pool the last hidden states over non-padding tokens, then L2-normalize.
    with torch.no_grad():
        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
        h = mdl(**x).last_hidden_state                 # (1, seq_len, hidden)
        a = x["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
        v = (h * a).sum(1) / a.sum(1).clamp(min=1)     # masked mean pooling
        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # 1D unit vector
text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
vec = embed(text)
print(len(vec))
```
### 5.3 Tokenizer Inference (simple)
```python
from transformers import AutoTokenizer
path = "selfms/persian_roberta_opt_tokenizer"
tok = AutoTokenizer.from_pretrained(path)
text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"
enc = tok(text, return_tensors="pt")
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
print("Tokens:", tokens)
print("IDs :", enc["input_ids"][0].tolist())
```
---
## 6) Comparison with Other Models
Under identical parameter budgets and training settings:
- **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive (94.08) and close to FABERT (94.7), though slightly lower on F1.
- **Relation Extraction (PERLEX):** Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).
These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.
---
## 7) Limitations, Bias, and Ethical Considerations
- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.
- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.
- **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.
- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.
---
## 8) How to Reproduce
1) Pretrain or load the MLM checkpoint:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
```
2) Fine-tune for NER/RE with the shared hyperparameters:
```
epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
```
3) Evaluate:
- NER: entity-level Precision/Recall/F1 under the BIO scheme (this card reports micro-averaged scores; see the sketch after this list)
- RE: relation-level micro-F1 on PERLEX
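The card does not ship evaluation code; below is a minimal sketch of entity-level precision/recall/F1 with `seqeval` on BIO tags, using toy sequences as stand-ins for real gold labels and model predictions.

```python
# pip install seqeval
from seqeval.metrics import precision_score, recall_score, f1_score

# Toy BIO sequences (stand-ins for real gold labels and model predictions).
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```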
---
## 9) Files in the Repository
- `config.json`
- `model.safetensors` or `pytorch_model.bin`
- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`
- `vocab.json`, `merges.txt` (BPE)
- `README.md`, `LICENSE`, `.gitattributes`
> Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box.
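A quick sanity check for that note (a sketch, not part of the repository files) is to load the tokenizer and confirm the mask token before relying on the widget:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
assert tok.mask_token == "<mask>", f"unexpected mask token: {tok.mask_token}"
print("mask token id:", tok.mask_token_id)
```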
---
## 10) Citation
If you use this model, please cite:
```bibtex
@misc{persian_roberta_opt_tokenizer_2025,
title = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
author = {selfms},
year = {2025},
howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
note = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
}
```
---
## 11) License
This model is released under the Apache-2.0 license. Please verify the dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.
## Metrics & Evaluation Notes
- **NER:** entity-level micro-F1 under the **BIO** tagging scheme.
- **Relation Extraction (RE):** micro-F1 at relation level.
- **Sequence length:** the model supports up to **512** tokens (its 514 position embeddings include a 2-position offset for the padding index). Evaluations in this report used **256** for efficiency.
## Model Config Summary
- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**); a matching config sketch follows this list.
- **Max positions:** 514 (effective input up to 512 tokens).
- **Dropout:** hidden 0.1, attention 0.1.
- **Vocab size:** 48,000 (BPE).
- **Special tokens:** `<s>=0`, `<pad>=1`, `</s>=2`, `<mask>` as mask token.
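For reference, a sketch of a `RobertaConfig` consistent with the summary above; values not listed in the summary (e.g. layer-norm epsilon, type-vocabulary size) fall back to library defaults and are assumptions.

```python
from transformers import RobertaConfig

# Config sketch matching the Model Config Summary above (other fields use defaults).
config = RobertaConfig(
    vocab_size=48000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,   # effective input up to 512 tokens
    bos_token_id=0,                # <s>
    pad_token_id=1,                # <pad>
    eos_token_id=2,                # </s>
)
print(config)
```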