Upload 6 files

Files changed (6)

README.md +274 -3
config.json +27 -0
model.safetensors +3 -0
special_tokens_map.json +44 -0
tokenizer.json +0 -0
tokenizer_config.json +65 -0

README.md CHANGED

@@ -1,3 +1,274 @@
----
-license: mit
----

+---
+language: fa
+license: apache-2.0
+library_name: transformers
+pipeline_tag: fill-mask
+tags:
+  - roberta
+  - masked-lm
+  - persian
+  - farsi
+  - ner
+  - relation-extraction
+model-index:
+  - name: persian_roberta_opt_tokenizer
+    results:
+      - task:
+          type: token-classification
+          name: Named Entity Recognition (NER)
+        dataset:
+          name: ARMAN + PEYMA (merged)
+          type: ner
+          config: fa
+        metrics:
+          - type: precision
+            value: 93.4
+          - type: recall
+            value: 94.8
+          - type: f1
+            value: 94.08
+      - task:
+          type: relation-classification
+          name: Relation Extraction
+        dataset:
+          name: PERLEX
+          type: relation-extraction
+          config: fa
+        metrics:
+          - type: f1
+            value: 90.0
+---
+# persian_roberta_opt_tokenizer
+A RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi).
+The model is pre-trained with a BPE tokenizer optimized for Persian script; the tokenizer was trained on a mixed corpus that combines formal text with social-media and chat data.
+The model is evaluated on two downstream tasks:
+- **NER** on a **merged ARMAN + PEYMA** corpus
+- **Relation Extraction** on **PERLEX**
+Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.
+---
+## 1) Model Description
+- **Architecture:** RoBERTa-style Transformer for Masked LM
+- **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning
+- **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)
+- **Max sequence length:** 256
+> Hub repository: `selfms/persian_roberta_opt_tokenizer`.
+---
+## 2) Architecture and Training Setup
+**Backbone (from `config.json`):**
+- hidden size: 768
+- layers: 12
+- attention heads: 12
+- intermediate size: 3072
+- activation: GELU
+- dropout: 0.1
+- positional embeddings: 514
+> All baselines used **the same parameter budget**.
+**Pretraining objective:** Masked Language Modeling
+**Fine-tuning hyperparameters (shared across all compared models):**
+```text
+epochs = 3
+batch_size = 8
+learning_rate = 3e-5
+weight_decay = 0.01
+max_tokens = 128
+optimizer = AdamW
+scheduler = linear with warmup (recommended 10% warmup)
+seed = 42
+```
+---
+## 3) Data and Tasks
+### NER
+- **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set
+- **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, label alignment with BPE subword tokens
+### Relation Extraction
+- **Dataset:** **PERLEX** (Persian Relation Extraction)
+- **Entity marking:** special entity markers in the text (recommended) or span pooling; we used a simple [CLS]-style pooling baseline (the `<s>` token in RoBERTa)
+---
+## 4) Quantitative Results
+### 4.1 NER (ARMAN + PEYMA, merged)
+|                     Model | Precision | Recall | F1-Score |
+|--------------------------:|----------:|-------:|---------:|
+| **Proposed (this model)** | **93.4**  | **94.8** | **94.08** |
+|            TooKaBERT-base | 94.9      | 96.2   | 95.5     |
+|                    FABERT | 94.1      | 95.3   | 94.7     |
+### 4.2 Relation Extraction (PERLEX)
+|                     Model | F1-score (%) |
+|--------------------------:|-------------:|
+| **Proposed (this model)** | **90**       |
+|            TooKaBERT-base | 91           |
+|                    FABERT | 88           |
+> All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.
+---
+## 5) Usage
+### 5.1 Fill-Mask Inference (simple)
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
+path = "selfms/persian_roberta_opt_tokenizer"
+tokenizer = AutoTokenizer.from_pretrained(path)
+model = AutoModelForMaskedLM.from_pretrained(path)
+model.eval()
+fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
+print(fill("فنفت سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
+```
+### 5.2 Text-Embedding Inference (simple)
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+path = "selfms/persian_roberta_opt_tokenizer"
+tok = AutoTokenizer.from_pretrained(path)
+mdl = AutoModel.from_pretrained(path).eval()
+def embed(text):
+    """Return an L2-normalized, mean-pooled sentence embedding (1D tensor)."""
+    with torch.no_grad():
+        x = tok(text, return_tensors="pt", truncation=True, max_length=256)
+        h = mdl(**x).last_hidden_state               # (1, seq_len, hidden)
+        a = x["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
+        v = (h * a).sum(1) / a.sum(1).clamp(min=1)   # masked mean over tokens
+        return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # 1D vector
+text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
+vec = embed(text)
+print(len(vec))
+```
+### 5.3 Tokenizer Inference (simple)
+```python
+from transformers import AutoTokenizer
+path = "selfms/persian_roberta_opt_tokenizer"
+tok = AutoTokenizer.from_pretrained(path)
+text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"
+enc = tok(text, return_tensors="pt")
+tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
+print("Tokens:", tokens)
+print("IDs   :", enc["input_ids"][0].tolist())
+```
+---
+## 6) Comparison with Other Models
+Under identical parameter budgets and training settings:
+- **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive (94.08), close to FABERT (94.7) but slightly lower.
+- **Relation Extraction (PERLEX):** Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).
+These results suggest the tokenizer and backbone choices here are strong for RE and competitive for NER.
+---
+## 7) Limitations, Bias, and Ethical Considerations
+- **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.
+- **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.
+- **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.
+- **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.
+---
+## 8) How to Reproduce
+1) Pretrain or load the MLM checkpoint:
+```python
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
+mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
+```
+2) Fine-tune for NER/RE with the shared hyperparameters:
+```text
+epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
+```
+3) Evaluate:
+- NER: entity-level micro-averaged Precision/Recall/F1 under the BIO scheme
+- RE: relation-level micro-F1 on PERLEX
+---
+## 9) Files in the Repository
+- `config.json`
+- `model.safetensors`
+- `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`
+- `vocab.json`, `merges.txt` (BPE)
+- `README.md`, `LICENSE`, `.gitattributes`
+> `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present, so the Hub widget should work out of the box.
+---
+## 10) Citation
+If you use this model, please cite:
+```bibtex
+@misc{persian_roberta_opt_tokenizer_2025,
+  title        = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
+  author       = {selfms},
+  year         = {2025},
+  howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
+  note         = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
+}
+```
+---
+## 11) License
+Apache-2.0. Please verify dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.
+## Metrics & Evaluation Notes
+- **NER:** entity-level micro-F1 under the **BIO** tagging scheme.
+- **Relation Extraction (RE):** micro-F1 at relation level.
+- **Sequence length:** the tokenizer accepts up to **512** tokens (the model has 514 position embeddings; RoBERTa's padding offset leaves 512 for real tokens). The fine-tuning evaluations in this report used `max_tokens=128` (Section 2) for efficiency.
+## Model Config Summary
+- **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**).
+- **Max positions:** 514 (effective input up to 512 tokens).
+- **Dropout:** hidden 0.1, attention 0.1.
+- **Vocab size:** 48,000 (BPE).
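+- **Special tokens:** `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` (mask token). `config.json` sets `bos_token_id=0`, `pad_token_id=1`, `eos_token_id=2`, whereas `tokenizer_config.json` maps `<pad>=0`, `<unk>=1`, `<s>=2`, `</s>=3`, `<mask>=4`; confirm which mapping the checkpoint was trained with before fine-tuning.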
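
The F1 values in the README's NER table (Section 4.1) can be cross-checked against the reported precision and recall, since F1 is the harmonic mean of the two. The sketch below is only a consistency check on the rounded table values, not the evaluation code used for the model; the helper name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the NER table in the README.
ner_rows = {
    "Proposed (this model)": (93.4, 94.8, 94.08),
    "TooKaBERT-base": (94.9, 96.2, 95.5),
    "FABERT": (94.1, 95.3, 94.7),
}

for name, (p, r, reported) in ner_rows.items():
    recomputed = f1_from_pr(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.2f}, reported = {reported}")
```

Small differences between the recomputed and reported values are expected, because the table's precision and recall are themselves rounded.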

config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "_name_or_path": "selfms/persian_roberta_opt_tokenizer",
+  "architectures": [
+    "RobertaForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.3",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 48000
+}
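
`max_position_embeddings` is 514 here, while the tokenizer's `model_max_length` is 512. In the Hugging Face RoBERTa implementation, position IDs for real tokens start at `padding_idx + 1`, so with `pad_token_id = 1` the first two rows of the position table are never assigned to real tokens. The pure-Python sketch below mimics that numbering to show where 512 comes from; `roberta_position_ids` is an illustrative helper, not a library function.

```python
from typing import List


def roberta_position_ids(input_ids: List[int], padding_idx: int) -> List[int]:
    """Number non-pad tokens from padding_idx + 1; pads keep padding_idx."""
    position_ids = []
    seen = 0  # number of non-pad tokens encountered so far
    for token_id in input_ids:
        if token_id == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(padding_idx + seen)
    return position_ids


PAD_TOKEN_ID = 1               # "pad_token_id" in config.json
MAX_POSITION_EMBEDDINGS = 514  # "max_position_embeddings" in config.json

# Longest sequence whose last position ID still fits in the embedding table.
max_tokens = MAX_POSITION_EMBEDDINGS - PAD_TOKEN_ID - 1

# Toy IDs: three real tokens followed by two pads.
print(roberta_position_ids([0, 57, 2, 1, 1], PAD_TOKEN_ID))  # [2, 3, 4, 1, 1]
print(max_tokens)  # 512
```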

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:926f4020bf655ad46c98c6d49fa3014d783aa964d42ef51391589c841cb985c3
+size 491846808
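
The size in the LFS pointer can be sanity-checked against `config.json`. The sketch below counts parameters for a standard RoBERTa encoder with a masked-LM head, assuming the usual layout (biases on all linear layers, LayerNorm with weight and bias, output decoder weights tied to the input embeddings) and float32 storage. These layout details are assumptions about the checkpoint, not facts read from it, and `roberta_mlm_param_count` is an illustrative helper.

```python
def roberta_mlm_param_count(vocab: int, hidden: int, layers: int,
                            intermediate: int, max_pos: int,
                            type_vocab: int) -> int:
    """Approximate parameter count of a RoBERTa-style MLM (tied decoder)."""
    layer_norm = 2 * hidden                      # weight + bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V, output projection
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    # LM head: dense + LayerNorm + output bias (decoder weight assumed tied).
    lm_head = hidden * hidden + hidden + layer_norm + vocab
    return embeddings + layers * per_layer + lm_head


# Values copied from config.json in this commit.
n_params = roberta_mlm_param_count(vocab=48000, hidden=768, layers=12,
                                   intermediate=3072, max_pos=514,
                                   type_vocab=1)
weight_bytes = 4 * n_params   # float32 = 4 bytes per parameter
file_bytes = 491_846_808      # "size" in the LFS pointer

print(n_params)                   # 122955648
print(file_bytes - weight_bytes)  # bytes not explained by the weights
```

Under these assumptions the checkpoint holds roughly 123M parameters, and the weights account for all but a few tens of kilobytes of the file, which is plausibly the safetensors header.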

special_tokens_map.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<mask>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "max_length": null,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "</s>",
+  "stride": 0,
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "<unk>"
+}
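
The special-token IDs in `config.json` and in this file's `added_tokens_decoder` appear to disagree. The sketch below compares the two mappings as they are written in this commit; pairing `bos_token_id`, `pad_token_id`, and `eos_token_id` with `<s>`, `<pad>`, and `</s>` follows the token names declared in `tokenizer_config.json`, and `find_id_mismatches` is an illustrative helper.

```python
# IDs declared in config.json (model side): bos, pad, eos.
config_ids = {"<s>": 0, "<pad>": 1, "</s>": 2}

# IDs declared under "added_tokens_decoder" in tokenizer_config.json.
tokenizer_ids = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "<mask>": 4}


def find_id_mismatches(model_side: dict, tokenizer_side: dict) -> dict:
    """Map each token to (model_id, tokenizer_id) where the two disagree."""
    return {
        token: (model_id, tokenizer_side.get(token))
        for token, model_id in model_side.items()
        if tokenizer_side.get(token) != model_id
    }


mismatches = find_id_mismatches(config_ids, tokenizer_ids)
for token, (model_id, tok_id) in mismatches.items():
    print(f"{token}: config.json={model_id}, tokenizer_config.json={tok_id}")
```

If the mismatch is real rather than an export artifact, the model would treat ID 1 (`<unk>` for the tokenizer) as padding when it computes position IDs, so it is worth confirming the mapping before fine-tuning.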