selfms committed on
Commit 0dce0e2 · verified · 1 Parent(s): 2527ccb

Upload 6 files

README.md CHANGED
@@ -1,3 +1,274 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language: fa
3
+ license: apache-2.0
4
+ library_name: transformers
5
+ pipeline_tag: fill-mask
6
+ tags:
7
+ - roberta
8
+ - masked-lm
9
+ - persian
10
+ - farsi
11
+ - ner
12
+ - relation-extraction
13
+ model-index:
14
+ - name: persian_roberta_opt_tokenizer
15
+   results:
16
+   - task:
17
+       type: token-classification
18
+       name: Named Entity Recognition (NER)
19
+     dataset:
20
+       name: ARMAN + PEYMA (merged)
21
+       type: ner
22
+       config: fa
23
+     metrics:
24
+     - type: precision
25
+       value: 93.4
26
+     - type: recall
27
+       value: 94.8
28
+     - type: f1
29
+       value: 94.08
30
+   - task:
31
+       type: relation-classification
32
+       name: Relation Extraction
33
+     dataset:
34
+       name: PERLEX
35
+       type: relation-extraction
36
+       config: fa
37
+     metrics:
38
+     - type: f1
39
+       value: 90.0
40
+ ---
41
+
42
+ # persian_roberta_opt_tokenizer
43
+
44
+ A compact RoBERTa-style **Masked Language Model (MLM)** for Persian (Farsi).
45
+ The Persian BPE tokenizer was trained on a mixed corpus that combines formal text with social-media and chat data.
46
+ The model was pre-trained with this tokenizer and evaluated on two downstream tasks:
47
+
48
+ - **NER** on a **merged ARMAN + PEYMA** corpus
49
+ - **Relation Extraction** on **PERLEX**
50
+
51
+ Model size and training hyperparameters were kept **identical** to the baselines to ensure fair comparisons.
52
+
53
+ ---
54
+
55
+ ## 1) Model Description
56
+
57
+ - **Architecture:** RoBERTa-style Transformer for Masked LM
58
+ - **Intended use:** Persian text understanding, masked token prediction, and as a backbone for NER/RE fine-tuning
59
+ - **Vocabulary:** BPE with Persian-aware preprocessing (supports ZWNJ and Persian punctuation)
60
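+ - **Max sequence length:** 512 tokens (`max_position_embeddings` is 514 in `config.json`; `model_max_length` is 512 in `tokenizer_config.json`)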
61
+
62
+ > Hub repository: `selfms/persian_roberta_opt_tokenizer`.
63
+
64
+ ---
65
+
66
+ ## 2) Architecture and Training Setup
67
+
68
+ **Backbone (values from `config.json`):**
69
+ - hidden size: 768
70
+ - layers: 12
71
+ - attention heads: 12
72
+ - intermediate size: 3072
73
+ - activation: GELU
74
+ - dropout: 0.1
75
+ - positional embeddings: 514
76
+
77
+ > All baselines used **the same parameter budget**.
78
+
79
+ **Pretraining objective:** Masked Language Modeling
80
+
81
+ **Fine-tuning hyperparameters (shared across all compared models):**
82
+ ```text
83
+ epochs = 3
84
+ batch_size = 8
85
+ learning_rate = 3e-5
86
+ weight_decay = 0.01
87
+ max_tokens = 128
88
+ optimizer = AdamW
89
+ scheduler = linear with warmup (recommended 10% warmup)
90
+ seed = 42
91
+ ```
92
+
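The scheduler entry in the hyperparameter block can be read as a learning-rate curve. The following is a minimal pure-Python sketch of linear warmup followed by linear decay to zero, assuming the 10% warmup fraction that the block lists as a recommendation. The function name `linear_warmup_lr` and the value of `total_steps` are illustrative and are not taken from the training code.

```python
def linear_warmup_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.1):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr (end of warmup) to 0 (end of training).
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical number of optimizer steps
for s in (0, 50, 100, 550, 1000):
    print(s, linear_warmup_lr(s, total_steps))
```

With `total_steps = 1000`, the rate peaks at `3e-5` at step 100 and returns to zero at step 1000.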
93
+ ---
94
+
95
+ ## 3) Data and Tasks
96
+
97
+ ### NER
98
+ - **Datasets:** **ARMAN** + **PEYMA**, merged and standardized to a unified **BIO** tag set
99
+ - **Preprocessing:** Persian normalization (digits, punctuation, ZWNJ), sentence segmentation, max length 128, alignment of word-level labels to BPE sub-tokens
100
+
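The label-alignment step in the NER preprocessing above can be sketched as follows. This is one common convention (label the first sub-token of each word, mask the rest with `-100`), not necessarily the one used for the reported numbers. The `word_ids` list mirrors what a fast tokenizer returns from `BatchEncoding.word_ids()`; the example values are hypothetical.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Assign each word's label to its first sub-token and mask the others."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as <s>, </s>, <pad> carry no label.
            aligned.append(ignore_index)
        elif wid != previous:
            # First sub-token of a new word receives the word's label.
            aligned.append(word_labels[wid])
        else:
            # Continuation sub-tokens are ignored by the loss.
            aligned.append(ignore_index)
        previous = wid
    return aligned


# Hypothetical sentence of 3 words; the second word splits into two sub-tokens.
word_labels = [0, 1, 2]
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_labels, word_ids))
```

Masking continuation pieces keeps the loss and the metrics at word level; propagating the label to every sub-token is an equally valid alternative.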
101
+ ### Relation Extraction
102
+ - **Dataset:** **PERLEX** (Persian Relation Extraction)
103
+ - **Entity marking:** special entity markers in the text (recommended) or span pooling; the reported baseline uses simple [CLS] (`<s>`) pooling
104
+
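The entity-marker option mentioned above can be illustrated with a small helper that wraps the head and tail entity spans in marker tokens before tokenization. This is a sketch under assumptions: the marker strings `<e1>`, `</e1>`, `<e2>`, `</e2>` and the function name `mark_entities` are hypothetical, and such markers would have to be added to the tokenizer vocabulary before fine-tuning.

```python
def mark_entities(tokens, head, tail, markers=("<e1>", "</e1>", "<e2>", "</e2>")):
    """Wrap two non-empty, non-overlapping token spans [start, end) in markers."""
    h_open, h_close, t_open, t_close = markers
    out = []
    for i in range(len(tokens) + 1):
        # Emit closing markers before opening ones so adjacent spans stay ordered.
        if i == head[1]:
            out.append(h_close)
        if i == tail[1]:
            out.append(t_close)
        if i == head[0]:
            out.append(h_open)
        if i == tail[0]:
            out.append(t_open)
        if i < len(tokens):
            out.append(tokens[i])
    return out


# Hypothetical English example; spans are given as [start, end) word indices.
tokens = ["Ali", "was", "born", "in", "Tehran"]
print(" ".join(mark_entities(tokens, head=(0, 1), tail=(4, 5))))
```

A classifier can then read the hidden states at the opening markers instead of the `<s>` position.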
105
+ ---
106
+
107
+ ## 4) Quantitative Results
108
+
109
+ ### 4.1 NER (ARMAN + PEYMA, merged)
110
+
111
+ | Model | Precision | Recall | F1-Score |
112
+ |--------------------------:|----------:|-------:|---------:|
113
+ | **Proposed (this model)** | **93.4** | **94.8** | **94.08** |
114
+ | TooKaBERT-base | 94.9 | 96.2 | 95.5 |
115
+ | FABERT | 94.1 | 95.3 | 94.7 |
116
+
117
+ ### 4.2 Relation Extraction (PERLEX)
118
+
119
+ | Model | F1-score (%) |
120
+ |--------------------------:|-------------:|
121
+ | **Proposed (this model)** | **90** |
122
+ | TooKaBERT-base | 91 |
123
+ | FABERT | 88 |
124
+
125
+ > All three models used **identical** hyperparameters, token length, and parameter budgets to isolate architecture/tokenizer effects.
126
+
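Since F1 is the harmonic mean of precision and recall, the NER table can be sanity-checked with a few lines of arithmetic. The sketch below recomputes F1 from the reported precision and recall; small differences from the reported values are expected because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the NER table.
rows = {
    "Proposed": (93.4, 94.8),
    "TooKaBERT-base": (94.9, 96.2),
    "FABERT": (94.1, 95.3),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```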
127
+ ---
128
+
129
+ ## 5) Usage
130
+
131
+ ### 5.1 Fill-Mask Inference (simple)
132
+
133
+ ```python
134
+ from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
135
+
136
+ path = "selfms/persian_roberta_opt_tokenizer"
137
+
138
+ tokenizer = AutoTokenizer.from_pretrained(path)
139
+ model = AutoModelForMaskedLM.from_pretrained(path)
140
+ model.eval()
141
+
142
+ fill = pipeline("fill-mask", model=model, tokenizer=tokenizer, top_k=10)
143
+ print(fill("فنفت سلام کسی تحلیل دقیقی ازاین <mask> داره کی میخواد حرکت کنه"))
144
+ ```
145
+
146
+ ### 5.2 Text-Embedding Inference (simple)
147
+
148
+ ```python
149
+ import torch
150
+ from transformers import AutoTokenizer, AutoModel
151
+
152
+ path = "selfms/persian_roberta_opt_tokenizer"
153
+ tok = AutoTokenizer.from_pretrained(path)
154
+ mdl = AutoModel.from_pretrained(path).eval()
155
+
156
+ def embed(text):
157
+     with torch.no_grad():
158
+         x = tok(text, return_tensors="pt", truncation=True, max_length=256)
159
+         h = mdl(**x).last_hidden_state              # (1, seq_len, hidden)
160
+         a = x["attention_mask"].unsqueeze(-1)       # (1, seq_len, 1)
161
+         v = (h * a).sum(1) / a.sum(1).clamp(min=1)  # masked mean pooling
162
+     return (v / v.norm(dim=1, keepdim=True)).squeeze(0)  # L2-normalized 1D vector
163
+
164
+ text = "متن فارسی به بردار 768 بعدی تبدیل میشه"
165
+ vec = embed(text)
166
+ print(len(vec))
167
+ ```
168
+
169
+
170
+ ### 5.3 Tokenizer Inference (simple)
171
+
172
+ ```python
173
+ from transformers import AutoTokenizer
174
+
175
+ path = "selfms/persian_roberta_opt_tokenizer"
176
+ tok = AutoTokenizer.from_pretrained(path)
177
+
178
+ text = "برای tokenizer از پیش پردازش معنایی روی دیتاست ها مختلف خبری و شبکه های اجتماعی استفاده شده"
179
+
180
+ enc = tok(text, return_tensors="pt")
181
+ tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
182
+
183
+ print("Tokens:", tokens)
184
+ print("IDs :", enc["input_ids"][0].tolist())
185
+
186
+ ```
187
+
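When inspecting token IDs as above, it can be worth checking that the special-token IDs agree between the model config and the tokenizer. The sketch below compares the IDs listed in this commit's `config.json` with those listed under `added_tokens_decoder` in `tokenizer_config.json`. The helper name `special_token_mismatches` is illustrative, and the IDs actually used at runtime should be confirmed on a loaded tokenizer (`tok.bos_token_id`, `tok.pad_token_id`, `tok.eos_token_id`).

```python
def special_token_mismatches(config_ids, tokenizer_ids):
    """Return {role: (config_id, tokenizer_id)} for roles whose IDs differ."""
    return {
        role: (config_ids[role], tokenizer_ids[role])
        for role in config_ids
        if role in tokenizer_ids and config_ids[role] != tokenizer_ids[role]
    }


# bos/pad/eos IDs as listed in config.json in this commit.
config_ids = {"bos": 0, "pad": 1, "eos": 2}
# IDs as listed under added_tokens_decoder in tokenizer_config.json.
tokenizer_ids = {"pad": 0, "unk": 1, "bos": 2, "eos": 3, "mask": 4}

print(special_token_mismatches(config_ids, tokenizer_ids))
```

If the loaded tokenizer confirms the second mapping, the `bos_token_id`, `pad_token_id`, and `eos_token_id` entries in `config.json` would need to be updated to match; otherwise padding may be handled incorrectly during fine-tuning.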
188
+ ---
189
+
190
+ ## 6) Comparison with Other Models
191
+
192
+ Under identical parameter budgets and training settings:
193
+
194
+ - **NER (ARMAN + PEYMA):** TooKaBERT achieves the highest F1 (95.5); our model is competitive at 94.08, slightly below FABERT (94.7).
195
+ - **Relation Extraction (PERLEX):** Our model (F1=90) surpasses FABERT (88) and is slightly below TooKaBERT (91).
196
+
197
+ These results suggest the tokenizer/backbone choices here are strong for RE and competitive for NER, especially considering the compact backbone.
198
+
199
+ ---
200
+
201
+ ## 7) Limitations, Bias, and Ethical Considerations
202
+
203
+ - **Domain bias:** Training corpora and NER/RE datasets are news/formal-text heavy; performance may drop on slang, dialects, or domain-specific jargon.
204
+ - **Tokenization quirks:** ZWNJ handling and Persian punctuation are supported, but mixed Persian/English code-switching can degrade quality.
205
+ - **Sequence length:** Experiments reported at `max_tokens=128`. Longer contexts may require re-tuning and more memory.
206
+ - **Stereotypes/Bias:** As with all language models, learned correlations may reflect societal biases. Avoid using outputs as ground truth for sensitive decisions.
207
+
208
+ ---
209
+
210
+ ## 8) How to Reproduce
211
+
212
+ 1) Pretrain or load the MLM checkpoint:
213
+ ```python
214
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
215
+ tok = AutoTokenizer.from_pretrained("selfms/persian_roberta_opt_tokenizer")
216
+ mdl = AutoModelForMaskedLM.from_pretrained("selfms/persian_roberta_opt_tokenizer")
217
+ ```
218
+
219
+ 2) Fine-tune for NER/RE with the shared hyperparameters:
220
+ ```
221
+ epochs=3, batch_size=8, lr=3e-5, weight_decay=0.01, max_tokens=128
222
+ ```
223
+
224
+ 3) Evaluate:
225
+ - NER: entity-level micro Precision/Recall/F1 under the BIO tagging scheme
226
+ - RE: relation-level micro-F1 on PERLEX
227
+
228
+ ---
229
+
230
+ ## 9) Files in the Repository
231
+
232
+ - `config.json`
233
+ - `model.safetensors` or `pytorch_model.bin`
234
+ - `tokenizer_config.json`, `special_tokens_map.json`, `tokenizer.json`
235
+ - `vocab.json`, `merges.txt` (BPE)
236
+ - `README.md`, `LICENSE`, `.gitattributes`
237
+
238
+ > Ensure `mask_token` is set to `<mask>` and `pipeline_tag: fill-mask` is present so the Hub widget works out-of-the-box.
239
+
240
+ ---
241
+
242
+ ## 10) Citation
243
+
244
+ If you use this model, please cite:
245
+
246
+ ```bibtex
247
+ @misc{persian_roberta_opt_tokenizer_2025,
248
+ title = {persian\_roberta\_opt\_tokenizer: A compact RoBERTa-style Persian Masked LM},
249
+ author = {selfms},
250
+ year = {2025},
251
+ howpublished = {\url{https://huggingface.co/selfms/persian_roberta_opt_tokenizer}},
252
+ note = {Pretrained on Persian text; evaluated on ARMAN+PEYMA (NER) and PERLEX (RE).}
253
+ }
254
+ ```
255
+
256
+ ---
257
+
258
+ ## 11) License
259
+
260
+ Apache-2.0. Please verify dataset licenses (ARMAN, PEYMA, PERLEX) before redistribution.
261
+
262
+
263
+ ## Metrics & Evaluation Notes
264
+ - **NER:** entity-level micro-F1 under the **BIO** tagging scheme.
265
+ - **Relation Extraction (RE):** micro-F1 at relation level.
266
+ - **Sequence length:** the model supports up to **512** tokens (the 514 position embeddings include RoBERTa's offset of 2 for the padding index). Evaluations in this report used **128** for efficiency, matching the shared fine-tuning hyperparameters.
267
+
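The entity-level micro-F1 named in the notes above can be computed by extracting spans from BIO sequences and counting exact matches. The sketch below is a strict-matching version written for illustration; the evaluation script behind the reported numbers is not part of this repository, so details such as the handling of ill-formed tags are assumptions.

```python
def bio_spans(tags):
    """Extract (type, start, end) spans from a BIO sequence; end is exclusive."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag without a matching open span is ignored (strict reading).
    return spans


def entity_micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged precision, recall, F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_micro_f1(gold, pred))
```

In this toy example the prediction finds one of two gold entities, so precision is 1.0 and recall is 0.5.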
268
+
269
+ ## Model Config Summary
270
+ - **Architecture:** RoBERTa-base (12 layers, 12 heads, hidden size **768**, FFN **3072**).
271
+ - **Max positions:** 514 (effective input up to 512 tokens).
272
+ - **Dropout:** hidden 0.1, attention 0.1.
273
+ - **Vocab size:** 48,000 (BPE).
274
+ - **Special tokens:** `<s>=0`, `<pad>=1`, `</s>=2`, `<mask>` as mask token.
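The config summary above implies a parameter count that can be estimated by hand. The sketch below assumes the standard RoBERTa masked-LM layout with the output projection tied to the input embeddings; the function name is illustrative and the result is an estimate, not an inventory of the checkpoint's tensors.

```python
def roberta_mlm_param_count(vocab_size, hidden, layers, intermediate,
                            max_positions, type_vocab=1):
    """Estimate parameters of a RoBERTa MLM with tied input/output embeddings."""
    layer_norm = 2 * hidden  # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections (weights + biases), plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), plus one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # LM head: dense layer, LayerNorm, and a per-token output bias.
    lm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


n_params = roberta_mlm_param_count(
    vocab_size=48000, hidden=768, layers=12, intermediate=3072, max_positions=514
)
print(n_params, "parameters,", n_params * 4, "bytes in float32")
```

Under these assumptions the estimate is about 123M parameters, or roughly 491.8 MB in float32. That is within about 24 KB of the `model.safetensors` size listed in this commit (491,846,808 bytes); the remainder would be consistent with file header metadata.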
config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "_name_or_path": "selfms/persian_roberta_opt_tokenizer",
3
+ "architectures": [
4
+ "RobertaForMaskedLM"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "bos_token_id": 0,
8
+ "classifier_dropout": null,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_dropout_prob": 0.1,
12
+ "hidden_size": 768,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 3072,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 514,
17
+ "model_type": "roberta",
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 12,
20
+ "pad_token_id": 1,
21
+ "position_embedding_type": "absolute",
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.46.3",
24
+ "type_vocab_size": 1,
25
+ "use_cache": true,
26
+ "vocab_size": 48000
27
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:926f4020bf655ad46c98c6d49fa3014d783aa964d42ef51391589c841cb985c3
3
+ size 491846808
special_tokens_map.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "mask_token": {
17
+ "content": "<mask>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "pad_token": {
24
+ "content": "<pad>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "sep_token": {
31
+ "content": "</s>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "unk_token": {
38
+ "content": "<unk>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ }
44
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<pad>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<unk>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "<s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "</s>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "<mask>",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "eos_token": "</s>",
47
+ "extra_special_tokens": {},
48
+ "mask_token": "<mask>",
49
+ "max_length": null,
50
+ "model_input_names": [
51
+ "input_ids",
52
+ "attention_mask"
53
+ ],
54
+ "model_max_length": 512,
55
+ "pad_to_multiple_of": null,
56
+ "pad_token": "<pad>",
57
+ "pad_token_type_id": 0,
58
+ "padding_side": "right",
59
+ "sep_token": "</s>",
60
+ "stride": 0,
61
+ "tokenizer_class": "PreTrainedTokenizerFast",
62
+ "truncation_side": "right",
63
+ "truncation_strategy": "longest_first",
64
+ "unk_token": "<unk>"
65
+ }