Commit 8a7eb1e (verified) by nahiar
1 Parent(s): 948f460

Initial upload (auto-create if missing)

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,206 @@
+ ---
+ language:
+ - id
+ - en
+ library_name: transformers
+ pipeline_tag: token-classification
+ tags:
+ - token-classification
+ - named-entity-recognition
+ - indonesian
+ - english
+ - multilingual
+ - xlm-roberta
+ - social-media
+ license: apache-2.0
+ metrics:
+ - f1
+ - precision
+ - recall
+ base_model:
+ - FacebookAI/xlm-roberta-base
+ ---
+
+ # 🌍 Multilingual Named Entity Recognition for Social Media
+ **Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
+
+ A fine-tuned **XLM-RoBERTa-Base** model for **Named Entity Recognition (NER)** on noisy social media text.
+
+ This model is optimized for multilingual informal content commonly found on:
+ - Twitter / X
+ - Instagram
+ - TikTok
+ - Facebook
+ - Online forums
+
+ It supports both **Bahasa Indonesia** and **English**, making it suitable for moderation systems, social listening, and content intelligence pipelines.
+
+ ---
+
+ ## 🔍 Model Overview
+
+ - **Architecture**: `FacebookAI/xlm-roberta-base`
+ - **Task**: Token Classification (NER)
+ - **Languages**: Indonesian, English
+ - **Domain**: Informal & Social Media Text
+ - **Training Date**: 2026-02-26
+
+ ---
+
+ ## 🏷️ Supported Entity Labels
+
+ This model detects the following entity types:
+
+ | Label | Description |
+ |------:|------------|
+ | PER | Person |
+ | ORG | Organization |
+ | NOR | Political Organization |
+ | GPE | Geopolitical Entity |
+ | LOC | Location |
+ | FAC | Facility |
+ | LAW | Law or regulation (e.g., Undang-Undang) |
+ | EVT | Event |
+ | WOA | Work of Art |
+
+ ### Tagging Scheme
+
+ The model uses the BIO tagging format:
+ - `B-XXX` → Beginning of an entity
+ - `I-XXX` → Inside an entity
+ - `O` → Outside any entity
+
+ ---
+
+ ## 📊 Model Performance
+
+ Evaluated on a held-out validation set:
+
+ | Metric | Score |
+ |-----------------|--------|
+ | F1 Score | 0.8388 |
+ | Precision | 0.8204 |
+ | Recall | 0.8581 |
+ | Training Loss | 0.0021 |
+ | Validation Loss | 0.1310 |
+
+ **Evaluation Details**
+ - Metrics computed with `seqeval`
+ - Micro-averaged F1 score
+ - The validation set has a balanced entity distribution
+
+ ---
+
+ ## 🏗️ Training Configuration
+
+ | Parameter | Value |
+ |-------------------|------------------|
+ | Base Model | xlm-roberta-base |
+ | Training Samples | 695,108 |
+ | Validation Samples | 106,197 |
+ | Epochs | 5 |
+ | Learning Rate | 4e-5 |
+ | Batch Size | 32 |
+ | Optimizer | AdamW |
+ | Scheduler | Linear schedule with warmup |
+ | Framework | Hugging Face Transformers |
+
+ ---
+
+ ## 🚀 Usage
+
+ ### Quick Inference (Hugging Face Pipeline)
+
+ ```python
+ from transformers import pipeline
+
+ ner = pipeline(
+     "token-classification",
+     model="nahiar/xlm-roberta-ner",
+     aggregation_strategy="simple",  # merge subword pieces into entity spans
+ )
+
+ text_id = "Jokowi menghadiri World Economic Forum di Davos."  # Indonesian example
+ text_en = "Apple is opening a new office in Jakarta next month."  # English example
+
+ print(ner(text_id))
+ print(ner(text_en))
+ ```
+
+ ### Aggregation Strategy Notes
+ - `"simple"` → Recommended; merges adjacent tokens that share the same entity label
+ - `"first"` → Each word takes the label of its first token
+ - `"average"` → Averages token scores across each word, then picks the top label
+ - `"max"` → Each word takes the label of its highest-scoring token
+
+ ---
+
+ ## 🎯 Intended Use Cases
+
+ - Social media Named Entity Recognition
+ - Comment & post filtering
+ - Content moderation assistance
+ - Political monitoring
+ - Brand & organization tracking
+ - Multilingual content intelligence systems
+
+ ---
+
+ ## ⚠️ Limitations
+
+ - Supports only the defined entity set:
+   `NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA`
+ - Not optimized for:
+   - Formal academic/legal documents
+   - Extremely short or ambiguous messages
+   - Heavy slang or sarcastic expressions
+ - Performance may degrade on highly code-mixed sentences
+ - The model may inherit bias from training data
+
+ ---
+
+ ## ⚖️ Ethical Considerations
+
+ This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
+
+ It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
+
+ Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
+
+ ---
+
+ ## 🖥️ Hardware Recommendations
+
+ - **Recommended**: GPU (≥ 8GB VRAM) for optimal performance
+ - CPU inference supported but slower
+ - Compatible with FP16 mixed precision for faster inference
+
+ ---
+
+ ## 📜 License
+
+ Released under the **Apache 2.0 License**.
+ Free for commercial and research use.
+
+ ---
+
+ ## 📚 Citation
+
+ ```bibtex
+ @misc{hidayatuloh2026multilingualner,
+   author = {Nuri Hidayatuloh},
+   title = {Multilingual Named Entity Recognition for Social Media},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/nahiar/xlm-roberta-ner}
+ }
+ ```
+
+ ---
+
+ ## 🙌 Acknowledgements
+
+ - Hugging Face Transformers
+ - Facebook AI Research (XLM-RoBERTa)
+ - Open-source NLP community
+ - Contributors and dataset annotators
.ipynb_checkpoints/eval_results-checkpoint.txt ADDED
@@ -0,0 +1,4 @@
+ eval_loss = 0.13100967527582094
+ f1_score = 0.8387909319899245
+ precision = 0.8203654280435229
+ recall = 0.8580631307708826
.ipynb_checkpoints/training_progress_scores-checkpoint.csv ADDED
@@ -0,0 +1,6 @@
+ global_step,train_loss,eval_loss,precision,recall,f1_score
+ 392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
+ 784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
+ 1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
+ 1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
+ 1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245
README.md ADDED
@@ -0,0 +1,206 @@
+ ---
+ language:
+ - id
+ - en
+ library_name: transformers
+ pipeline_tag: token-classification
+ tags:
+ - token-classification
+ - named-entity-recognition
+ - indonesian
+ - english
+ - multilingual
+ - xlm-roberta
+ - social-media
+ license: apache-2.0
+ metrics:
+ - f1
+ - precision
+ - recall
+ base_model:
+ - FacebookAI/xlm-roberta-base
+ ---
+
+ # 🌍 Multilingual Named Entity Recognition for Social Media
+ **Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
+
+ A fine-tuned **XLM-RoBERTa-Base** model for **Named Entity Recognition (NER)** on noisy social media text.
+
+ This model is optimized for multilingual informal content commonly found on:
+ - Twitter / X
+ - Instagram
+ - TikTok
+ - Facebook
+ - Online forums
+
+ It supports both **Bahasa Indonesia** and **English**, making it suitable for moderation systems, social listening, and content intelligence pipelines.
+
+ ---
+
+ ## 🔍 Model Overview
+
+ - **Architecture**: `FacebookAI/xlm-roberta-base`
+ - **Task**: Token Classification (NER)
+ - **Languages**: Indonesian, English
+ - **Domain**: Informal & Social Media Text
+ - **Training Date**: 2026-02-26
+
+ ---
+
+ ## 🏷️ Supported Entity Labels
+
+ This model detects the following entity types:
+
+ | Label | Description |
+ |------:|------------|
+ | PER | Person |
+ | ORG | Organization |
+ | NOR | Political Organization |
+ | GPE | Geopolitical Entity |
+ | LOC | Location |
+ | FAC | Facility |
+ | LAW | Law or regulation (e.g., Undang-Undang) |
+ | EVT | Event |
+ | WOA | Work of Art |
+
+ ### Tagging Scheme
+
+ The model uses the BIO tagging format:
+ - `B-XXX` → Beginning of an entity
+ - `I-XXX` → Inside an entity
+ - `O` → Outside any entity
+
+ ---
+
+ ## 📊 Model Performance
+
+ Evaluated on a held-out validation set:
+
+ | Metric | Score |
+ |-----------------|--------|
+ | F1 Score | 0.8388 |
+ | Precision | 0.8204 |
+ | Recall | 0.8581 |
+ | Training Loss | 0.0021 |
+ | Validation Loss | 0.1310 |
+
+ **Evaluation Details**
+ - Metrics computed with `seqeval`
+ - Micro-averaged F1 score
+ - The validation set has a balanced entity distribution
+
+ ---
+
+ ## 🏗️ Training Configuration
+
+ | Parameter | Value |
+ |-------------------|------------------|
+ | Base Model | xlm-roberta-base |
+ | Training Samples | 695,108 |
+ | Validation Samples | 106,197 |
+ | Epochs | 5 |
+ | Learning Rate | 4e-5 |
+ | Batch Size | 32 |
+ | Optimizer | AdamW |
+ | Scheduler | Linear schedule with warmup |
+ | Framework | Hugging Face Transformers |
+
+ ---
+
+ ## 🚀 Usage
+
+ ### Quick Inference (Hugging Face Pipeline)
+
+ ```python
+ from transformers import pipeline
+
+ ner = pipeline(
+     "token-classification",
+     model="nahiar/xlm-roberta-ner",
+     aggregation_strategy="simple",  # merge subword pieces into entity spans
+ )
+
+ text_id = "Jokowi menghadiri World Economic Forum di Davos."  # Indonesian example
+ text_en = "Apple is opening a new office in Jakarta next month."  # English example
+
+ print(ner(text_id))
+ print(ner(text_en))
+ ```
+
+ ### Aggregation Strategy Notes
+ - `"simple"` → Recommended; merges adjacent tokens that share the same entity label
+ - `"first"` → Each word takes the label of its first token
+ - `"average"` → Averages token scores across each word, then picks the top label
+ - `"max"` → Each word takes the label of its highest-scoring token
+
+ ---
+
+ ## 🎯 Intended Use Cases
+
+ - Social media Named Entity Recognition
+ - Comment & post filtering
+ - Content moderation assistance
+ - Political monitoring
+ - Brand & organization tracking
+ - Multilingual content intelligence systems
+
+ ---
+
+ ## ⚠️ Limitations
+
+ - Supports only the defined entity set:
+   `NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA`
+ - Not optimized for:
+   - Formal academic/legal documents
+   - Extremely short or ambiguous messages
+   - Heavy slang or sarcastic expressions
+ - Performance may degrade on highly code-mixed sentences
+ - The model may inherit bias from training data
+
+ ---
+
+ ## ⚖️ Ethical Considerations
+
+ This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
+
+ It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
+
+ Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
+
+ ---
+
+ ## 🖥️ Hardware Recommendations
+
+ - **Recommended**: GPU (≥ 8GB VRAM) for optimal performance
+ - CPU inference supported but slower
+ - Compatible with FP16 mixed precision for faster inference
+
+ ---
+
+ ## 📜 License
+
+ Released under the **Apache 2.0 License**.
+ Free for commercial and research use.
+
+ ---
+
+ ## 📚 Citation
+
+ ```bibtex
+ @misc{hidayatuloh2026multilingualner,
+   author = {Nuri Hidayatuloh},
+   title = {Multilingual Named Entity Recognition for Social Media},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/nahiar/xlm-roberta-ner}
+ }
+ ```
+
+ ---
+
+ ## 🙌 Acknowledgements
+
+ - Hugging Face Transformers
+ - Facebook AI Research (XLM-RoBERTa)
+ - Open-source NLP community
+ - Contributors and dataset annotators
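
The BIO scheme described in the README's Tagging Scheme section can be turned into entity spans with a small decoder. The sketch below is illustrative only: the function name `bio_to_spans` and the example tokens and tags are assumptions for demonstration, not output from this model or code from this commit.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return spans


# Hypothetical tags for the README's Indonesian example sentence.
tokens = ["Jokowi", "menghadiri", "World", "Economic", "Forum", "di", "Davos", "."]
tags = ["B-PER", "O", "B-EVT", "I-EVT", "I-EVT", "O", "B-GPE", "O"]
print(bio_to_spans(tokens, tags))
```

In this sketch a stray `I-` tag that does not follow a matching `B-` tag is dropped rather than promoted to a new entity; other decoders make the opposite choice.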
config.json ADDED
@@ -0,0 +1,69 @@
+ {
+   "architectures": [
+     "XLMRobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "B-EVT",
+     "1": "B-GPE",
+     "2": "B-LOC",
+     "3": "B-PER",
+     "4": "B-FAC",
+     "5": "B-LAW",
+     "6": "B-NOR",
+     "7": "B-WOA",
+     "8": "B-ORG",
+     "9": "I-EVT",
+     "10": "I-GPE",
+     "11": "I-LOC",
+     "12": "I-PER",
+     "13": "I-FAC",
+     "14": "I-LAW",
+     "15": "I-NOR",
+     "16": "I-WOA",
+     "17": "I-ORG",
+     "18": "O"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "B-EVT": 0,
+     "B-FAC": 4,
+     "B-GPE": 1,
+     "B-LAW": 5,
+     "B-LOC": 2,
+     "B-NOR": 6,
+     "B-ORG": 8,
+     "B-PER": 3,
+     "B-WOA": 7,
+     "I-EVT": 9,
+     "I-FAC": 13,
+     "I-GPE": 10,
+     "I-LAW": 14,
+     "I-LOC": 11,
+     "I-NOR": 15,
+     "I-ORG": 17,
+     "I-PER": 12,
+     "I-WOA": 16,
+     "O": 18
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.57.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
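
The `id2label` and `label2id` maps in this config should be exact inverses of each other, and every entity type should have both a `B-` and an `I-` tag. A minimal sketch of that check, assuming the label order shown in the config; the names `LABELS` and `decode` are illustrative and not part of the commit:

```python
# Label order copied from id2label in config.json (list index = class id).
LABELS = [
    "B-EVT", "B-GPE", "B-LOC", "B-PER", "B-FAC", "B-LAW", "B-NOR", "B-WOA", "B-ORG",
    "I-EVT", "I-GPE", "I-LOC", "I-PER", "I-FAC", "I-LAW", "I-NOR", "I-WOA", "I-ORG",
    "O",
]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

# Every entity type should appear with both a B- and an I- prefix.
entity_types = sorted({label[2:] for label in LABELS if label != "O"})
for entity in entity_types:
    assert f"B-{entity}" in label2id and f"I-{entity}" in label2id


def decode(class_ids):
    """Map predicted class ids (for example an argmax over logits) to tag strings."""
    return [id2label[i] for i in class_ids]


print(entity_types)
print(decode([3, 18, 0, 9, 9, 18, 1, 18]))
```

The nine entity types recovered this way match the set listed in the README's Limitations section.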
eval_results.txt ADDED
@@ -0,0 +1,4 @@
+ eval_loss = 0.13100967527582094
+ f1_score = 0.8387909319899245
+ precision = 0.8203654280435229
+ recall = 0.8580631307708826
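
As a sanity check, the reported F1 should equal the harmonic mean of the reported precision and recall, $F_1 = 2PR / (P + R)$. The sketch below parses the `key = value` lines shown in this file and recomputes it; the helper name `parse_results` is an assumption made for illustration.

```python
# Contents of eval_results.txt, copied from the diff above.
EVAL_RESULTS = """\
eval_loss = 0.13100967527582094
f1_score = 0.8387909319899245
precision = 0.8203654280435229
recall = 0.8580631307708826
"""


def parse_results(text):
    """Parse 'key = value' lines into a dict of floats."""
    results = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            results[key.strip()] = float(value)
    return results


metrics = parse_results(EVAL_RESULTS)
p, r = metrics["precision"], metrics["recall"]
f1_from_pr = 2 * p * r / (p + r)

print(f"reported F1 = {metrics['f1_score']:.4f}")
print(f"F1 from P/R = {f1_from_pr:.4f}")
```

Rounded to four decimals, these values are 0.8388 (F1), 0.8204 (precision), and 0.8581 (recall).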
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b78e1a9ffcb81a75b4968289f6f1a02777f5824acbdda77f88d67193d25cc0a2
+ size 1109894716
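
The pointer's `size` field can be cross-checked against the hyperparameters in `config.json`. The sketch below is a rough estimate, not an inspection of the actual weights: it assumes a standard BERT/RoBERTa-style encoder layout with biases, no pooling layer in the token-classification model, and float32 storage (per the config's `dtype`).

```python
# Hyperparameters copied from config.json in this commit.
vocab_size = 250002
hidden = 768
intermediate = 3072
layers = 12
max_positions = 514
type_vocab = 1
num_labels = 19

# Assumed layout: word, position and token-type embeddings plus one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
# Per layer: Q, K, V and output projections (with biases) plus one LayerNorm.
attention = 4 * (hidden * hidden + hidden) + 2 * hidden
# Per layer: two feed-forward projections (with biases) plus one LayerNorm.
feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden) + 2 * hidden
# Linear classification head over the 19 BIO labels.
classifier = hidden * num_labels + num_labels

total_params = embeddings + layers * (attention + feed_forward) + classifier
estimated_bytes = total_params * 4   # 4 bytes per float32 parameter
lfs_size = 1109894716                # "size" field of the LFS pointer above

print(total_params, estimated_bytes, lfs_size - estimated_bytes)
```

Under these assumptions the estimate lands within a few tens of kilobytes of the pointer size; the remainder would be consistent with file metadata, though the commit itself does not confirm the breakdown.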
model_args.json ADDED
@@ -0,0 +1 @@
+ {"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_betas": [0.9, 0.999], "adam_epsilon": 1e-08, "best_model_dir": "../model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "dataset_cache_dir": null, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 100, "evaluate_during_training": true, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": false, "gradient_accumulation_steps": 1, "learning_rate": 4e-05, "local_rank": -1, "logging_steps": 50, "loss_type": null, "loss_args": {}, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 128, "model_name": "xlm-roberta-base", "model_type": "xlmroberta", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 5, "optimizer": "AdamW", "output_dir": "../model", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 62, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": false, "save_model_every_epoch": false, "save_optimizer_and_scheduler": true, "save_steps": -1, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": null, "tokenizer_type": null, "train_batch_size": 32, "train_custom_parameters_only": false, 
"trust_remote_code": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 118, "weight_decay": 0.0, "model_class": "NERModel", "classification_report": false, "labels_list": ["B-EVT", "B-GPE", "B-LOC", "B-PER", "B-FAC", "B-LAW", "B-NOR", "B-WOA", "B-ORG", "I-EVT", "I-GPE", "I-LOC", "I-PER", "I-FAC", "I-LAW", "I-NOR", "I-WOA", "I-ORG", "O"], "lazy_loading": false, "lazy_loading_start_line": 0, "onnx": false, "special_tokens_list": []}
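
Several values in this file can be tied together with simple arithmetic. The sketch below assumes that the first evaluation row in `training_progress_scores.csv` (step 392) marks the end of epoch 1, and that warmup steps are the warmup ratio applied to the total step count, rounded up; both are inferences from the files in this commit rather than documented behavior.

```python
import math

# Values copied from model_args.json and training_progress_scores.csv.
train_batch_size = 32
num_train_epochs = 5
warmup_ratio = 0.06
steps_per_epoch = 392  # assumed: first evaluation row = end of epoch 1
total_steps = steps_per_epoch * num_train_epochs

# Assumption: warmup steps = ceil(total optimizer steps * warmup ratio).
warmup_steps = math.ceil(total_steps * warmup_ratio)

# With gradient_accumulation_steps = 1, one step consumes one batch, so this
# is an upper bound on the number of training sequences seen per epoch.
max_sequences_per_epoch = steps_per_epoch * train_batch_size

print(total_steps, warmup_steps, max_sequences_per_epoch)
```

The result agrees with the recorded `warmup_steps` of 118 and the final `global_step` of 1960. The per-epoch bound of 12,544 sequences is far below the README's 695,108 "Training Samples", which suggests that figure may count something other than sequences (for example tokens); the commit does not say.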
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a0f91a36470f359f68abc07093a9a16b60b342794fbdbdfcf98275d1186f8a2c
+ size 2219908235
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf3294d889bf48681ced831367e12487b88118b5ec30b2a9b0b7f2030688db6a
+ size 1465
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": false,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:62d0ba91edf9c6d27b41c769be47f3b02835e7645acd8233dc1ecadb7b7b836c
+ size 4113
training_progress_scores.csv ADDED
@@ -0,0 +1,6 @@
+ global_step,train_loss,eval_loss,precision,recall,f1_score
+ 392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
+ 784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
+ 1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
+ 1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
+ 1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245
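
The per-epoch log makes it easy to compare checkpoints under different selection criteria. A minimal sketch using only the standard library, with the CSV rows copied from this file; the variable names are illustrative:

```python
import csv
import io

# Contents of training_progress_scores.csv, copied from the diff above.
CSV_TEXT = """\
global_step,train_loss,eval_loss,precision,recall,f1_score
392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245
"""

# Convert every column to float so rows can be compared numerically.
rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
]

best_f1 = max(rows, key=lambda row: row["f1_score"])
lowest_loss = min(rows, key=lambda row: row["eval_loss"])

print("best F1:", int(best_f1["global_step"]), round(best_f1["f1_score"], 4))
print("lowest eval loss:", int(lowest_loss["global_step"]), round(lowest_loss["eval_loss"], 4))
```

The two criteria disagree here: F1 peaks at the final step (1960), while validation loss is lowest at step 784 and rises afterwards. The README reports the step-1960 metrics; which checkpoint the uploaded weights correspond to is not stated in the commit.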