rollerhafeezh-amikom committed
Commit a662f31 · 1 Parent(s): b39760b
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ license: mit
+ base_model: xlm-roberta-base
+ tags:
+ - silvanus
+ metrics:
+ - precision
+ - recall
+ - f1
+ - accuracy
+ model-index:
+ - name: xlm-roberta-base-ner-silvanus
+   results:
+   - task:
+       name: Token Classification
+       type: token-classification
+     dataset:
+       name: id_nergrit_corpus
+       type: id_nergrit_corpus
+       config: ner
+       split: validation
+       args: ner
+     metrics:
+     - name: Precision
+       type: precision
+       value: 0.918918918918919
+     - name: Recall
+       type: recall
+       value: 0.9272727272727272
+     - name: F1
+       type: f1
+       value: 0.9230769230769231
+     - name: Accuracy
+       type: accuracy
+       value: 0.9858518778229216
+ language:
+ - id
+ - en
+ - es
+ - it
+ - sk
+ pipeline_tag: token-classification
+ widget:
+ - text: >-
+     Kebakaran hutan dan lahan terus terjadi dan semakin meluas di Kota
+     Palangkaraya, Kalimantan Tengah (Kalteng) pada hari Rabu, 15 Nopember 2023
+     20.00 WIB. Bahkan kobaran api mulai membakar pondok warga dan mendekati
+     permukiman. BZK #RCTINews #SeputariNews #News #Karhutla #KebakaranHutan
+     #HutanKalimantan #SILVANUS_Italian_Pilot_Testing
+   example_title: Indonesia
+ - text: >-
+     Wildfire rages for a second day in Evia destroying a Natura 2000 protected
+     pine forest. - 5:51 PM Aug 14, 2019
+   example_title: English
+ - text: >-
+     3 nov 2023 21:57 - Incendio forestal obliga a la evacuación de hasta 850
+     personas cerca del pueblo de Montichelvo en Valencia.
+   example_title: Spanish
+ - text: >-
+     Incendi boschivi nell'est del Paese: 2 morti e oltre 50 case distrutte nello
+     stato del Queensland.
+   example_title: Italian
+ - text: >-
+     Lesné požiare na Sicílii si vyžiadali dva ľudské životy a evakuáciu hotela
+     http://dlvr.it/SwW3sC - 23. septembra 2023 20:57
+   example_title: Slovak
+ ---
+
69
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
70
+ should probably proofread and complete it, then remove this comment. -->
71
+
72
+ # xlm-roberta-base-ner-silvanus
73
+
74
+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the Indonesian NER dataset.
75
+ It achieves the following results on the evaluation set:
76
+ - Loss: 0.0567
77
+ - Precision: 0.9189
78
+ - Recall: 0.9273
79
+ - F1: 0.9231
80
+ - Accuracy: 0.9859
81
+
82
+ ## Model description
83
+
84
+ The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
85
+
86
+ - **Developed by:** See [associated paper](https://arxiv.org/abs/1911.02116)
87
+ - **Model type:** Multi-lingual model
88
+ - **Language(s) (NLP) or Countries (images):** XLM-RoBERTa is a multilingual model trained on 100 different languages; see [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr) for full list; model is fine-tuned on a dataset in English
89
+ - **License:** More information needed
90
+ - **Related Models:** [RoBERTa](https://huggingface.co/roberta-base), [XLM](https://huggingface.co/docs/transformers/model_doc/xlm)
91
+ - **Parent Model:** [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base)
92
+ - **Resources for more information:** [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr)
93
+
94
+ ## Intended uses & limitations
95
+
96
+ This model can be used to extract multilingual information such as location, date and time on social media (Twitter, etc.). This model is limited by an Indonesian language training data set to be tested in 4 languages (English, Spanish, Italian and Slovak) using zero-shot transfer learning techniques to extract multilingual information.
97
+ 
+ ## Training and evaluation data
+ 
+ This model was fine-tuned on the Indonesian NER dataset (id_nergrit_corpus), using the following label scheme:
+ 
+ | Abbreviation | Description |
+ |---|---|
+ | O | Outside of a named entity |
+ | B-LOC | Beginning of a location right after another location |
+ | I-LOC | Location |
+ | B-DAT | Beginning of a date right after another date |
+ | I-DAT | Date |
+ | B-TIM | Beginning of a time right after another time |
+ | I-TIM | Time |
+ 
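As a rough illustration of how these BIO tags are consumed downstream, the sketch below (plain Python, with a hypothetical tag sequence rather than real model output) groups per-token predictions into `(entity_type, text)` spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:  # close the previous entity
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag without a matching B-
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Hypothetical predictions for an Indonesian sentence
tokens = ["Kebakaran", "di", "Kota", "Palangkaraya", "pada", "15", "November", "2023"]
tags = ["O", "O", "B-LOC", "I-LOC", "O", "B-DAT", "I-DAT", "I-DAT"]
print(bio_to_spans(tokens, tags))
# → [('LOC', 'Kota Palangkaraya'), ('DAT', '15 November 2023')]
```

In practice you would obtain the per-token tags from a transformers token-classification pipeline and then aggregate them in this way (or use the pipeline's built-in aggregation).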
+ ## Training procedure
+ 
+ ### Training hyperparameters
+ 
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 3
+ 
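The step counts in the training-results table follow directly from these hyperparameters: with a train batch size of 8 and 827 optimizer steps per epoch, the training split holds roughly 8 × 827 ≈ 6,616 examples. A quick arithmetic check in plain Python (the dataset size is inferred, not stated in the card):

```python
train_batch_size = 8
steps_per_epoch = 827  # from the training-results table (epoch 1.0 ends at step 827)
num_epochs = 3

# Total optimizer steps over 3 epochs; matches the table's final step count
total_steps = steps_per_epoch * num_epochs
print(total_steps)
# → 2481

# Implied training-set size (upper bound; the last batch may be partial)
max_examples = steps_per_epoch * train_batch_size
print(max_examples)
# → 6616
```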
+ ### Training results
+ 
+ | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
+ | 0.1394 | 1.0 | 827 | 0.0559 | 0.8808 | 0.9257 | 0.9027 | 0.9842 |
+ | 0.0468 | 2.0 | 1654 | 0.0575 | 0.9107 | 0.9190 | 0.9148 | 0.9849 |
+ | 0.0279 | 3.0 | 2481 | 0.0567 | 0.9189 | 0.9273 | 0.9231 | 0.9859 |
+ 
+ ### Framework versions
+ 
+ - Transformers 4.35.0
+ - Pytorch 2.1.0+cu118
+ - Datasets 2.14.6
+ - Tokenizers 0.14.1
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "_name_or_path": "xlm-roberta-base",
+   "architectures": [
+     "XLMRobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-LOC",
+     "2": "I-LOC",
+     "3": "B-DAT",
+     "4": "I-DAT",
+     "5": "B-TIM",
+     "6": "I-TIM"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "B-DAT": 3,
+     "B-LOC": 1,
+     "B-TIM": 5,
+     "I-DAT": 4,
+     "I-LOC": 2,
+     "I-TIM": 6,
+     "O": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
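The `id2label` and `label2id` maps in this config must be mutual inverses, and every `B-` tag needs a matching `I-` tag. A quick sanity check in plain Python, with the maps copied from the config (JSON string keys written as ints for brevity):

```python
# Label maps copied from config.json
id2label = {0: "O", 1: "B-LOC", 2: "I-LOC", 3: "B-DAT", 4: "I-DAT", 5: "B-TIM", 6: "I-TIM"}
label2id = {"B-DAT": 3, "B-LOC": 1, "B-TIM": 5, "I-DAT": 4, "I-LOC": 2, "I-TIM": 6, "O": 0}

# Each map inverts the other
assert all(label2id[label] == idx for idx, label in id2label.items())

# Every entity type has both a B- and an I- tag
b_types = {l[2:] for l in label2id if l.startswith("B-")}
i_types = {l[2:] for l in label2id if l.startswith("I-")}
assert b_types == i_types == {"LOC", "DAT", "TIM"}
print("7 labels, consistent")
```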
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92cd496a4ea4ccfb04c64c5c5c4cb6210ae5ad1f89b56f42248b046fc8ad896b
+ size 1109857804
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7bc482bbe4d17395039e8c52dfd60092475c46bdcd8cd16c7238b2c522cd2c8
+ size 17082941
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
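The `added_tokens_decoder` entries pin the special tokens to fixed vocabulary ids, with `<mask>` occupying the last slot of the 250,002-token vocabulary. A small consistency sketch in plain Python, with values copied from `tokenizer_config.json` and `config.json`:

```python
# Special-token ids copied from tokenizer_config.json
added_tokens = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 250001: "<mask>"}
vocab_size = 250002  # from config.json

# <mask> sits in the last vocabulary slot
assert max(added_tokens) == vocab_size - 1

# bos/pad/eos ids agree with config.json (bos_token_id=0, pad_token_id=1, eos_token_id=2)
assert added_tokens[0] == "<s>" and added_tokens[1] == "<pad>" and added_tokens[2] == "</s>"
print("special tokens consistent")
```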
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65fcda7383fe94a15e2b9530f5b4601fed9001ccb72e783caca56685b97ffd99
+ size 4600