rollerhafeezh-amikom committed
Commit a662f31 · 1 Parent(s): b39760b
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ license: mit
+ base_model: xlm-roberta-base
+ tags:
+ - silvanus
+ metrics:
+ - precision
+ - recall
+ - f1
+ - accuracy
+ model-index:
+ - name: xlm-roberta-base-ner-silvanus
+   results:
+   - task:
+       name: Token Classification
+       type: token-classification
+     dataset:
+       name: id_nergrit_corpus
+       type: id_nergrit_corpus
+       config: ner
+       split: validation
+       args: ner
+     metrics:
+     - name: Precision
+       type: precision
+       value: 0.918918918918919
+     - name: Recall
+       type: recall
+       value: 0.9272727272727272
+     - name: F1
+       type: f1
+       value: 0.9230769230769231
+     - name: Accuracy
+       type: accuracy
+       value: 0.9858518778229216
+ language:
+ - id
+ - en
+ - es
+ - it
+ - sk
+ pipeline_tag: token-classification
+ widget:
+ - text: >-
+     Kebakaran hutan dan lahan terus terjadi dan semakin meluas di Kota
+     Palangkaraya, Kalimantan Tengah (Kalteng) pada hari Rabu, 15 Nopember 2023
+     20.00 WIB. Bahkan kobaran api mulai membakar pondok warga dan mendekati
+     permukiman. BZK #RCTINews #SeputariNews #News #Karhutla #KebakaranHutan
+     #HutanKalimantan #SILVANUS_Italian_Pilot_Testing
+   example_title: Indonesia
+ - text: >-
+     Wildfire rages for a second day in Evia destroying a Natura 2000 protected
+     pine forest. - 5:51 PM Aug 14, 2019
+   example_title: English
+ - text: >-
+     3 nov 2023 21:57 - Incendio forestal obliga a la evacuación de hasta 850
+     personas cerca del pueblo de Montichelvo en Valencia.
+   example_title: Spanish
+ - text: >-
+     Incendi boschivi nell'est del Paese: 2 morti e oltre 50 case distrutte nello
+     stato del Queensland.
+   example_title: Italian
+ - text: >-
+     Lesné požiare na Sicílii si vyžiadali dva ľudské životy a evakuáciu hotela
+     http://dlvr.it/SwW3sC - 23. septembra 2023 20:57
+   example_title: Slovak
+ ---
+
69
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
70
+ should probably proofread and complete it, then remove this comment. -->
71
+
72
+ # xlm-roberta-base-ner-silvanus
73
+
74
+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the Indonesian NER dataset.
75
+ It achieves the following results on the evaluation set:
76
+ - Loss: 0.0567
77
+ - Precision: 0.9189
78
+ - Recall: 0.9273
79
+ - F1: 0.9231
80
+ - Accuracy: 0.9859
81
+
82
+ ## Model description
83
+
84
+ The XLM-RoBERTa model was proposed in [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
85
+
86
+ - **Developed by:** See [associated paper](https://arxiv.org/abs/1911.02116)
87
+ - **Model type:** Multi-lingual model
88
+ - **Language(s) (NLP) or Countries (images):** XLM-RoBERTa is a multilingual model trained on 100 different languages; see [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr) for full list; model is fine-tuned on a dataset in English
89
+ - **License:** More information needed
90
+ - **Related Models:** [RoBERTa](https://huggingface.co/roberta-base), [XLM](https://huggingface.co/docs/transformers/model_doc/xlm)
91
+ - **Parent Model:** [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base)
92
+ - **Resources for more information:** [GitHub Repo](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr)
93
+
94
+ ## Intended uses & limitations
95
+
96
+ This model can be used to extract multilingual information such as location, date and time on social media (Twitter, etc.). This model is limited by an Indonesian language training data set to be tested in 4 languages (English, Spanish, Italian and Slovak) using zero-shot transfer learning techniques to extract multilingual information.
97
+ 
+ ## Training and evaluation data
+ 
+ This model was fine-tuned on the Indonesian NER dataset (id_nergrit_corpus), using the following label scheme:
+ 
+ | Abbreviation | Description |
+ |---|---|
+ | O | Outside of a named entity |
+ | B-LOC | Beginning of a location right after another location |
+ | I-LOC | Location |
+ | B-DAT | Beginning of a date right after another date |
+ | I-DAT | Date |
+ | B-TIM | Beginning of a time right after another time |
+ | I-TIM | Time |
+ 
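As a rough illustration of how these BIO tags are consumed downstream, the sketch below (plain Python, with a hypothetical tag sequence rather than real model output) groups per-token predictions into `(entity_type, text)` spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:  # close the previous entity
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag without a matching B-
            if current_type:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Hypothetical predictions for an Indonesian sentence
tokens = ["Kebakaran", "di", "Kota", "Palangkaraya", "pada", "15", "November", "2023"]
tags = ["O", "O", "B-LOC", "I-LOC", "O", "B-DAT", "I-DAT", "I-DAT"]
print(bio_to_spans(tokens, tags))
# → [('LOC', 'Kota Palangkaraya'), ('DAT', '15 November 2023')]
```

In practice you would obtain the per-token tags from a transformers token-classification pipeline and then aggregate them in this way (or use the pipeline's built-in aggregation).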
+ ## Training procedure
+ 
+ ### Training hyperparameters
+ 
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 3
+ 
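The step counts in the training-results table follow directly from these hyperparameters: with a train batch size of 8 and 827 optimizer steps per epoch, the training split holds roughly 8 × 827 ≈ 6,616 examples. A quick arithmetic check in plain Python (the dataset size is inferred, not stated in the card):

```python
train_batch_size = 8
steps_per_epoch = 827  # from the training-results table (epoch 1.0 ends at step 827)
num_epochs = 3

# Total optimizer steps over 3 epochs; matches the table's final step count
total_steps = steps_per_epoch * num_epochs
print(total_steps)
# → 2481

# Implied training-set size (upper bound; the last batch may be partial)
max_examples = steps_per_epoch * train_batch_size
print(max_examples)
# → 6616
```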
+ ### Training results
+ 
+ | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
+ |:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
+ | 0.1394 | 1.0 | 827 | 0.0559 | 0.8808 | 0.9257 | 0.9027 | 0.9842 |
+ | 0.0468 | 2.0 | 1654 | 0.0575 | 0.9107 | 0.9190 | 0.9148 | 0.9849 |
+ | 0.0279 | 3.0 | 2481 | 0.0567 | 0.9189 | 0.9273 | 0.9231 | 0.9859 |
+ 
+ ### Framework versions
+ 
+ - Transformers 4.35.0
+ - Pytorch 2.1.0+cu118
+ - Datasets 2.14.6
+ - Tokenizers 0.14.1
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "_name_or_path": "xlm-roberta-base",
+   "architectures": [
+     "XLMRobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-LOC",
+     "2": "I-LOC",
+     "3": "B-DAT",
+     "4": "I-DAT",
+     "5": "B-TIM",
+     "6": "I-TIM"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "B-DAT": 3,
+     "B-LOC": 1,
+     "B-TIM": 5,
+     "I-DAT": 4,
+     "I-LOC": 2,
+     "I-TIM": 6,
+     "O": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.0",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
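The `id2label` and `label2id` maps in this config must be mutual inverses, and every `B-` tag needs a matching `I-` tag. A quick sanity check in plain Python, with the maps copied from the config (JSON string keys written as ints for brevity):

```python
# Label maps copied from config.json
id2label = {0: "O", 1: "B-LOC", 2: "I-LOC", 3: "B-DAT", 4: "I-DAT", 5: "B-TIM", 6: "I-TIM"}
label2id = {"B-DAT": 3, "B-LOC": 1, "B-TIM": 5, "I-DAT": 4, "I-LOC": 2, "I-TIM": 6, "O": 0}

# Each map inverts the other
assert all(label2id[label] == idx for idx, label in id2label.items())

# Every entity type has both a B- and an I- tag
b_types = {l[2:] for l in label2id if l.startswith("B-")}
i_types = {l[2:] for l in label2id if l.startswith("I-")}
assert b_types == i_types == {"LOC", "DAT", "TIM"}
print("7 labels, consistent")
```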
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92cd496a4ea4ccfb04c64c5c5c4cb6210ae5ad1f89b56f42248b046fc8ad896b
+ size 1109857804
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7bc482bbe4d17395039e8c52dfd60092475c46bdcd8cd16c7238b2c522cd2c8
+ size 17082941
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
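The `added_tokens_decoder` entries pin the special tokens to fixed vocabulary ids, with `<mask>` occupying the last slot of the 250,002-token vocabulary. A small consistency sketch in plain Python, with values copied from `tokenizer_config.json` and `config.json`:

```python
# Special-token ids copied from tokenizer_config.json
added_tokens = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 250001: "<mask>"}
vocab_size = 250002  # from config.json

# <mask> sits in the last vocabulary slot
assert max(added_tokens) == vocab_size - 1

# bos/pad/eos ids agree with config.json (bos_token_id=0, pad_token_id=1, eos_token_id=2)
assert added_tokens[0] == "<s>" and added_tokens[1] == "<pad>" and added_tokens[2] == "</s>"
print("special tokens consistent")
```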
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:65fcda7383fe94a15e2b9530f5b4601fed9001ccb72e783caca56685b97ffd99
+ size 4600