Initial upload (auto-create if missing)

Files changed (15)

.ipynb_checkpoints/README-checkpoint.md +206 -0
.ipynb_checkpoints/eval_results-checkpoint.txt +4 -0
.ipynb_checkpoints/training_progress_scores-checkpoint.csv +6 -0
README.md +206 -0
config.json +69 -0
eval_results.txt +4 -0
model.safetensors +3 -0
model_args.json +1 -0
optimizer.pt +3 -0
scheduler.pt +3 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +15 -0
tokenizer_config.json +57 -0
training_args.bin +3 -0
training_progress_scores.csv +6 -0

.ipynb_checkpoints/README-checkpoint.md ADDED

	@@ -0,0 +1,206 @@

+---
+language:
+- id
+- en
+library_name: transformers
+pipeline_tag: token-classification
+tags:
+- token-classification
+- named-entity-recognition
+- indonesian
+- english
+- multilingual
+- xlm-roberta
+- social-media
+license: apache-2.0
+metrics:
+- f1
+- precision
+- recall
+base_model:
+- FacebookAI/xlm-roberta-base
+---
+# 🌍 Multilingual Named Entity Recognition for Social Media
+**Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
+A fine-tuned **XLM-RoBERTa-Base** model for **Named Entity Recognition (NER)** on noisy social media text.
+This model is optimized for multilingual informal content commonly found on:
+- Twitter / X
+- Instagram
+- TikTok
+- Facebook
+- Online forums
+It supports both **Bahasa Indonesia** and **English**, making it suitable for moderation systems, social listening, and content intelligence pipelines.
+---
+## 🔍 Model Overview
+- **Architecture**: `FacebookAI/xlm-roberta-base`
+- **Task**: Token Classification (NER)
+- **Languages**: Indonesian, English
+- **Domain**: Informal & Social Media Text
+- **Training Date**: 2026-02-26
+---
+## 🏷️ Supported Entity Labels
+This model detects the following entity types:
+| Label | Description |
+|------:|------------|
+| PER   | Person |
+| ORG   | Organization |
+| NOR   | Political Organization |
+| GPE   | Geopolitical Entity |
+| LOC   | Location |
+| FAC   | Facility |
+| LAW   | Legal Entity (e.g., Undang-Undang) |
+| EVT   | Event |
+| WOA   | Work of Art |
+### Tagging Scheme
+BIO tagging format is used:
+- `B-XXX` → Beginning of an entity
+- `I-XXX` → Inside an entity
+- `O` → Outside any entity
+---
+## 📊 Model Performance
+Evaluated on the held-out validation set:
+| Metric           | Score  |
+|-----------------|--------|
+| F1 Score        | 0.8388 |
+| Precision       | 0.8204 |
+| Recall          | 0.8581 |
+| Training Loss   | 0.0021 |
+| Validation Loss | 0.1310 |
+**Evaluation Details**
+- Metric computed using `seqeval`
+- Micro-averaged F1 score
+- Validation set contains balanced entity distribution
+---
+## 🏗️ Training Configuration
+| Parameter          | Value            |
+|-------------------|------------------|
+| Base Model         | xlm-roberta-base |
+| Training Samples   | 695,108          |
+| Validation Samples | 106,197          |
+| Epochs             | 5                |
+| Learning Rate      | 4e-5             |
+| Batch Size         | 32               |
+| Optimizer          | AdamW            |
+| Scheduler          | Linear decay with warmup (warmup ratio 0.06) |
+| Framework          | Hugging Face Transformers |
+---
+## 🚀 Usage
+### Quick Inference (Hugging Face Pipeline)
+```python
+from transformers import pipeline
+ner = pipeline(
+    "token-classification",
+    model="nahiar/xlm-roberta-ner",
+    aggregation_strategy="simple"
+)
+text_id = "Jokowi menghadiri World Economic Forum di Davos."
+text_en = "Apple is opening a new office in Jakarta next month."
+print(ner(text_id))
+print(ner(text_en))
+```
+### Aggregation Strategy Notes
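+- `"simple"` → Recommended; groups adjacent tokens that share the same entity tag
+- `"first"` → Each word takes the tag of its first token
+- `"average"` → Averages token scores across each word, then takes the top label
+- `"max"` → Each word takes the tag of its highest-scoring token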
+---
+## 🎯 Intended Use Cases
+- Social media Named Entity Recognition
+- Comment & post filtering
+- Content moderation assistance
+- Political monitoring
+- Brand & organization tracking
+- Multilingual content intelligence systems
+---
+## ⚠️ Limitations
+- Supports only the defined entity set:
+  `NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA`
+- Not optimized for:
+  - Formal academic/legal documents
+  - Extremely short or ambiguous messages
+  - Heavy slang or sarcastic expressions
+- Performance may degrade on highly code-mixed sentences
+- The model may inherit bias from training data
+---
+## ⚖️ Ethical Considerations
+This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
+It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
+Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
+---
+## 🖥️ Hardware Recommendations
+- **Recommended**: GPU (≥ 8GB VRAM) for optimal performance
+- CPU inference supported but slower
+- Compatible with FP16 mixed precision for faster inference
+---
+## 📜 License
+Released under the **Apache 2.0 License**.
+Free for commercial and research use.
+---
+## 📚 Citation
+```bibtex
+@misc{hidayatuloh2026multilingualner,
+  author    = {Nuri Hidayatuloh},
+  title     = {Multilingual Named Entity Recognition for Social Media},
+  year      = {2026},
+  publisher = {Hugging Face},
+  url       = {https://huggingface.co/nahiar/xlm-roberta-ner}
+}
+```
+---
+## 🙌 Acknowledgements
+- Hugging Face Transformers
+- Facebook AI Research — XLM-RoBERTa
+- Open-source NLP community
+- Contributors and dataset annotators

.ipynb_checkpoints/eval_results-checkpoint.txt ADDED

	@@ -0,0 +1,4 @@

+eval_loss = 0.13100967527582094
+f1_score = 0.8387909319899245
+precision = 0.8203654280435229
+recall = 0.8580631307708826

.ipynb_checkpoints/training_progress_scores-checkpoint.csv ADDED

	@@ -0,0 +1,6 @@

+global_step,train_loss,eval_loss,precision,recall,f1_score
+392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
+784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
+1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
+1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
+1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245

README.md ADDED

	@@ -0,0 +1,206 @@

+---
+language:
+- id
+- en
+library_name: transformers
+pipeline_tag: token-classification
+tags:
+- token-classification
+- named-entity-recognition
+- indonesian
+- english
+- multilingual
+- xlm-roberta
+- social-media
+license: apache-2.0
+metrics:
+- f1
+- precision
+- recall
+base_model:
+- FacebookAI/xlm-roberta-base
+---
+# 🌍 Multilingual Named Entity Recognition for Social Media
+**Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
+A fine-tuned **XLM-RoBERTa-Base** model for **Named Entity Recognition (NER)** on noisy social media text.
+This model is optimized for multilingual informal content commonly found on:
+- Twitter / X
+- Instagram
+- TikTok
+- Facebook
+- Online forums
+It supports both **Bahasa Indonesia** and **English**, making it suitable for moderation systems, social listening, and content intelligence pipelines.
+---
+## 🔍 Model Overview
+- **Architecture**: `FacebookAI/xlm-roberta-base`
+- **Task**: Token Classification (NER)
+- **Languages**: Indonesian, English
+- **Domain**: Informal & Social Media Text
+- **Training Date**: 2026-02-26
+---
+## 🏷️ Supported Entity Labels
+This model detects the following entity types:
+| Label | Description |
+|------:|------------|
+| PER   | Person |
+| ORG   | Organization |
+| NOR   | Political Organization |
+| GPE   | Geopolitical Entity |
+| LOC   | Location |
+| FAC   | Facility |
+| LAW   | Legal Entity (e.g., Undang-Undang) |
+| EVT   | Event |
+| WOA   | Work of Art |
+### Tagging Scheme
+BIO tagging format is used:
+- `B-XXX` → Beginning of an entity
+- `I-XXX` → Inside an entity
+- `O` → Outside any entity
+---
+## 📊 Model Performance
+Evaluated on the held-out validation set:
+| Metric           | Score  |
+|-----------------|--------|
+| F1 Score        | 0.8388 |
+| Precision       | 0.8204 |
+| Recall          | 0.8581 |
+| Training Loss   | 0.0021 |
+| Validation Loss | 0.1310 |
+**Evaluation Details**
+- Metric computed using `seqeval`
+- Micro-averaged F1 score
+- Validation set contains balanced entity distribution
+---
+## 🏗️ Training Configuration
+| Parameter          | Value            |
+|-------------------|------------------|
+| Base Model         | xlm-roberta-base |
+| Training Samples   | 695,108          |
+| Validation Samples | 106,197          |
+| Epochs             | 5                |
+| Learning Rate      | 4e-5             |
+| Batch Size         | 32               |
+| Optimizer          | AdamW            |
+| Scheduler          | Linear decay with warmup (warmup ratio 0.06) |
+| Framework          | Hugging Face Transformers |
+---
+## 🚀 Usage
+### Quick Inference (Hugging Face Pipeline)
+```python
+from transformers import pipeline
+ner = pipeline(
+    "token-classification",
+    model="nahiar/xlm-roberta-ner",
+    aggregation_strategy="simple"
+)
+text_id = "Jokowi menghadiri World Economic Forum di Davos."
+text_en = "Apple is opening a new office in Jakarta next month."
+print(ner(text_id))
+print(ner(text_en))
+```
+### Aggregation Strategy Notes
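+- `"simple"` → Recommended; groups adjacent tokens that share the same entity tag
+- `"first"` → Each word takes the tag of its first token
+- `"average"` → Averages token scores across each word, then takes the top label
+- `"max"` → Each word takes the tag of its highest-scoring token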
+---
+## 🎯 Intended Use Cases
+- Social media Named Entity Recognition
+- Comment & post filtering
+- Content moderation assistance
+- Political monitoring
+- Brand & organization tracking
+- Multilingual content intelligence systems
+---
+## ⚠️ Limitations
+- Supports only the defined entity set:
+  `NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA`
+- Not optimized for:
+  - Formal academic/legal documents
+  - Extremely short or ambiguous messages
+  - Heavy slang or sarcastic expressions
+- Performance may degrade on highly code-mixed sentences
+- The model may inherit bias from training data
+---
+## ⚖️ Ethical Considerations
+This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
+It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
+Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
+---
+## 🖥️ Hardware Recommendations
+- **Recommended**: GPU (≥ 8GB VRAM) for optimal performance
+- CPU inference supported but slower
+- Compatible with FP16 mixed precision for faster inference
+---
+## 📜 License
+Released under the **Apache 2.0 License**.
+Free for commercial and research use.
+---
+## 📚 Citation
+```bibtex
+@misc{hidayatuloh2026multilingualner,
+  author    = {Nuri Hidayatuloh},
+  title     = {Multilingual Named Entity Recognition for Social Media},
+  year      = {2026},
+  publisher = {Hugging Face},
+  url       = {https://huggingface.co/nahiar/xlm-roberta-ner}
+}
+```
+---
+## 🙌 Acknowledgements
+- Hugging Face Transformers
+- Facebook AI Research — XLM-RoBERTa
+- Open-source NLP community
+- Contributors and dataset annotators
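
The BIO scheme described in the README's Tagging Scheme section can be decoded into entity spans roughly as follows. This is a minimal sketch in plain Python: `bio_to_spans` is a hypothetical helper (it is not part of this repository or of Transformers), it assumes one tag per whitespace-separated word, and the tags in the example are hand-written for illustration, not actual model output.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []        # each item is [entity_type, [words...]]
    open_type = None  # type of the entity currently being extended, if any

    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and etype == open_type:
            # Continue the entity that is currently open.
            spans[-1][1].append(word)
        elif prefix in ("B", "I"):
            # B- starts a new entity; a stray I- (no matching open entity)
            # is treated leniently as the start of a new entity.
            spans.append([etype, [word]])
            open_type = etype
        else:
            # "O" closes any open entity.
            open_type = None

    return [(etype, " ".join(ws)) for etype, ws in spans]


words = ["Jokowi", "menghadiri", "World", "Economic", "Forum", "di", "Davos", "."]
# Hand-written tags for illustration only.
tags = ["B-PER", "O", "B-EVT", "I-EVT", "I-EVT", "O", "B-GPE", "O"]

spans = bio_to_spans(words, tags)
print(spans)
```

When the pipeline is called with an `aggregation_strategy`, it performs a comparable grouping internally, so a helper like this is only needed when working with raw per-token labels.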

config.json ADDED

	@@ -0,0 +1,69 @@

+{
+  "architectures": [
+    "XLMRobertaForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "B-EVT",
+    "1": "B-GPE",
+    "2": "B-LOC",
+    "3": "B-PER",
+    "4": "B-FAC",
+    "5": "B-LAW",
+    "6": "B-NOR",
+    "7": "B-WOA",
+    "8": "B-ORG",
+    "9": "I-EVT",
+    "10": "I-GPE",
+    "11": "I-LOC",
+    "12": "I-PER",
+    "13": "I-FAC",
+    "14": "I-LAW",
+    "15": "I-NOR",
+    "16": "I-WOA",
+    "17": "I-ORG",
+    "18": "O"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B-EVT": 0,
+    "B-FAC": 4,
+    "B-GPE": 1,
+    "B-LAW": 5,
+    "B-LOC": 2,
+    "B-NOR": 6,
+    "B-ORG": 8,
+    "B-PER": 3,
+    "B-WOA": 7,
+    "I-EVT": 9,
+    "I-FAC": 13,
+    "I-GPE": 10,
+    "I-LAW": 14,
+    "I-LOC": 11,
+    "I-NOR": 15,
+    "I-ORG": 17,
+    "I-PER": 12,
+    "I-WOA": 16,
+    "O": 18
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.57.3",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
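
The `id2label` and `label2id` maps in this config need to be exact inverses, and every entity type should have both a `B-` and an `I-` tag. The sketch below checks both properties on the values copied from the hunk above. `check_label_maps` and `entity_types` are hypothetical helper names introduced here for illustration.

```python
# Label order copied from id2label in config.json (index = label id).
labels = [
    "B-EVT", "B-GPE", "B-LOC", "B-PER", "B-FAC", "B-LAW", "B-NOR", "B-WOA", "B-ORG",
    "I-EVT", "I-GPE", "I-LOC", "I-PER", "I-FAC", "I-LAW", "I-NOR", "I-WOA", "I-ORG",
    "O",
]
id2label = {str(i): label for i, label in enumerate(labels)}

# label2id copied key by key from config.json.
label2id = {
    "B-EVT": 0, "B-FAC": 4, "B-GPE": 1, "B-LAW": 5, "B-LOC": 2, "B-NOR": 6,
    "B-ORG": 8, "B-PER": 3, "B-WOA": 7, "I-EVT": 9, "I-FAC": 13, "I-GPE": 10,
    "I-LAW": 14, "I-LOC": 11, "I-NOR": 15, "I-ORG": 17, "I-PER": 12, "I-WOA": 16,
    "O": 18,
}


def check_label_maps(id2label, label2id):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for idx, label in id2label.items():
        if label2id.get(label) != int(idx):
            problems.append(f"{label}: id2label={idx}, label2id={label2id.get(label)}")
    for label in label2id:
        if label not in id2label.values():
            problems.append(f"{label}: present in label2id only")
    return problems


def entity_types(labels):
    """Return (sorted entity types, types missing either a B- or an I- tag)."""
    begins = {l[2:] for l in labels if l.startswith("B-")}
    insides = {l[2:] for l in labels if l.startswith("I-")}
    return sorted(begins | insides), sorted(begins ^ insides)


problems = check_label_maps(id2label, label2id)
types, unpaired = entity_types(labels)
print(problems, types, unpaired)
```

The nine types recovered this way match the entity table in the README.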

eval_results.txt ADDED

	@@ -0,0 +1,4 @@

+eval_loss = 0.13100967527582094
+f1_score = 0.8387909319899245
+precision = 0.8203654280435229
+recall = 0.8580631307708826
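
As a sanity check on the numbers above, the reported F1 should equal the harmonic mean of the reported precision and recall. The sketch below recomputes it. It assumes the scores are micro-averaged (as the README states), in which case this identity holds.

```python
# Values copied from eval_results.txt.
precision = 0.8203654280435229
recall = 0.8580631307708826
reported_f1 = 0.8387909319899245

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1 = {f1:.6f}")
print(f"difference    = {abs(f1 - reported_f1):.2e}")
print(f"rounded to 4 decimals: {round(f1, 4)}")
```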

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b78e1a9ffcb81a75b4968289f6f1a02777f5824acbdda77f88d67193d25cc0a2
+size 1109894716
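
The hunk above is a Git LFS pointer, not the weights themselves. Its `size` field gives a rough parameter count. The sketch below parses the pointer and divides the byte size by 4, assuming all tensors are stored as float32 (as `"dtype": "float32"` in config.json suggests). The result is a slight overestimate, because a safetensors file also contains a JSON header. `parse_lfs_pointer` is a hypothetical helper name.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


# Pointer contents copied from the model.safetensors hunk.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:b78e1a9ffcb81a75b4968289f6f1a02777f5824acbdda77f88d67193d25cc0a2
size 1109894716
"""

info = parse_lfs_pointer(pointer)
bytes_per_param = 4  # assumption: float32 weights
approx_params = info["size"] // bytes_per_param

print(f"{info['size'] / 1e9:.2f} GB, roughly {approx_params / 1e6:.0f}M parameters")
```

The same reasoning is consistent with `optimizer.pt` being about twice this size, since AdamW keeps two moment estimates per parameter.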

model_args.json ADDED

	@@ -0,0 +1 @@

+ {"adafactor_beta1": null, "adafactor_clip_threshold": 1.0, "adafactor_decay_rate": -0.8, "adafactor_eps": [1e-30, 0.001], "adafactor_relative_step": true, "adafactor_scale_parameter": true, "adafactor_warmup_init": true, "adam_betas": [0.9, 0.999], "adam_epsilon": 1e-08, "best_model_dir": "../model", "cache_dir": "cache_dir/", "config": {}, "cosine_schedule_num_cycles": 0.5, "custom_layer_parameters": [], "custom_parameter_groups": [], "dataloader_num_workers": 0, "dataset_cache_dir": null, "do_lower_case": false, "dynamic_quantize": false, "early_stopping_consider_epochs": false, "early_stopping_delta": 0, "early_stopping_metric": "eval_loss", "early_stopping_metric_minimize": true, "early_stopping_patience": 3, "encoding": null, "eval_batch_size": 100, "evaluate_during_training": true, "evaluate_during_training_silent": true, "evaluate_during_training_steps": 2000, "evaluate_during_training_verbose": false, "evaluate_each_epoch": true, "fp16": false, "gradient_accumulation_steps": 1, "learning_rate": 4e-05, "local_rank": -1, "logging_steps": 50, "loss_type": null, "loss_args": {}, "manual_seed": null, "max_grad_norm": 1.0, "max_seq_length": 128, "model_name": "xlm-roberta-base", "model_type": "xlmroberta", "multiprocessing_chunksize": -1, "n_gpu": 1, "no_cache": false, "no_save": false, "not_saved_args": [], "num_train_epochs": 5, "optimizer": "AdamW", "output_dir": "../model", "overwrite_output_dir": true, "polynomial_decay_schedule_lr_end": 1e-07, "polynomial_decay_schedule_power": 1.0, "process_count": 62, "quantized_model": false, "reprocess_input_data": true, "save_best_model": true, "save_eval_checkpoints": false, "save_model_every_epoch": false, "save_optimizer_and_scheduler": true, "save_steps": -1, "scheduler": "linear_schedule_with_warmup", "silent": false, "skip_special_tokens": true, "tensorboard_dir": null, "thread_count": null, "tokenizer_name": null, "tokenizer_type": null, "train_batch_size": 32, "train_custom_parameters_only": false, 
"trust_remote_code": false, "use_cached_eval_features": false, "use_early_stopping": false, "use_hf_datasets": false, "use_multiprocessing": true, "use_multiprocessing_for_evaluation": true, "wandb_kwargs": {}, "wandb_project": null, "warmup_ratio": 0.06, "warmup_steps": 118, "weight_decay": 0.0, "model_class": "NERModel", "classification_report": false, "labels_list": ["B-EVT", "B-GPE", "B-LOC", "B-PER", "B-FAC", "B-LAW", "B-NOR", "B-WOA", "B-ORG", "I-EVT", "I-GPE", "I-LOC", "I-PER", "I-FAC", "I-LAW", "I-NOR", "I-WOA", "I-ORG", "O"], "lazy_loading": false, "lazy_loading_start_line": 0, "onnx": false, "special_tokens_list": []}
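
The `warmup_ratio` (0.06), `warmup_steps` (118), and `scheduler` (`linear_schedule_with_warmup`) values above fit together if the warmup length is the ratio times the total step count, rounded up. The sketch below reproduces that arithmetic and the shape of a linear warmup and decay schedule. Two assumptions: the total of 1960 steps is taken from the last `global_step` in training_progress_scores.csv, and `linear_schedule_lr` is a hypothetical standalone function written for illustration, not the library's scheduler object.

```python
import math

total_steps = 1960    # assumption: last global_step in training_progress_scores.csv
warmup_ratio = 0.06   # from model_args.json
peak_lr = 4e-5        # learning_rate from model_args.json

# 1960 * 0.06 = 117.6, rounded up to 118 (matches warmup_steps above).
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


for step in (0, 59, warmup_steps, 1000, total_steps):
    lr = linear_schedule_lr(step, peak_lr, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```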

optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a0f91a36470f359f68abc07093a9a16b60b342794fbdbdfcf98275d1186f8a2c
+size 2219908235

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bf3294d889bf48681ced831367e12487b88118b5ec30b2a9b0b7f2030688db6a
+size 1465

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051

special_tokens_map.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

tokenizer_config.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "do_lower_case": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62d0ba91edf9c6d27b41c769be47f3b02835e7645acd8233dc1ecadb7b7b836c
+size 4113

training_progress_scores.csv ADDED

	@@ -0,0 +1,6 @@

+global_step,train_loss,eval_loss,precision,recall,f1_score
+392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
+784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
+1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
+1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
+1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245
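
The progress log above shows validation loss and F1 peaking at different checkpoints, which matters when deciding which checkpoint counts as "best". The sketch below parses the CSV rows (copied verbatim) and reports the best step under each criterion. It also derives steps per epoch, assuming each row corresponds to one epoch (five rows, five epochs) and that the batch size of 32 from model_args.json applies with no gradient accumulation.

```python
import csv
import io

# Rows copied from training_progress_scores.csv.
csv_text = """global_step,train_loss,eval_loss,precision,recall,f1_score
392,0.3247712254524231,0.12926855454078087,0.7768145161290323,0.8273566673824351,0.8012893833835916
784,0.0024179292377084494,0.11792839991931732,0.8139290958674219,0.833154391238995,0.8234295415959253
1176,0.2672019302845001,0.12483000898590454,0.8082470038594353,0.8544127120463818,0.8306889352818372
1568,0.018565170466899872,0.12438045413448261,0.8160919540229885,0.853768520506764,0.8345051946689054
1960,0.002101300982758403,0.13100967527582094,0.8203654280435229,0.8580631307708826,0.8387909319899245
"""

rows = [
    {key: float(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

best_by_loss = min(rows, key=lambda r: r["eval_loss"])
best_by_f1 = max(rows, key=lambda r: r["f1_score"])

# Assumption: one evaluation row per epoch, so the first step count is one epoch.
steps_per_epoch = int(rows[0]["global_step"])
train_batch_size = 32  # from model_args.json
max_sequences_per_epoch = steps_per_epoch * train_batch_size

print("lowest eval_loss at step", int(best_by_loss["global_step"]))
print("highest f1_score at step", int(best_by_f1["global_step"]))
print("sequences per epoch (upper bound):", max_sequences_per_epoch)
```

Validation loss is lowest at step 784 while F1 keeps improving through step 1960, so the two criteria would select different checkpoints. The per-epoch sequence count derived here is far below the 695,108 training samples listed in the README, which may mean that figure counts tokens rather than sentences; that reading is a guess and is not confirmed by anything in this commit.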