🌍 Multilingual Named Entity Recognition for Social Media

Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base

A fine-tuned XLM-RoBERTa-Base model for Named Entity Recognition (NER) on noisy social media text.

This model is optimized for multilingual informal content commonly found on:

  • Twitter / X
  • Instagram
  • TikTok
  • Facebook
  • Online forums

It supports both Bahasa Indonesia and English, making it suitable for moderation systems, social listening, and content intelligence pipelines.


🔍 Model Overview

  • Architecture: FacebookAI/xlm-roberta-base
  • Task: Token Classification (NER)
  • Languages: Indonesian, English
  • Domain: Informal & Social Media Text
  • Training Date: 2026-02-26

🏷️ Supported Entity Labels

This model detects the following entity types:

| Label | Description |
|-------|-------------|
| PER | Person |
| ORG | Organization |
| NOR | Political Organization |
| GPE | Geopolitical Entity |
| LOC | Location |
| FAC | Facility |
| LAW | Legal Entity (e.g., Undang-Undang) |
| EVT | Event |
| WOA | Work of Art |

Tagging Scheme

BIO tagging format is used:

  • B-XXX → Beginning of an entity
  • I-XXX → Inside an entity
  • O → Outside any entity
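
To make the BIO scheme concrete, here is a minimal sketch that turns a tag sequence into entity spans. The helper name `bio_to_spans` and the example tags are illustrative assumptions (the tags are hand-written, not model output), and treating a stray `I-` tag as the start of a new entity is one lenient choice among several.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (type, start, end) spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, current = None, None  # start index and type of the open entity
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # Lenient choice: a stray I- tag with no open entity starts a new one.
            start, current = i, label
    return spans


tokens = ["Jokowi", "menghadiri", "World", "Economic", "Forum", "di", "Davos", "."]
# Hand-written tags for illustration only.
tags = ["B-PER", "O", "B-EVT", "I-EVT", "I-EVT", "O", "B-GPE", "O"]

for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Each span covers the `B-` token plus any directly following `I-` tokens of the same type, so "World Economic Forum" comes out as a single EVT entity.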

📊 Model Performance

Evaluated on the held-out validation set:

| Metric | Score |
|--------|-------|
| F1 Score | 0.8387 |
| Precision | 0.8203 |
| Recall | 0.8580 |
| Training Loss | 0.0021 |
| Validation Loss | 0.1310 |
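
As a quick sanity check, the reported F1 should equal the harmonic mean of the reported precision and recall. The short sketch below recomputes it from the two numbers above; nothing here comes from the actual evaluation run.

```python
precision = 0.8203
recall = 0.8580

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8387
```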

Evaluation Details

  • Metrics are computed with seqeval, which scores complete entity spans rather than individual tokens
  • The F1 score is micro-averaged across entity types
  • The validation set has a balanced entity distribution
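
For readers unfamiliar with entity-level micro-averaging, the following is a simplified pure-Python sketch of the idea. It is not seqeval's code, and its handling of malformed sequences (stray `I-` tags are ignored) may differ from seqeval's. The function names and the toy data are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def spans(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return the set of (type, start, end) entity spans in one BIO sequence."""
    found, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == label
        if label is not None and not inside:
            found.add((label, start, i))
            label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return found


def micro_prf(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Pool span counts over all sentences and entity types, then compute P/R/F1."""
    tp = n_pred = n_gold = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = spans(g_tags), spans(p_tags)
        tp += len(g & p)  # a span counts only if type and both boundaries match
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: the predicted PER span is one token too short, so it counts as a miss.
gold = [["B-PER", "I-PER", "O", "B-GPE"], ["B-ORG", "O"]]
pred = [["B-PER", "O", "O", "B-GPE"], ["B-ORG", "O"]]
print(micro_prf(gold, pred))
```

Because counts are pooled before dividing, frequent entity types weigh more heavily in the final score than rare ones.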

🏗️ Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | xlm-roberta-base |
| Training Samples | 695,108 |
| Validation Samples | 106,197 |
| Epochs | 5 |
| Learning Rate | 4e-5 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Scheduler | Linear Warmup |
| Framework | Hugging Face Transformers |
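
To give a feel for what these settings imply, the sketch below derives the number of optimizer steps and a linear warmup schedule from the values in the table. Several things are assumptions, not facts from this card: that 32 is the global batch size with no gradient accumulation, that warmup covers 10% of training, and that the learning rate decays linearly to zero after warmup.

```python
import math

train_samples = 695_108
batch_size = 32
epochs = 5
peak_lr = 4e-5

# Assumption: 32 is the global batch size and there is no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: the warmup length is not stated in the card; 10% is illustrative only.
warmup_steps = int(0.1 * total_steps)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (the decay is assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps)  # 21723 108615
```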

🚀 Usage

Quick Inference (Hugging Face Pipeline)

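```python
from transformers import pipeline

# Load the fine-tuned checkpoint; "simple" groups adjacent tokens that share an entity type.
ner = pipeline(
    "token-classification",
    model="nahiar/xlm-roberta-ner",
    aggregation_strategy="simple",
)

text_id = "Jokowi menghadiri World Economic Forum di Davos."
text_en = "Apple is opening a new office in Jakarta next month."

# Each call returns a list of dicts with entity_group, score, word, start, and end.
print(ner(text_id))
print(ner(text_en))
```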
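
Downstream code usually wants the entities grouped by type rather than as a flat list. The sketch below shows one way to do that on a result shaped like the pipeline's aggregated output. The `sample_output` list is hand-written with invented scores (it is not real model output), and both the helper `group_entities` and the `min_score` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

text = "Jokowi menghadiri World Economic Forum di Davos."

# Hand-written sample in the shape the pipeline returns when an aggregation
# strategy is set; the scores are invented for illustration.
sample_output = [
    {"entity_group": "PER", "score": 0.99, "word": "Jokowi", "start": 0, "end": 6},
    {"entity_group": "EVT", "score": 0.58, "word": "World Economic Forum", "start": 18, "end": 38},
    {"entity_group": "GPE", "score": 0.97, "word": "Davos", "start": 42, "end": 47},
]


def group_entities(text: str, entities: List[dict], min_score: float = 0.6) -> Dict[str, List[str]]:
    """Group entity surface forms by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # Slice the original text so casing and spacing are preserved.
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_entities(text, sample_output))
```

Slicing by `start` and `end` instead of reading `word` keeps the surface form exactly as it appears in the input, which matters for noisy social media text.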

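Aggregation Strategy Notes

  • "simple" → Recommended; groups adjacent tokens with the same entity type into one span (a word split into subwords can still end up with mixed tags)
  • "first" → Each word takes the tag of its first subword token
  • "average" → Scores are averaged across a word's subword tokens before the label is chosen
  • "max" → Each word takes the tag of its highest-scoring subword token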
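
The word-level strategies differ only in how they combine the scores of a word's subword tokens. The following is a simplified pure-Python illustration of that difference, not the Transformers implementation. The function name `aggregate_word` and the score values are invented for the example.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # label -> probability for one subword token


def aggregate_word(tokens: List[Scores], strategy: str) -> Tuple[str, float]:
    """Pick one label for a word from the scores of its subword tokens."""
    if strategy == "first":
        chosen = tokens[0]  # only the first subword decides
    elif strategy == "average":
        chosen = {lab: sum(t[lab] for t in tokens) / len(tokens) for lab in tokens[0]}
    elif strategy == "max":
        chosen = max(tokens, key=lambda t: max(t.values()))  # most confident subword
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    label = max(chosen, key=chosen.get)
    return label, chosen[label]


# Invented scores for one word split into two subword tokens.
word_tokens = [
    {"O": 0.05, "B-PER": 0.55, "B-ORG": 0.40},
    {"O": 0.05, "B-PER": 0.20, "B-ORG": 0.75},
]

for strategy in ("first", "average", "max"):
    print(strategy, aggregate_word(word_tokens, strategy))
```

With these numbers, "first" picks B-PER while "average" and "max" pick B-ORG, which shows why the choice of strategy can change results on words that the tokenizer splits.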

🎯 Intended Use Cases

  • Social media Named Entity Recognition
  • Comment & post filtering
  • Content moderation assistance
  • Political monitoring
  • Brand & organization tracking
  • Multilingual content intelligence systems

⚠️ Limitations

  • Supports only the defined entity set: NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA
  • Not optimized for:
    • Formal academic/legal documents
    • Extremely short or ambiguous messages
    • Heavy slang or sarcastic expressions
  • Performance may degrade on highly code-mixed sentences
  • The model may inherit bias from training data

⚖️ Ethical Considerations

This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.

It is not intended to replace human judgment in high-risk or sensitive decision-making systems.

Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.


🖥️ Hardware Recommendations

  • Recommended: GPU (≥ 8GB VRAM) for optimal performance
  • CPU inference supported but slower
  • Compatible with FP16 (half precision) for faster GPU inference

📜 License

Released under the Apache 2.0 License.
Free for commercial and research use.


📚 Citation

@misc{hidayatuloh2026multilingualner,
  author    = {Nuri Hidayatuloh},
  title     = {Multilingual Named Entity Recognition for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/xlm-roberta-ner}
}

🙌 Acknowledgements

  • Hugging Face Transformers
  • Facebook AI Research — XLM-RoBERTa
  • Open-source NLP community
  • Contributors and dataset annotators