# 🌍 Multilingual Named Entity Recognition for Social Media

**Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
A fine-tuned XLM-RoBERTa-Base model for Named Entity Recognition (NER) on noisy social media text.
This model is optimized for multilingual informal content commonly found on:
- Twitter / X
- TikTok
- Online forums
It supports both Bahasa Indonesia and English, making it suitable for moderation systems, social listening, and content intelligence pipelines.
## 🔍 Model Overview

- Architecture: FacebookAI/xlm-roberta-base
- Task: Token Classification (NER)
- Languages: Indonesian, English
- Domain: Informal & Social Media Text
- Training Date: 2026-02-26
## 🏷️ Supported Entity Labels
This model detects the following entity types:
| Label | Description |
|---|---|
| PER | Person |
| ORG | Organization |
| NOR | Political Organization |
| GPE | Geopolitical Entity |
| LOC | Location |
| FAC | Facility |
| LAW | Legal Entity (e.g., Undang-Undang, Indonesian statutes) |
| EVT | Event |
| WOA | Work of Art |
### Tagging Scheme

The BIO tagging format is used:

- `B-XXX` → beginning of an entity
- `I-XXX` → inside an entity
- `O` → outside any entity
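As an illustration of how BIO tags map to entity spans, here is a minimal pure-Python decoder. The `bio_to_spans` helper is hypothetical and for explanation only; the Hugging Face pipeline performs this step internally:

```python
# Minimal BIO decoder: turns per-token BIO tags into (label, start, end) spans.
def bio_to_spans(tags):
    """Convert BIO tags into (label, start_idx, end_idx_exclusive) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span first
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                       # still inside the current entity
        else:                              # "O" or a label mismatch closes the span
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans

# Tokens: ["Jokowi", "menghadiri", "World", "Economic", "Forum", "di", "Davos", "."]
tags = ["B-PER", "O", "B-EVT", "I-EVT", "I-EVT", "O", "B-GPE", "O"]
print(bio_to_spans(tags))  # [('PER', 0, 1), ('EVT', 2, 5), ('GPE', 6, 7)]
```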
## 📊 Model Performance

Evaluated on a held-out validation dataset:
| Metric | Score |
|---|---|
| F1 Score | 0.8387 |
| Precision | 0.8203 |
| Recall | 0.8580 |
| Training Loss | 0.0021 |
| Validation Loss | 0.1310 |
### Evaluation Details

- Metrics computed using `seqeval` (micro-averaged F1)
- Validation set has a balanced entity distribution
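As a quick sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
# Verify that the reported F1 follows from the reported precision and recall.
precision, recall = 0.8203, 0.8580
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8387, matching the table above
```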
## 🏗️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Samples | 695,108 |
| Validation Samples | 106,197 |
| Epochs | 5 |
| Learning Rate | 4e-5 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Scheduler | Linear Warmup |
| Framework | Hugging Face Transformers |
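The table above can be sketched as a Hugging Face `TrainingArguments` configuration. This is a hypothetical reconstruction, not the author's published script; in particular, the warmup length is not stated, so `warmup_ratio` below is an assumption:

```python
# Hypothetical sketch of the training setup implied by the table above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-ner",
    num_train_epochs=5,              # Epochs
    learning_rate=4e-5,              # Learning Rate
    per_device_train_batch_size=32,  # Batch Size
    optim="adamw_torch",             # AdamW
    lr_scheduler_type="linear",      # Linear schedule after warmup
    warmup_ratio=0.1,                # assumed; warmup length not stated
)
```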
## 🚀 Usage

### Quick Inference (Hugging Face Pipeline)

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="nahiar/xlm-roberta-ner",
    aggregation_strategy="simple",
)

text_id = "Jokowi menghadiri World Economic Forum di Davos."  # "Jokowi attended the World Economic Forum in Davos."
text_en = "Apple is opening a new office in Jakarta next month."

print(ner(text_id))
print(ner(text_en))
```
### Aggregation Strategy Notes

- `"simple"` → recommended; merges subword tokens into whole-word entities
- `"first"` → uses the first subword token's prediction for each word
- `"average"` → averages scores across subword tokens
- `"max"` → takes the maximum score across subword tokens
## 🎯 Intended Use Cases
- Social media Named Entity Recognition
- Comment & post filtering
- Content moderation assistance
- Political monitoring
- Brand & organization tracking
- Multilingual content intelligence systems
## ⚠️ Limitations

- Supports only the defined entity set: NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA
- Not optimized for:
  - Formal academic or legal documents
  - Extremely short or ambiguous messages
  - Heavy slang or sarcastic expressions
- Performance may degrade on highly code-mixed sentences
- The model may inherit bias from training data
## ⚖️ Ethical Considerations
This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
## 🖥️ Hardware Recommendations
- Recommended: GPU (≥ 8GB VRAM) for optimal performance
- CPU inference supported but slower
- Compatible with FP16 mixed precision for faster inference
## 📜 License
Released under the Apache 2.0 License.
Free for commercial and research use.
## 📚 Citation

```bibtex
@misc{hidayatuloh2026multilingualner,
  author    = {Nuri Hidayatuloh},
  title     = {Multilingual Named Entity Recognition for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/xlm-roberta-ner}
}
```
## 🙌 Acknowledgements
- Hugging Face Transformers
- Facebook AI Research — XLM-RoBERTa
- Open-source NLP community
- Contributors and dataset annotators