# 🌍 Multilingual Named Entity Recognition for Social Media

**Indonesian 🇮🇩 & English 🇬🇧 | XLM-RoBERTa Base**
A fine-tuned XLM-RoBERTa-Base model for Named Entity Recognition (NER) on noisy social media text.
This model is optimized for multilingual informal content commonly found on:
- Twitter / X
- TikTok
- Online forums
It supports both Bahasa Indonesia and English, making it suitable for moderation systems, social listening, and content intelligence pipelines.
## 🔍 Model Overview

- Architecture: FacebookAI/xlm-roberta-base
- Task: Token Classification (NER)
- Languages: Indonesian, English
- Domain: Informal & Social Media Text
- Training Date: 2026-02-26
## 🏷️ Supported Entity Labels
This model detects the following entity types:
| Label | Description |
|---|---|
| PER | Person |
| ORG | Organization |
| NOR | Political Organization |
| GPE | Geopolitical Entity |
| LOC | Location |
| FAC | Facility |
| LAW | Legal Entity (e.g., Undang-Undang, Indonesian statutes) |
| EVT | Event |
| WOA | Work of Art |
### Tagging Scheme

The BIO tagging format is used:

- `B-XXX` → beginning of an entity
- `I-XXX` → inside an entity
- `O` → outside any entity
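As an illustration of how BIO tags map to entity spans, here is a minimal pure-Python decoder. The `bio_to_spans` helper is hypothetical and for explanation only; the Hugging Face pipeline performs this step internally:

```python
# Minimal BIO decoder: turns per-token BIO tags into (label, start, end) spans.
def bio_to_spans(tags):
    """Convert BIO tags into (label, start_idx, end_idx_exclusive) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span first
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                       # still inside the current entity
        else:                              # "O" or a label mismatch closes the span
            if start is not None:
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans

# Tokens: ["Jokowi", "menghadiri", "World", "Economic", "Forum", "di", "Davos", "."]
tags = ["B-PER", "O", "B-EVT", "I-EVT", "I-EVT", "O", "B-GPE", "O"]
print(bio_to_spans(tags))  # [('PER', 0, 1), ('EVT', 2, 5), ('GPE', 6, 7)]
```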
## 📊 Model Performance

Evaluated on a held-out validation dataset:
| Metric | Score |
|---|---|
| F1 Score | 0.8387 |
| Precision | 0.8203 |
| Recall | 0.8580 |
| Training Loss | 0.0021 |
| Validation Loss | 0.1310 |
### Evaluation Details

- Metrics computed using `seqeval` (micro-averaged F1)
- Validation set has a balanced entity distribution
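As a quick sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
# Verify that the reported F1 follows from the reported precision and recall.
precision, recall = 0.8203, 0.8580
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8387, matching the table above
```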
## 🏗️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Samples | 695,108 |
| Validation Samples | 106,197 |
| Epochs | 5 |
| Learning Rate | 4e-5 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Scheduler | Linear Warmup |
| Framework | Hugging Face Transformers |
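The table above can be sketched as a Hugging Face `TrainingArguments` configuration. This is a hypothetical reconstruction, not the author's published script; in particular, the warmup length is not stated, so `warmup_ratio` below is an assumption:

```python
# Hypothetical sketch of the training setup implied by the table above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-ner",
    num_train_epochs=5,              # Epochs
    learning_rate=4e-5,              # Learning Rate
    per_device_train_batch_size=32,  # Batch Size
    optim="adamw_torch",             # AdamW
    lr_scheduler_type="linear",      # Linear schedule after warmup
    warmup_ratio=0.1,                # assumed; warmup length not stated
)
```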
## 🚀 Usage

### Quick Inference (Hugging Face Pipeline)

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="nahiar/xlm-roberta-ner",
    aggregation_strategy="simple",
)

text_id = "Jokowi menghadiri World Economic Forum di Davos."  # "Jokowi attended the World Economic Forum in Davos."
text_en = "Apple is opening a new office in Jakarta next month."

print(ner(text_id))
print(ner(text_en))
```
### Aggregation Strategy Notes

- `"simple"` → recommended; merges subword tokens into whole-word entities
- `"first"` → uses the first subword token's prediction for each word
- `"average"` → averages scores across subword tokens
- `"max"` → takes the maximum score across subword tokens
## 🎯 Intended Use Cases
- Social media Named Entity Recognition
- Comment & post filtering
- Content moderation assistance
- Political monitoring
- Brand & organization tracking
- Multilingual content intelligence systems
## ⚠️ Limitations

- Supports only the defined entity set: NOR, GPE, PER, ORG, EVT, LOC, LAW, FAC, WOA
- Not optimized for:
  - Formal academic or legal documents
  - Extremely short or ambiguous messages
  - Heavy slang or sarcastic expressions
- Performance may degrade on highly code-mixed sentences
- The model may inherit bias from training data
## ⚖️ Ethical Considerations
This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
## 🖥️ Hardware Recommendations
- Recommended: GPU (≥ 8GB VRAM) for optimal performance
- CPU inference supported but slower
- Compatible with FP16 mixed precision for faster inference
## 📜 License
Released under the Apache 2.0 License.
Free for commercial and research use.
## 📚 Citation

```bibtex
@misc{hidayatuloh2026multilingualner,
  author    = {Nuri Hidayatuloh},
  title     = {Multilingual Named Entity Recognition for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/xlm-roberta-ner}
}
```
## 🙌 Acknowledgements
- Hugging Face Transformers
- Facebook AI Research — XLM-RoBERTa
- Open-source NLP community
- Contributors and dataset annotators