---
language:
- multilingual
license: mit
tags:
- text-classification
- multilingual
- xlm-roberta
- topic-classification
datasets:
- Davlan/sib200
metrics:
- accuracy
- f1
---
🌍 Multilingual Topic Classifier
A multilingual text classification model, fine-tuned from xlm-roberta-base on the SIB-200 dataset, that classifies text into 7 topics across 205 languages.
Model Details
- Base model: xlm-roberta-base
- Task: Text Classification (Topic)
- Languages: 205
- Developed by: Keshav0308
Topics
| Label | Description |
|---|---|
| 🌍 geography | Geographic content |
| 🔬 science/technology | Science and tech content |
| 🎬 entertainment | Entertainment content |
| 🏛️ politics | Political content |
| 🏥 health | Health and medical content |
| ✈️ travel | Travel content |
| ⚽ sports | Sports content |
Performance
| Metric | Score |
|---|---|
| Test Accuracy | 69.17% |
| Test F1 Macro | 67.62% |
| Languages | 205 |
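The card does not say how the two scores were computed. As a minimal sketch, assuming "F1 Macro" means the unweighted mean of per-class F1 scores (the usual definition), both metrics can be reproduced in plain Python. The function names and the four-example toy labels below are illustrative only and are not taken from the evaluation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy data, for illustration only
y_true = ["health", "health", "sports", "travel"]
y_pred = ["health", "sports", "sports", "travel"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro F1 weights every topic equally, it falls below accuracy when some topics are classified noticeably worse than others, which is consistent with the gap between the two scores reported above.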
Usage
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="Keshav0308/multilingual-topic-classifier",
)

# Input can be in any language covered by the fine-tuning data
classifier("The patient was diagnosed with pneumonia.")
# [{'label': 'health', 'score': 0.999}]
classifier("El equipo ganó el campeonato mundial de fútbol.")
# [{'label': 'sports', 'score': 0.999}]
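With test accuracy around 69%, it can help to treat low-confidence predictions as unlabelled instead of trusting every output. The sketch below assumes the classifier is called with a list of strings and returns one `{'label', 'score'}` dict per input, which is how a `text-classification` pipeline behaves by default. The helper name `classify_with_threshold` and the 0.5 cut-off are hypothetical choices for illustration, not part of this model.

```python
from typing import Callable, Dict, List, Sequence


def classify_with_threshold(
    classifier: Callable[[List[str]], List[Dict]],
    texts: Sequence[str],
    min_score: float = 0.5,  # illustrative cut-off, not tuned for this model
) -> List[Dict]:
    """Classify a batch of texts and blank out labels below `min_score`."""
    results = classifier(list(texts))
    labelled = []
    for text, result in zip(texts, results):
        confident = result["score"] >= min_score
        labelled.append(
            {
                "text": text,
                # None marks a prediction the caller should not rely on
                "label": result["label"] if confident else None,
                "score": result["score"],
            }
        )
    return labelled
```

Passing the `classifier` pipeline object from the snippet above as the first argument, together with a list of sentences, returns one record per sentence. A suitable `min_score` would need to be chosen on validation data.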
Training Data
Fine-tuned on SIB-200 (Davlan/sib200), a massively multilingual topic classification dataset covering 205 languages.
- Train samples: 143,705
- Validation samples: 20,295
- Test samples: 41,820
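As a quick sanity check on the split sizes, the snippet below computes each split's share of the data and the number of examples per language. It assumes every language contributes the same number of examples. The card does not state this, but the totals are consistent with it, since each count divides evenly by 205.

```python
# Split sizes and language count as listed in this card
splits = {"train": 143_705, "validation": 20_295, "test": 41_820}
num_languages = 205

total = sum(splits.values())
per_language = {}

for name, count in splits.items():
    # divmod exposes any remainder if the even-split assumption were wrong
    per_lang, remainder = divmod(count, num_languages)
    per_language[name] = per_lang
    print(
        f"{name}: {count} samples ({count / total:.1%} of total), "
        f"{per_lang} per language, remainder {remainder}"
    )
```

Under that assumption, each language has 701 training, 99 validation, and 204 test examples.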