---
language:
- multilingual
license: mit
tags:
- text-classification
- multilingual
- xlm-roberta
- topic-classification
datasets:
- Davlan/sib200
metrics:
- accuracy
- f1
---

# 🌍 Multilingual Topic Classifier

A multilingual text classification model, fine-tuned from `xlm-roberta-base` on the SIB-200 dataset, that assigns a text to one of 7 topics. The training data covers **205 languages**.

## Model Details
- **Base model:** xlm-roberta-base
- **Task:** Text Classification (Topic)
- **Languages:** 205
- **Developed by:** Keshav0308

## Topics
| Label | Description |
|-------|-------------|
| 🌍 geography | Geographic content |
| 🔬 science/technology | Science and tech content |
| 🎬 entertainment | Entertainment content |
| 🏛️ politics | Political content |
| 🏥 health | Health and medical content |
| ✈️ travel | Travel content |
| ⚽ sports | Sports content |

## Performance
| Metric | Score |
|--------|-------|
| Test Accuracy | 69.17% |
| Test F1 Macro | 67.62% |
| Languages | 205 |

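
For readers comparing the two reported metrics, here is a minimal sketch of the standard definitions of accuracy and macro-averaged F1 in plain Python. The card does not state how its scores were computed, so this is an illustration of the usual formulas, not the author's evaluation script; the function names and the toy labels are made up for the example.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1, so every topic counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    f1_scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (hypothetical labels): the model never predicts "sports".
y_true = ["health", "health", "health", "sports"]
y_pred = ["health", "health", "health", "health"]
toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

Because macro F1 weights each topic equally, a topic the model handles poorly lowers it more than it lowers accuracy, which is one common reason for macro F1 to sit below accuracy.
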
## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Keshav0308/multilingual-topic-classifier",
)

# Input can be in any language covered by the training data.
classifier("The patient was diagnosed with pneumonia.")
# Example output: [{'label': 'health', 'score': 0.999}]

classifier("El equipo ganó el campeonato mundial de fútbol.")
# Example output: [{'label': 'sports', 'score': 0.999}]
```

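
Since the reported test accuracy is about 69%, it can help to look at the scores for all seven topics and flag low-confidence predictions. The sketch below assumes the standard `transformers` text-classification pipeline, where passing a list of texts with `top_k=None` returns one list of `{'label', 'score'}` dicts per text. The helper names `load_classifier` and `classify_batch` and the `min_score=0.5` threshold are illustrative choices, not part of the model.

```python
from typing import Any, Callable, Dict, List, Sequence

MODEL_ID = "Keshav0308/multilingual-topic-classifier"


def load_classifier():
    """Build the pipeline (downloads the model on first use)."""
    # Imported lazily so the helper below can be used without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def classify_batch(
    classifier: Callable[..., Any],
    texts: Sequence[str],
    min_score: float = 0.5,  # arbitrary illustrative threshold
) -> List[Dict[str, Any]]:
    """Score every topic for each text and flag low-confidence results."""
    # A list input with top_k=None yields, per text, one dict per label.
    all_scores = classifier(list(texts), top_k=None)
    results = []
    for text, scores in zip(texts, all_scores):
        ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
        best = ranked[0]
        results.append(
            {
                "text": text,
                "label": best["label"],
                "score": best["score"],
                "confident": best["score"] >= min_score,
                "ranked": ranked,
            }
        )
    return results
```

A call would look like `classify_batch(load_classifier(), ["The patient was diagnosed with pneumonia."])`. Predictions with `confident` set to `False` could then be routed to manual review or a fallback label.
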
## Training Data
Fine-tuned on [SIB-200](https://huggingface.co/datasets/Davlan/sib200), a massively multilingual topic classification dataset covering 205 languages.

Split sizes, summed over all languages:

- Train samples: 143,705
- Validation samples: 20,295
- Test samples: 41,820
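
As a quick sanity check on these numbers, the split totals divide evenly by the 205 languages. The sketch below works out the per-language sizes under the assumption that every language contributes the same number of examples to each split; the function name `per_language_sizes` is made up for the example.

```python
from typing import Dict

NUM_LANGUAGES = 205
SPLIT_TOTALS = {"train": 143_705, "validation": 20_295, "test": 41_820}


def per_language_sizes(totals: Dict[str, int], num_languages: int) -> Dict[str, int]:
    """Divide each split total by the language count, insisting on no remainder."""
    sizes = {}
    for split, total in totals.items():
        per_language, remainder = divmod(total, num_languages)
        if remainder:
            raise ValueError(f"{split}: {total} is not a multiple of {num_languages}")
        sizes[split] = per_language
    return sizes


PER_LANGUAGE = per_language_sizes(SPLIT_TOTALS, NUM_LANGUAGES)
print(PER_LANGUAGE)
```

Under that assumption, each language has 701 training, 99 validation, and 204 test examples, so the model sees only a small amount of labelled text per language.
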