---
language:
- multilingual
license: mit
tags:
- text-classification
- multilingual
- xlm-roberta
- topic-classification
datasets:
- Davlan/sib200
metrics:
- accuracy
- f1
---

# 🌍 Multilingual Topic Classifier

A multilingual text classification model fine-tuned on the SIB-200 dataset. It classifies text into 7 topics across **205 languages**.

## Model Details

- **Base model:** xlm-roberta-base
- **Task:** Text Classification (Topic)
- **Languages:** 205
- **Developed by:** Keshav0308

## Topics

| Label | Description |
|-------|-------------|
| 🌍 geography | Geographic content |
| 🔬 science/technology | Science and tech content |
| 🎬 entertainment | Entertainment content |
| 🏛️ politics | Political content |
| 🏥 health | Health and medical content |
| ✈️ travel | Travel content |
| ⚽ sports | Sports content |

## Performance

| Metric | Score |
|--------|-------|
| Test Accuracy | 69.17% |
| Test F1 Macro | 67.62% |
| Languages | 205 |

## Usage

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Keshav0308/multilingual-topic-classifier"
)

# The same call works for any of the supported languages.
# The pipeline returns a list with one dict per input text.
classifier("The patient was diagnosed with pneumonia.")
# [{'label': 'health', 'score': 0.999}]

classifier("El equipo ganó el campeonato mundial de fútbol.")
# [{'label': 'sports', 'score': 0.999}]
```

## Training Data

Fine-tuned on [SIB-200](https://huggingface.co/datasets/Davlan/sib200), a massively multilingual dataset covering 205 languages.

- Train samples: 143,705
- Validation samples: 20,295
- Test samples: 41,820
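
The split sizes above divide evenly by the 205 languages, which suggests that each language contributes the same number of examples to each split. The following is a minimal sketch of that arithmetic, assuming an even spread across languages. The names `SPLIT_SIZES`, `NUM_LANGUAGES`, and `per_language_counts` are hypothetical helpers introduced here for illustration, not part of the model or the dataset.

```python
# Split sizes and language count as reported in this model card.
SPLIT_SIZES = {"train": 143_705, "validation": 20_295, "test": 41_820}
NUM_LANGUAGES = 205


def per_language_counts(split_sizes, num_languages):
    """Return examples per language for each split.

    Assumption: every language contributes the same number of examples,
    so each split total must divide evenly by the language count.
    """
    counts = {}
    for split, total in split_sizes.items():
        per_lang, remainder = divmod(total, num_languages)
        if remainder:
            raise ValueError(
                f"{split}: {total} is not divisible by {num_languages}"
            )
        counts[split] = per_lang
    return counts


counts = per_language_counts(SPLIT_SIZES, NUM_LANGUAGES)
for split, n in counts.items():
    print(f"{split}: {n} examples per language")
# train: 701 examples per language
# validation: 99 examples per language
# test: 204 examples per language
```

Under that assumption, the reported test metrics are computed over roughly 204 examples per language, so per-language scores may vary noticeably around the aggregate figures.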