---
language:
- multilingual
license: mit
tags:
- text-classification
- multilingual
- xlm-roberta
- topic-classification
datasets:
- Davlan/sib200
metrics:
- accuracy
- f1
---

# 🌍 Multilingual Topic Classifier

A multilingual text classification model fine-tuned on the SIB-200 dataset, capable of classifying text into 7 topics across **205 languages**.

## Model Details
- **Base model:** xlm-roberta-base
- **Task:** Text Classification (Topic)
- **Languages:** 205
- **Developed by:** Keshav0308

## Topics
| Label | Description |
|-------|-------------|
| 🌍 geography | Geographic content |
| 🔬 science/technology | Science and tech content |
| 🎬 entertainment | Entertainment content |
| 🏛️ politics | Political content |
| 🏥 health | Health and medical content |
| ✈️ travel | Travel content |
| ⚽ sports | Sports content |

## Performance
| Metric | Score |
|--------|-------|
| Test Accuracy | 69.17% |
| Test F1 Macro | 67.62% |
| Languages | 205 |
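
For readers comparing the two metrics above, here is a minimal pure-Python sketch of how accuracy and macro F1 are conventionally computed. It is not the evaluation script used for this model; the function names and the toy labels are hypothetical and only illustrate the definitions. Macro F1 averages the per-class F1 scores with equal weight, so a weak class pulls it down more than it pulls down accuracy.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (hypothetical labels, not taken from the test set)
toy_true = ["health", "health", "sports", "travel"]
toy_pred = ["health", "sports", "sports", "travel"]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
```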

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Keshav0308/multilingual-topic-classifier"
)

# Intended for the 205 languages covered by SIB-200
classifier("The patient was diagnosed with pneumonia.")
# Example output: [{'label': 'health', 'score': 0.999}]

classifier("El equipo ganó el campeonato mundial de fútbol.")
# Example output: [{'label': 'sports', 'score': 0.999}]
```
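
Since test accuracy is around 69%, it may help to inspect the full score distribution and fall back to a placeholder when the model is unsure. The sketch below is one way to do that. It assumes a recent `transformers` release in which the text-classification pipeline accepts `top_k=None` and then returns one list of `{'label', 'score'}` dicts per input text. The helper names, the `"unknown"` fallback, and the `0.5` threshold are hypothetical choices, not part of this model.

```python
from typing import Callable, Dict, List, Sequence

MODEL_ID = "Keshav0308/multilingual-topic-classifier"


def build_classifier() -> Callable:
    """Create the pipeline lazily so importing this module needs no download."""
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def pick_label(scores: Sequence[Dict], threshold: float) -> str:
    """Return the best-scoring label, or 'unknown' if it is below threshold."""
    best = max(scores, key=lambda item: item["score"])
    return best["label"] if best["score"] >= threshold else "unknown"


def classify_batch(
    texts: List[str], classifier: Callable, threshold: float = 0.5
) -> List[str]:
    """Classify several texts at once and apply the confidence threshold."""
    # Assumption: with top_k=None the pipeline yields, for each text,
    # a list with one {'label': ..., 'score': ...} dict per topic.
    all_scores = classifier(texts, top_k=None)
    return [pick_label(scores, threshold) for scores in all_scores]
```

Calling `classify_batch(["Le match s'est terminé sur un score nul."], build_classifier())` would download the model on first use. Raise the threshold if precision matters more than coverage.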

## Training Data
Fine-tuned on [SIB-200](https://huggingface.co/datasets/Davlan/sib200), a massively multilingual topic classification dataset covering 205 languages.

- Train samples: 143,705
- Validation samples: 20,295
- Test samples: 41,820
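
The split sizes above can be sanity-checked with a little arithmetic. The sketch below assumes every language contributes the same number of examples to each split, which the card does not state explicitly. Under that assumption each total should divide evenly by 205. The function name is hypothetical.

```python
from typing import Dict

NUM_LANGUAGES = 205
SPLIT_SIZES = {"train": 143_705, "validation": 20_295, "test": 41_820}


def per_language_counts(split_sizes: Dict[str, int], num_languages: int) -> Dict[str, int]:
    """Divide each split total by the language count, rejecting uneven splits."""
    counts = {}
    for name, total in split_sizes.items():
        per_language, remainder = divmod(total, num_languages)
        if remainder:
            raise ValueError(f"{name} split is not evenly divisible by {num_languages}")
        counts[name] = per_language
    return counts


per_language = per_language_counts(SPLIT_SIZES, NUM_LANGUAGES)
total_samples = sum(SPLIT_SIZES.values())
split_fractions = {name: size / total_samples for name, size in SPLIT_SIZES.items()}

print(per_language)   # examples per language in each split
print(total_samples)  # all examples across the three splits
```

With these numbers the division is exact, giving 701 train, 99 validation, and 204 test examples per language, which is roughly a 70/10/20 split.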