XLM-RoBERTa Language Identification (20 languages)
A fine-tuned xlm-roberta-base model that detects the language of a text across 20 languages, reaching 99.6% accuracy on a 10,000-sample held-out test set.
🚀 Live demo | 💻 Code & full results
Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "SashaSk/xlm-roberta-language-id"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

# Tokenize one sentence; truncation keeps long inputs within the model's limit.
enc = tok("¿Dónde está la biblioteca?", return_tensors="pt", truncation=True)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**enc).logits

pred = model.config.id2label[int(logits.argmax(-1))]  # -> "es"
```
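The Usage snippet returns only the single most likely label. If a confidence score or a ranked list of candidates is useful, the logits can be passed through a softmax. The following is a minimal, dependency-free sketch: `top_k_languages` is a hypothetical helper, and the three-class logits and label map are made up for illustration (the real model emits 20 logits and ships its own `id2label`).

```python
import math

def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one input."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical 3-class example, not real model output.
example_id2label = {0: "es", 1: "pt", 2: "it"}
example_logits = [4.0, 1.0, 0.5]

for label, prob in top_k_languages(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, one would pass `model(**enc).logits[0].tolist()` and `model.config.id2label` instead of the example values.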
Results
Evaluated on the papluca/language-identification
test split (10,000 samples):
| Model | Accuracy | Weighted F1 |
|---|---|---|
| XLM-RoBERTa (this model) | 99.6% | 0.996 |
| Logistic Regression + TF-IDF | 89.2% | 0.890 |
| Multinomial Naive Bayes + TF-IDF | 86.1% | 0.847 |
The only non-trivial residual confusion is Hindi ↔ Urdu (per-language F1 ≈ 0.98), which is linguistically expected. Japanese, Turkish, and Chinese are classified perfectly (F1 = 1.00).
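For readers unfamiliar with the two metrics in the table, here is a small sketch of how accuracy and weighted F1 are conventionally defined. It is not the evaluation code from this repository; the function names and the toy labels are illustrative only. Weighted F1 averages per-language F1 scores, weighting each language by how many true samples it has.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy labels (made up): one Hindi sentence is mistaken for Urdu.
y_true = ["hi", "hi", "ur", "ur", "ja"]
y_pred = ["hi", "ur", "ur", "ur", "ja"]

print(f"accuracy    = {accuracy(y_true, y_pred):.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

In the toy example a single Hindi/Urdu mix-up lowers the F1 of both languages while leaving Japanese untouched, which is the same pattern the paragraph above describes at a much smaller scale.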
Supported languages
Arabic (ar), Bulgarian (bg), German (de), Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), Chinese (zh).
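The model returns short language codes such as "es". For display purposes, the list above can be turned into a lookup table. This is a convenience sketch; `LANGUAGE_NAMES` and `describe` are hypothetical names, not part of the model or the repository.

```python
# Code-to-name mapping transcribed from the supported-languages list.
LANGUAGE_NAMES = {
    "ar": "Arabic", "bg": "Bulgarian", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fr": "French", "hi": "Hindi",
    "it": "Italian", "ja": "Japanese", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "sw": "Swahili", "th": "Thai",
    "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese", "zh": "Chinese",
}

def describe(code):
    """Format a predicted code as 'Name (code)', falling back for unknown codes."""
    return f"{LANGUAGE_NAMES.get(code, 'Unknown')} ({code})"

print(describe("es"))
```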
Training
- Base: xlm-roberta-base (~270M params), fine-tuned into a 20-way sequence classifier.
- Data: papluca/language-identification (70k train / 10k val / 10k test).
- Setup: Hugging Face Trainer, max sequence length 256, gradient accumulation, best-checkpoint selection on the validation split.
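Gradient accumulation lets a small per-device batch behave like a larger one by summing gradients over several forward/backward passes before each optimizer step. The arithmetic can be sketched as below. The batch size and accumulation steps are hypothetical, since the model card does not state them; only the 70k training-set size comes from the list above. The step count is approximate, because the exact figure depends on how the trainer handles the final partial batch.

```python
import math

TRAIN_SAMPLES = 70_000  # training-set size stated in the Data bullet

def optimizer_steps_per_epoch(num_samples, per_device_batch, accumulation_steps, num_devices=1):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch * accumulation_steps * num_devices
    return effective_batch, math.ceil(num_samples / effective_batch)

# Hypothetical settings, chosen only to illustrate the calculation.
effective, steps = optimizer_steps_per_epoch(
    TRAIN_SAMPLES, per_device_batch=16, accumulation_steps=4
)
print(f"effective batch size: {effective}, ~{steps} optimizer steps per epoch")
```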
Full training/eval/serving code: github.com/SashaSkind/lang-classifier.
License
MIT.