XLM-RoBERTa Language Identification (20 languages)

Fine-tuned xlm-roberta-base that detects the language of a text across 20 languages, reaching 99.6% accuracy on a 10,000-sample held-out test set.

🚀 Live demo  |  💻 Code & full results

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "SashaSk/xlm-roberta-language-id"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

# Tokenize one sentence; truncation keeps long inputs within the model's length limit.
enc = tok("¿Dónde está la biblioteca?", return_tensors="pt", truncation=True)

# Inference only, so gradient tracking is switched off.
with torch.no_grad():
    logits = model(**enc).logits

pred = model.config.id2label[int(logits.argmax(-1))]   # -> "es"
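When a confidence score is wanted alongside the predicted label, the logits can be converted to probabilities with a softmax. The following is a minimal sketch in plain Python, not code from the model repository: `softmax`, `top_k_languages`, and the toy label map are hypothetical names introduced here for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Toy three-class example (made-up logits); the real model has 20 labels.
toy_id2label = {0: "en", 1: "es", 2: "pt"}
toy_logits = [0.5, 4.0, 2.0]
ranking = top_k_languages(toy_logits, toy_id2label, k=2)
print(ranking)
```

With the real model, `model(**enc).logits[0].tolist()` and `model.config.id2label` could be passed in place of the toy values.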

Results

Evaluated on the papluca/language-identification test split (10,000 samples):

| Model | Accuracy | Weighted F1 |
|---|---|---|
| XLM-RoBERTa (this model) | 99.6% | 0.996 |
| Logistic Regression + TF-IDF | 89.2% | 0.890 |
| Multinomial Naive Bayes + TF-IDF | 86.1% | 0.847 |

The only notable remaining confusion is Hindi ↔ Urdu (per-language F1 ≈ 0.98), which is expected because the two languages are closely related. Japanese, Turkish, and Chinese are classified perfectly (F1 = 1.00).
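For readers unfamiliar with the two metrics reported above, the sketch below shows how accuracy and support-weighted F1 are defined. It is an illustration in plain Python, not the evaluation code from the repository, and the labels are toy data invented for the example rather than model output.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy labels (not real results): one Urdu sentence mislabeled as Hindi.
y_true = ["hi", "hi", "ur", "ur", "ja"]
y_pred = ["hi", "hi", "hi", "ur", "ja"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(acc, wf1)
```

A single Hindi/Urdu mix-up lowers the F1 of both classes at once: it costs Hindi precision and Urdu recall, while an unaffected class such as Japanese stays at 1.00.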

Supported languages

Arabic (ar), Bulgarian (bg), German (de), Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), Chinese (zh).
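To display a readable name instead of a two-letter code, the list above can be kept as a lookup table. This is a small sketch that assumes the model's labels are exactly the codes listed here (the usage example returns "es"); `LANGUAGE_NAMES` and `language_name` are hypothetical names, not part of the repository.

```python
# Codes and names copied from the supported-languages list.
LANGUAGE_NAMES = {
    "ar": "Arabic", "bg": "Bulgarian", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fr": "French", "hi": "Hindi",
    "it": "Italian", "ja": "Japanese", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "sw": "Swahili", "th": "Thai",
    "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese", "zh": "Chinese",
}

def language_name(code):
    """Map a predicted code to its English name; fall back to the code itself."""
    return LANGUAGE_NAMES.get(code, code)

print(language_name("es"))
```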

Training

  • Base: xlm-roberta-base (~270M params), fine-tuned into a 20-way sequence classifier.
  • Data: papluca/language-identification (70k train / 10k val / 10k test).
  • Setup: Hugging Face Trainer, max sequence length 256, gradient accumulation, best-checkpoint selection on the validation split.
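The setup above might translate into Trainer arguments along the following lines. This is a sketch under stated assumptions, not the author's configuration: the card gives only the maximum sequence length, the use of gradient accumulation, and best-checkpoint selection, so every numeric value, the output path, and the selection metric below are placeholders.

```python
# Every value marked "hypothetical" or "assumed" is a placeholder: the card
# does not state batch size, accumulation steps, epochs, or the selection metric.
training_kwargs = {
    "output_dir": "xlmr-langid",          # hypothetical path
    "per_device_train_batch_size": 16,    # hypothetical
    "gradient_accumulation_steps": 2,     # hypothetical
    "num_train_epochs": 2,                # hypothetical
    "eval_strategy": "epoch",             # spelled evaluation_strategy in older transformers releases
    "save_strategy": "epoch",             # must match eval_strategy for best-checkpoint loading
    "load_best_model_at_end": True,       # reload the best validation checkpoint after training
    "metric_for_best_model": "accuracy",  # assumed; compute_metrics must report this key
    "greater_is_better": True,
}
MAX_LENGTH = 256  # from the card: used as max_length when tokenizing

def effective_batch_size(kwargs, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)

print(effective_batch_size(training_kwargs))
```

The dictionary could be passed as `TrainingArguments(**training_kwargs)`. Gradient accumulation trades wall-clock time for memory: with these placeholder values, each optimizer step sees 32 examples while only 16 are held on the device at once.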

Full training/eval/serving code: github.com/SashaSkind/lang-classifier.

License

MIT.
