XLM-RoBERTa Language Identification (20 languages)

A fine-tuned xlm-roberta-base model that identifies which of 20 languages a text is written in, reaching 99.6% accuracy on a held-out test set of 10,000 samples.

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "SashaSk/xlm-roberta-language-id"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

# Tokenize one sentence; truncation keeps long inputs within the model's maximum length.
enc = tok("¿Dónde está la biblioteca?", return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, so skip gradient tracking
    logits = model(**enc).logits
pred = model.config.id2label[int(logits.argmax(-1))]   # -> "es"
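
The snippet above returns only the top label. When a confidence score or a ranked list of candidates is useful, the logits can be converted to probabilities with a softmax. The following is a minimal sketch in plain Python, not part of the published code: `top_k_languages` and the three-entry `example_id2label` are hypothetical names chosen for illustration. With the real model, one would presumably pass `model(**enc).logits[0].tolist()` and `model.config.id2label` instead.

```python
import math

def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one example."""
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Illustrative values only; the real model emits 20 logits per input.
example_id2label = {0: "en", 1: "es", 2: "pt"}
example_logits = [0.2, 5.1, 2.3]
for label, p in top_k_languages(example_logits, example_id2label, k=2):
    print(f"{label}: {p:.3f}")
```

A low top probability, or a small gap between the first two candidates, can serve as a signal to treat a prediction with caution, for example on very short inputs.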

Results

Evaluated on the papluca/language-identification test split (10,000 samples):

Model	Accuracy	Weighted F1
XLM-RoBERTa (this model)	99.6%	0.996
Logistic Regression + TF-IDF	89.2%	0.890
Multinomial Naive Bayes + TF-IDF	86.1%	0.847

The only non-trivial residual confusion is Hindi ↔ Urdu (per-language F1 ≈ 0.98), which is linguistically expected. Japanese, Turkish, and Chinese are classified perfectly (F1 = 1.00).
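
For readers unfamiliar with the metric, weighted F1 is the average of the per-language F1 scores, each weighted by the number of true samples of that language. The sketch below shows the computation in plain Python under stated assumptions: it is not the evaluation code from the repository, the function names are hypothetical, and the toy labels are invented to mimic a single Hindi/Urdu confusion.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label that occurs in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true label and was missed
    support = Counter(y_true)
    scores = {}
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        scores[label] = (f1, n)
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each by its support."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total

# Toy labels (made up): one Hindi sample is misclassified as Urdu.
y_true = ["hi", "hi", "ur", "ur", "ja", "ja"]
y_pred = ["hi", "ur", "ur", "ur", "ja", "ja"]
for label, (f1, n) in sorted(per_class_f1(y_true, y_pred).items()):
    print(f"{label}: F1={f1:.3f} (support={n})")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

In the toy data the single error lowers the F1 of both Hindi (a missed sample) and Urdu (a false positive) while Japanese stays at 1.00, which is the same pattern the per-language results above describe.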

Supported languages

Arabic (ar), Bulgarian (bg), German (de), Greek (el), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swahili (sw), Thai (th), Turkish (tr), Urdu (ur), Vietnamese (vi), Chinese (zh).
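
For display purposes, the list above can be kept as a lookup table. This is a small sketch that assumes the model's labels are the two-letter codes shown in parentheses, as the `"es"` output in the usage example suggests. `LANGUAGE_NAMES` and `describe` are hypothetical names, not part of the model or its repository.

```python
# Code-to-name mapping copied from the list of supported languages above.
LANGUAGE_NAMES = {
    "ar": "Arabic", "bg": "Bulgarian", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fr": "French", "hi": "Hindi",
    "it": "Italian", "ja": "Japanese", "nl": "Dutch", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "sw": "Swahili", "th": "Thai",
    "tr": "Turkish", "ur": "Urdu", "vi": "Vietnamese", "zh": "Chinese",
}

def describe(code):
    """Return a display string such as 'Spanish (es)' for a predicted code."""
    name = LANGUAGE_NAMES.get(code)
    return f"{name} ({code})" if name else f"unsupported ({code})"

print(describe("es"))
```

Because the classifier has exactly 20 output classes, text in any other language will still be assigned one of these labels, so inputs outside this list should not be expected to produce meaningful predictions.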

Training

Base: xlm-roberta-base (~270M params), fine-tuned into a 20-way sequence classifier.
Data: papluca/language-identification (70k train / 10k val / 10k test).
Setup: Hugging Face Trainer, max sequence length 256, gradient accumulation, best-checkpoint selection on the validation split.
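
Gradient accumulation sums gradients over several small batches before each optimizer update, so the effective batch size is the per-device batch size multiplied by the number of accumulation steps (and by the number of devices). The arithmetic can be sketched as follows. The batch size and accumulation steps below are hypothetical, since the values used for this model are not stated here; only the 70,000 training examples come from the data split above.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def optimizer_steps_per_epoch(num_examples, per_device_batch,
                              accumulation_steps, num_devices=1):
    """Approximate update count per epoch.

    Frameworks differ in how they treat a final partial batch, so the
    exact number reported by a given trainer may be off by one.
    """
    size = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_examples / size)

TRAIN_EXAMPLES = 70_000  # training split size stated above

# Hypothetical settings, NOT the values used to train this model.
batch, accum = 16, 4
print(effective_batch_size(batch, accum))                        # 64
print(optimizer_steps_per_epoch(TRAIN_EXAMPLES, batch, accum))   # 1094
```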

Full training/eval/serving code: github.com/SashaSkind/lang-classifier.

License

MIT.

Safetensors

Model size: 0.3B params
Tensor type: F32

Model tree for SashaSk/xlm-roberta-language-id

Base model: FacebookAI/xlm-roberta-base
