langdetect-distilbert-10lang-CE
Fine-tuned distilbert-base-multilingual-cased for language identification across 10 European languages, with a focus on short, chat-style input (as short as 3 characters).
Model Description
Standard language detection libraries perform poorly on short texts (F1 ≈ 0.62–0.87 on inputs under 30 characters). This model was fine-tuned with a chat-oriented training distribution to close that gap, achieving F1 = 0.9680 on short texts — 10–34 percentage points above all evaluated general-purpose tools.
| Property | Value |
|---|---|
| Base model | distilbert/distilbert-base-multilingual-cased |
| Task | 10-class language classification |
| Languages | cs, de, el, en, fr, hr, hu, pl, sk, sl |
| Training samples | 1,200,000 (120k/language) |
| Max input length | 256 tokens |
| Precision | BF16 (Ampere GPU) |
Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| cs | Czech | hu | Hungarian |
| de | German | pl | Polish |
| el | Greek | sk | Slovak |
| en | English | sl | Slovenian |
| fr | French | hr | Croatian |
Usage
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="GaborMadarasz/langdetect-distilbert-10lang-CE",
    device=0,  # GPU; use -1 for CPU
)

classifier("Mikor érkezel haza?")
# [{'label': 'hu', 'score': 0.9997}]

classifier("Ok")
# [{'label': 'en', 'score': 0.9821}]

classifier("Bonjour!")
# [{'label': 'fr', 'score': 0.9993}]
```
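Because the model never abstains (0% unknown rate, see Limitations), out-of-scope inputs are always mapped onto one of the 10 labels. A minimal sketch of a confidence-threshold wrapper around the pipeline's output — `route_language` and the 0.90 threshold are illustrative choices, not part of this model:

```python
def route_language(prediction, threshold=0.90):
    """Route a low-confidence prediction to 'unknown' instead of
    trusting the argmax label.

    prediction: a dict shaped like the pipeline's output,
    e.g. {'label': 'hu', 'score': 0.9997}.
    """
    if prediction["score"] < threshold:
        return "unknown"
    return prediction["label"]

# Outputs shaped like the pipeline's:
route_language({"label": "hu", "score": 0.9997})  # 'hu'
route_language({"label": "en", "score": 0.55})    # 'unknown'
```

A suitable threshold depends on your traffic; calibrate it on held-out out-of-scope text before relying on it.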
Training Data
Training used a hybrid corpus combining three sources:
| Source | Languages | Purpose |
|---|---|---|
| Europarl (1996–2011) | cs, de, el, en, fr, hu, pl, sk, sl | Medium and long sentences |
| OpenSubtitles 2018 | All 10 languages | Short sentence fallback |
| MaCoCu-hr 2.0 | hr | Croatian web corpus |
Chat-Oriented Length Distribution
Standard corpora are dominated by long formal sentences. To reflect real chat traffic, training samples were bucket-sampled by character length:
| Bucket | Range | Share | Rationale |
|---|---|---|---|
| Short | 3–30 chars | 40% | Single words, greetings, brief replies |
| Medium | 31–80 chars | 35% | Typical chat messages |
| Long | 81–150 chars | 25% | Multi-clause sentences |
Europarl filled medium and long buckets. OpenSubtitles (film subtitles — naturally short and conversational) filled the short bucket where Europarl was insufficient.
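The bucketing scheme above can be sketched as follows — this is an illustration of the sampling logic, not the actual training script (function names and the dropping of texts outside 3–150 characters are assumptions):

```python
import random

# Character-length buckets and target shares from the table above.
BUCKETS = {"short": (3, 30), "medium": (31, 80), "long": (81, 150)}
SHARES = {"short": 0.40, "medium": 0.35, "long": 0.25}

def bucket_of(text):
    """Return the bucket name for a text, or None if outside 3-150 chars."""
    n = len(text)
    for name, (lo, hi) in BUCKETS.items():
        if lo <= n <= hi:
            return name
    return None

def sample_buckets(sentences, total, seed=0):
    """Draw up to `total` sentences following the 40/35/25 length shares."""
    rng = random.Random(seed)
    pools = {name: [] for name in BUCKETS}
    for s in sentences:
        b = bucket_of(s)
        if b is not None:
            pools[b].append(s)
    out = []
    for name, share in SHARES.items():
        k = min(int(total * share), len(pools[name]))
        out.extend(rng.sample(pools[name], k))
    return out
```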
Evaluation Results
Evaluated on a held-out test set of 150,000 samples (15,000 per language), stratified across the same three length buckets.
Overall Performance
| Metric | Value |
|---|---|
| Accuracy | 0.9839 |
| F1 Macro | 0.9839 |
| F1 Weighted | 0.9839 |
| Unknown rate | 0.0% |
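That F1 macro and F1 weighted coincide here is expected: with a perfectly balanced test set (15,000 samples per language) the class weights are uniform, so the weighted average of per-class F1 reduces to the plain mean. A minimal illustration of the two averages (hypothetical helper names, not from the evaluation code; in practice you would use `sklearn.metrics.f1_score`):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single label."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == label for p in y_pred)
    true_pos = sum(t == label for t in y_true)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true))
    counts = Counter(y_true)
    f1s = {l: per_class_f1(y_true, y_pred, l) for l in labels}
    macro = sum(f1s.values()) / len(labels)             # unweighted mean
    weighted = sum(f1s[l] * counts[l] for l in labels) / len(y_true)
    return macro, weighted

# On a balanced toy set the two values are identical:
macro_and_weighted_f1(["en", "en", "de", "de"], ["en", "en", "de", "en"])
```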
Performance by Text Length (Benchmark – 3,000 samples)
This is the most critical dimension for chat use cases:
| Length Bucket | This Model | Lingua-py | langdetect | langid.py | FastText | gcld3 |
|---|---|---|---|---|---|---|
| Short (3–30 chars) | 0.9680 | 0.8663 | 0.7215 | 0.6921 | 0.6813 | 0.6269 |
| Medium (31–80 chars) | 0.9960 | 0.9869 | 0.9568 | 0.9628 | 0.9019 | 0.9185 |
| Long (81–150 chars) | 0.9990 | 0.9990 | 0.9960 | 0.9930 | 0.9494 | 0.9392 |
Per-Language F1 (Test Set)
| Language | Precision | Recall | F1 |
|---|---|---|---|
| el (Greek) | 0.9989 | 0.9981 | 0.9985 |
| hu (Hungarian) | 0.9917 | 0.9895 | 0.9906 |
| fr (French) | 0.9889 | 0.9930 | 0.9910 |
| en (English) | 0.9826 | 0.9932 | 0.9879 |
| de (German) | 0.9848 | 0.9882 | 0.9865 |
| pl (Polish) | 0.9836 | 0.9863 | 0.9849 |
| hr (Croatian) | 0.9755 | 0.9809 | 0.9782 |
| cs (Czech) | 0.9769 | 0.9736 | 0.9753 |
| sl (Slovenian) | 0.9813 | 0.9703 | 0.9757 |
| sk (Slovak) | 0.9747 | 0.9659 | 0.9703 |
Greek achieves near-perfect scores due to its unique script. Slovak and Czech show the lowest scores due to high lexical similarity — a known challenge across all language identification systems.
Comparison with Popular Tools (F1 Macro, 3,000 samples)
| Tool | F1 Macro | Short F1 | Latency | Throughput | Unknown Rate |
|---|---|---|---|---|---|
| This model (GPU) | 0.9877 | 0.9680 | 0.83 ms | 1,202/s | 0.0% |
| Lingua-py | 0.9509 | 0.8663 | 1.03 ms | 967/s | 0.0% |
| langdetect | 0.8895 | 0.7215 | 3.29 ms | 304/s | 6.5% |
| langid.py | 0.8829 | 0.6921 | 0.11 ms | 9,142/s | 0.0% |
| FastText lid.176 | 0.8491 | 0.6813 | 0.02 ms | 45,251/s | 9.8% |
| gcld3 | 0.8326 | 0.6269 | 0.08 ms | 12,661/s | 15.6% |
Benchmark hardware: NVIDIA RTX 3060 12 GB (model on GPU, other tools on CPU).
All tools evaluated on identical stratified samples from the same test set.
Training Configuration
```python
TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,  # effective batch: 128
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,  # BF16 — native on Ampere
    eval_strategy="epoch",
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True,
)
```
Training ran the full 5 epochs without triggering early stopping (patience = 3). The best checkpoint was at epoch 4 (F1 macro = 0.9840 on the validation set).
Limitations
- Language scope: Only the 10 listed languages are supported. Inputs in other languages will be misclassified as one of the 10.
- Slavic pairs: Slovak/Czech and Slovenian/Croatian confusion is inherent to the task due to lexical overlap. This is the primary error source (sk→cs: 1.4%, sl→hr: 1.5%).
- Script sensitivity: The model relies on character-level features. Transliterated or romanized text (e.g., Greek in Latin script) may degrade performance.
- Latency: CPU inference is ~5–10× slower than GPU. Consider ONNX export for CPU-only deployments.
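For the ONNX route mentioned above, Hugging Face Optimum provides a CLI exporter. A minimal sketch — the output directory name is illustrative, and you should verify the export against your Optimum version:

```shell
pip install "optimum[exporters]"
optimum-cli export onnx \
  --model GaborMadarasz/langdetect-distilbert-10lang-CE \
  --task text-classification \
  onnx_out/
```

The exported model can then be served with ONNX Runtime on CPU.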
Intended Use
This model is intended for:
- Non-commercial and research use only
- Language routing in multilingual chat applications
- Preprocessing pipelines for NLP tasks requiring language-specific models
- Content moderation systems operating across the listed 10 languages
It is not intended for:
- Detecting languages outside the 10 supported ones
- Processing very long documents (truncated at 256 tokens)
- Production use without human review in high-stakes decisions
Citation
If you use this model, please cite the training corpora:
```bibtex
@inproceedings{koehn2005europarl,
  title     = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  author    = {Koehn, Philipp},
  booktitle = {Proceedings of the MT Summit},
  year      = {2005}
}

@inproceedings{lison2016opensubtitles,
  title     = {OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
  author    = {Lison, Pierre and Tiedemann, J{\"o}rg},
  booktitle = {LREC},
  year      = {2016}
}

@misc{macocu2022,
  title  = {MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data},
  author = {Ba{\~n}{\'o}n, Marta and others},
  year   = {2022}
}
```