
# langdetect-distilbert-10lang-CE

Fine-tuned distilbert-base-multilingual-cased for language identification across 10 European languages, with a focus on short, chat-style input (as short as 3 characters).

## Model Description

Standard language detection libraries perform poorly on short texts (F1 ≈ 0.62–0.87 on inputs under 30 characters). This model was fine-tuned with a chat-oriented training distribution to close that gap, achieving F1 = 0.9680 on short texts — 10–34 percentage points above all evaluated general-purpose tools.

| Property | Value |
|---|---|
| Base model | distilbert/distilbert-base-multilingual-cased |
| Task | 10-class language classification |
| Languages | cs, de, el, en, fr, hr, hu, pl, sk, sl |
| Training samples | 1,200,000 (120k per language) |
| Max input length | 256 tokens |
| Parameters | ~0.1B |
| Training precision | BF16 (Ampere GPU) |

## Supported Languages

| Code | Language |
|---|---|
| cs | Czech |
| de | German |
| el | Greek |
| en | English |
| fr | French |
| hr | Croatian |
| hu | Hungarian |
| pl | Polish |
| sk | Slovak |
| sl | Slovenian |

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="GaborMadarasz/langdetect-distilbert-10lang-CE",
    device=0,        # GPU; use device=-1 for CPU
)

classifier("Mikor érkezel haza?")   # Hungarian: "When are you getting home?"
# [{'label': 'hu', 'score': 0.9997}]

classifier("Ok")
# [{'label': 'en', 'score': 0.9821}]

classifier("Bonjour!")              # French: "Hello!"
# [{'label': 'fr', 'score': 0.9993}]
```

## Training Data

Training used a hybrid corpus combining three sources:

| Source | Languages | Purpose |
|---|---|---|
| Europarl (1996–2011) | cs, de, el, en, fr, hu, pl, sk, sl | Medium and long sentences |
| OpenSubtitles 2018 | all 10 languages | Short-sentence fallback |
| MaCoCu-hr 2.0 | hr | Croatian web corpus |

### Chat-Oriented Length Distribution

Standard corpora are dominated by long formal sentences. To reflect real chat traffic, training samples were bucket-sampled by character length:

| Bucket | Range | Share | Rationale |
|---|---|---|---|
| Short | 3–30 chars | 40% | Single words, greetings, brief replies |
| Medium | 31–80 chars | 35% | Typical chat messages |
| Long | 81–150 chars | 25% | Multi-clause sentences |

Europarl filled the medium and long buckets; OpenSubtitles (film subtitles, which are naturally short and conversational) filled the short bucket where Europarl was insufficient.
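The bucket sampling above can be sketched as follows. This is a simplified illustration, not the author's released preprocessing code; in particular, topping up underfilled buckets from OpenSubtitles is handled only implicitly here (an undersized bucket simply yields fewer samples):

```python
import random

BUCKETS = {          # name: ((min_chars, max_chars), target share of corpus)
    "short":  ((3, 30),   0.40),
    "medium": ((31, 80),  0.35),
    "long":   ((81, 150), 0.25),
}

def bucket_of(text):
    """Return the length bucket for a text, or None if out of range."""
    n = len(text)
    for name, ((lo, hi), _) in BUCKETS.items():
        if lo <= n <= hi:
            return name
    return None

def bucket_sample(texts, total, seed=0):
    """Draw up to `total` texts matching the 40/35/25 length distribution."""
    rng = random.Random(seed)
    pools = {name: [] for name in BUCKETS}
    for t in texts:
        b = bucket_of(t)
        if b is not None:
            pools[b].append(t)
    sample = []
    for name, (_, share) in BUCKETS.items():
        k = min(int(total * share), len(pools[name]))
        sample.extend(rng.sample(pools[name], k))
    return sample
```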

## Evaluation Results

Evaluated on a held-out test set of 150,000 samples (15,000 per language), stratified across the same three length buckets.

### Overall Performance

| Metric | Value |
|---|---|
| Accuracy | 0.9839 |
| F1 Macro | 0.9839 |
| F1 Weighted | 0.9839 |
| Unknown rate | 0.0% |

### Performance by Text Length (Benchmark, 3,000 samples)

This is the most critical dimension for chat use cases:

| Length Bucket | This Model | Lingua-py | langdetect | langid.py | FastText | gcld3 |
|---|---|---|---|---|---|---|
| Short (3–30 chars) | 0.9680 | 0.8663 | 0.7215 | 0.6921 | 0.6813 | 0.6269 |
| Medium (31–80 chars) | 0.9960 | 0.9869 | 0.9568 | 0.9628 | 0.9019 | 0.9185 |
| Long (81–150 chars) | 0.9990 | 0.9990 | 0.9960 | 0.9930 | 0.9494 | 0.9392 |

### Per-Language F1 (Test Set)

| Language | Precision | Recall | F1 |
|---|---|---|---|
| el (Greek) | 0.9989 | 0.9981 | 0.9985 |
| hu (Hungarian) | 0.9917 | 0.9895 | 0.9906 |
| fr (French) | 0.9889 | 0.9930 | 0.9910 |
| en (English) | 0.9826 | 0.9932 | 0.9879 |
| de (German) | 0.9848 | 0.9882 | 0.9865 |
| pl (Polish) | 0.9836 | 0.9863 | 0.9849 |
| hr (Croatian) | 0.9755 | 0.9809 | 0.9782 |
| cs (Czech) | 0.9769 | 0.9736 | 0.9753 |
| sl (Slovenian) | 0.9813 | 0.9703 | 0.9757 |
| sk (Slovak) | 0.9747 | 0.9659 | 0.9703 |

Greek achieves near-perfect scores due to its unique script. Slovak and Czech show the lowest scores due to high lexical similarity — a known challenge across all language identification systems.

### Comparison with Popular Tools (F1 Macro, 3,000 samples)

| Tool | F1 Macro | Short F1 | Latency | Throughput | Unknown Rate |
|---|---|---|---|---|---|
| This model (GPU) | 0.9877 | 0.9680 | 0.83 ms | 1,202/s | 0.0% |
| Lingua-py | 0.9509 | 0.8663 | 1.03 ms | 967/s | 0.0% |
| langdetect | 0.8895 | 0.7215 | 3.29 ms | 304/s | 6.5% |
| langid.py | 0.8829 | 0.6921 | 0.11 ms | 9,142/s | 0.0% |
| FastText lid.176 | 0.8491 | 0.6813 | 0.02 ms | 45,251/s | 9.8% |
| gcld3 | 0.8326 | 0.6269 | 0.08 ms | 12,661/s | 15.6% |

*Benchmark hardware: NVIDIA RTX 3060 12 GB (this model on GPU, all other tools on CPU). All tools were evaluated on identical stratified samples from the same test set.*

## Training Configuration

```python
from transformers import TrainingArguments

TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,     # effective batch size: 128
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,                         # BF16 is native on Ampere GPUs
    eval_strategy="epoch",
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True,
)
```

Training ran for 5 epochs with early stopping (patience = 3); the best checkpoint, epoch 4, scored F1 macro = 0.9840 on the validation set.
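Checkpoint selection keys on `f1_macro`, which the `Trainer` only sees if a `compute_metrics` callback returns it under that exact name. The sketch below is illustrative, not the author's published code; macro F1 is written out in pure Python for clarity, where `sklearn.metrics.f1_score(..., average="macro")` would normally be used:

```python
def f1_macro(preds, labels, num_classes):
    """Macro F1: the unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / num_classes

def compute_metrics(eval_pred):
    """transformers.Trainer callback: (logits, labels) -> metrics dict.
    The returned key must match metric_for_best_model ('f1_macro')."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return {"f1_macro": f1_macro(preds, list(labels), num_classes=len(logits[0]))}
```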

## Limitations

- **Language scope:** only the 10 listed languages are supported; input in any other language will be misclassified as one of the 10.
- **Slavic pairs:** Slovak/Czech and Slovenian/Croatian confusion is inherent to the task due to lexical overlap, and is the primary error source (sk→cs: 1.4%, sl→hr: 1.5%).
- **Script sensitivity:** the model relies on character-level features; transliterated or romanized text (e.g., Greek written in Latin script) may degrade performance.
- **Latency:** CPU inference is roughly 5–10× slower than GPU; consider ONNX export for CPU-only deployments.

## Intended Use

This model is intended for:

- Non-commercial and research use only
- Language routing in multilingual chat applications
- Preprocessing pipelines for NLP tasks that require language-specific models
- Content moderation systems operating across the 10 listed languages

It is not intended for:

- Detecting languages outside the 10 supported ones
- Processing very long documents (input is truncated at 256 tokens)
- Production use without human review in high-stakes decisions
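For documents past the 256-token limit, one workaround (not part of this model card) is to classify fixed-size chunks and take a confidence-weighted vote. A sketch, assuming a `classify` callable that returns `[{'label': ..., 'score': ...}]` per input, like the pipeline above:

```python
from collections import defaultdict

def detect_document(text, classify, chunk_chars=400):
    """Split a long document into ~chunk_chars pieces, classify each,
    and return the label with the highest summed confidence."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    votes = defaultdict(float)
    for chunk in chunks:
        pred = classify(chunk)[0]
        votes[pred["label"]] += pred["score"]
    return max(votes, key=votes.get)
```

Chunking by characters is a crude proxy for tokens; sentence-boundary splitting would avoid cutting words mid-chunk.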

## Contact

gabor.madarasz@gmail.com

## Citation

If you use this model, please cite the training corpora:

```bibtex
@inproceedings{koehn2005europarl,
  title     = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  author    = {Koehn, Philipp},
  booktitle = {Proceedings of the MT Summit},
  year      = {2005}
}

@inproceedings{lison2016opensubtitles,
  title     = {OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
  author    = {Lison, Pierre and Tiedemann, J{\"o}rg},
  booktitle = {LREC},
  year      = {2016}
}

@misc{macocu2022,
  title  = {MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data},
  author = {Ba{\~n}{\'o}n, Marta and others},
  year   = {2022}
}
```