
# langdetect-distilbert-10lang-CE

Fine-tuned distilbert-base-multilingual-cased for language identification across 10 European languages, with a focus on short, chat-style input (as short as 3 characters).

## Model Description

Standard language detection libraries perform poorly on short texts (F1 ≈ 0.62–0.87 on inputs under 30 characters). This model was fine-tuned with a chat-oriented training distribution to close that gap, achieving F1 = 0.9680 on short texts — 10–34 percentage points above all evaluated general-purpose tools.

| Property | Value |
|---|---|
| Base model | distilbert/distilbert-base-multilingual-cased |
| Task | 10-class language classification |
| Languages | cs, de, el, en, fr, hr, hu, pl, sk, sl |
| Training samples | 1,200,000 (120k per language) |
| Max input length | 256 tokens |
| Parameters | ~0.1B |
| Training precision | BF16 (Ampere GPU) |

## Supported Languages

| Code | Language |
|---|---|
| cs | Czech |
| de | German |
| el | Greek |
| en | English |
| fr | French |
| hr | Croatian |
| hu | Hungarian |
| pl | Polish |
| sk | Slovak |
| sl | Slovenian |

## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="GaborMadarasz/langdetect-distilbert-10lang-CE",
    device=0,        # GPU; use device=-1 for CPU
)

classifier("Mikor érkezel haza?")   # Hungarian: "When are you getting home?"
# [{'label': 'hu', 'score': 0.9997}]

classifier("Ok")
# [{'label': 'en', 'score': 0.9821}]

classifier("Bonjour!")              # French: "Hello!"
# [{'label': 'fr', 'score': 0.9993}]
```

## Training Data

Training used a hybrid corpus combining three sources:

| Source | Languages | Purpose |
|---|---|---|
| Europarl (1996–2011) | cs, de, el, en, fr, hu, pl, sk, sl | Medium and long sentences |
| OpenSubtitles 2018 | all 10 languages | Short-sentence fallback |
| MaCoCu-hr 2.0 | hr | Croatian web corpus |

### Chat-Oriented Length Distribution

Standard corpora are dominated by long formal sentences. To reflect real chat traffic, training samples were bucket-sampled by character length:

| Bucket | Range | Share | Rationale |
|---|---|---|---|
| Short | 3–30 chars | 40% | Single words, greetings, brief replies |
| Medium | 31–80 chars | 35% | Typical chat messages |
| Long | 81–150 chars | 25% | Multi-clause sentences |

Europarl filled the medium and long buckets; OpenSubtitles (film subtitles, which are naturally short and conversational) filled the short bucket where Europarl was insufficient.
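The bucket sampling above can be sketched as follows. This is a simplified illustration, not the author's released preprocessing code; in particular, topping up underfilled buckets from OpenSubtitles is handled only implicitly here (an undersized bucket simply yields fewer samples):

```python
import random

BUCKETS = {          # name: ((min_chars, max_chars), target share of corpus)
    "short":  ((3, 30),   0.40),
    "medium": ((31, 80),  0.35),
    "long":   ((81, 150), 0.25),
}

def bucket_of(text):
    """Return the length bucket for a text, or None if out of range."""
    n = len(text)
    for name, ((lo, hi), _) in BUCKETS.items():
        if lo <= n <= hi:
            return name
    return None

def bucket_sample(texts, total, seed=0):
    """Draw up to `total` texts matching the 40/35/25 length distribution."""
    rng = random.Random(seed)
    pools = {name: [] for name in BUCKETS}
    for t in texts:
        b = bucket_of(t)
        if b is not None:
            pools[b].append(t)
    sample = []
    for name, (_, share) in BUCKETS.items():
        k = min(int(total * share), len(pools[name]))
        sample.extend(rng.sample(pools[name], k))
    return sample
```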

## Evaluation Results

Evaluated on a held-out test set of 150,000 samples (15,000 per language), stratified across the same three length buckets.

### Overall Performance

| Metric | Value |
|---|---|
| Accuracy | 0.9839 |
| F1 Macro | 0.9839 |
| F1 Weighted | 0.9839 |
| Unknown rate | 0.0% |

### Performance by Text Length (Benchmark, 3,000 samples)

This is the most critical dimension for chat use cases:

| Length Bucket | This Model | Lingua-py | langdetect | langid.py | FastText | gcld3 |
|---|---|---|---|---|---|---|
| Short (3–30 chars) | 0.9680 | 0.8663 | 0.7215 | 0.6921 | 0.6813 | 0.6269 |
| Medium (31–80 chars) | 0.9960 | 0.9869 | 0.9568 | 0.9628 | 0.9019 | 0.9185 |
| Long (81–150 chars) | 0.9990 | 0.9990 | 0.9960 | 0.9930 | 0.9494 | 0.9392 |

### Per-Language F1 (Test Set)

| Language | Precision | Recall | F1 |
|---|---|---|---|
| el (Greek) | 0.9989 | 0.9981 | 0.9985 |
| hu (Hungarian) | 0.9917 | 0.9895 | 0.9906 |
| fr (French) | 0.9889 | 0.9930 | 0.9910 |
| en (English) | 0.9826 | 0.9932 | 0.9879 |
| de (German) | 0.9848 | 0.9882 | 0.9865 |
| pl (Polish) | 0.9836 | 0.9863 | 0.9849 |
| hr (Croatian) | 0.9755 | 0.9809 | 0.9782 |
| cs (Czech) | 0.9769 | 0.9736 | 0.9753 |
| sl (Slovenian) | 0.9813 | 0.9703 | 0.9757 |
| sk (Slovak) | 0.9747 | 0.9659 | 0.9703 |

Greek achieves near-perfect scores due to its unique script. Slovak and Czech show the lowest scores due to high lexical similarity — a known challenge across all language identification systems.

### Comparison with Popular Tools (F1 Macro, 3,000 samples)

| Tool | F1 Macro | Short F1 | Latency | Throughput | Unknown Rate |
|---|---|---|---|---|---|
| This model (GPU) | 0.9877 | 0.9680 | 0.83 ms | 1,202/s | 0.0% |
| Lingua-py | 0.9509 | 0.8663 | 1.03 ms | 967/s | 0.0% |
| langdetect | 0.8895 | 0.7215 | 3.29 ms | 304/s | 6.5% |
| langid.py | 0.8829 | 0.6921 | 0.11 ms | 9,142/s | 0.0% |
| FastText lid.176 | 0.8491 | 0.6813 | 0.02 ms | 45,251/s | 9.8% |
| gcld3 | 0.8326 | 0.6269 | 0.08 ms | 12,661/s | 15.6% |

*Benchmark hardware: NVIDIA RTX 3060 12 GB (this model on GPU, all other tools on CPU). All tools were evaluated on identical stratified samples from the same test set.*

## Training Configuration

```python
from transformers import TrainingArguments

TrainingArguments(
    num_train_epochs=5,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,     # effective batch size: 128
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,                         # BF16 is native on Ampere GPUs
    eval_strategy="epoch",
    metric_for_best_model="f1_macro",
    load_best_model_at_end=True,
)
```

Training ran for 5 epochs with early stopping (patience = 3); the best checkpoint, epoch 4, scored F1 macro = 0.9840 on the validation set.
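Checkpoint selection keys on `f1_macro`, which the `Trainer` only sees if a `compute_metrics` callback returns it under that exact name. The sketch below is illustrative, not the author's published code; macro F1 is written out in pure Python for clarity, where `sklearn.metrics.f1_score(..., average="macro")` would normally be used:

```python
def f1_macro(preds, labels, num_classes):
    """Macro F1: the unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / num_classes

def compute_metrics(eval_pred):
    """transformers.Trainer callback: (logits, labels) -> metrics dict.
    The returned key must match metric_for_best_model ('f1_macro')."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return {"f1_macro": f1_macro(preds, list(labels), num_classes=len(logits[0]))}
```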

## Limitations

- **Language scope:** only the 10 listed languages are supported; input in any other language will be misclassified as one of the 10.
- **Slavic pairs:** Slovak/Czech and Slovenian/Croatian confusion is inherent to the task due to lexical overlap, and is the primary error source (sk→cs: 1.4%, sl→hr: 1.5%).
- **Script sensitivity:** the model relies on character-level features; transliterated or romanized text (e.g., Greek written in Latin script) may degrade performance.
- **Latency:** CPU inference is roughly 5–10× slower than GPU; consider ONNX export for CPU-only deployments.

## Intended Use

This model is intended for:

- Non-commercial and research use only
- Language routing in multilingual chat applications
- Preprocessing pipelines for NLP tasks that require language-specific models
- Content moderation systems operating across the 10 listed languages

It is not intended for:

- Detecting languages outside the 10 supported ones
- Processing very long documents (input is truncated at 256 tokens)
- Production use without human review in high-stakes decisions
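For documents past the 256-token limit, one workaround (not part of this model card) is to classify fixed-size chunks and take a confidence-weighted vote. A sketch, assuming a `classify` callable that returns `[{'label': ..., 'score': ...}]` per input, like the pipeline above:

```python
from collections import defaultdict

def detect_document(text, classify, chunk_chars=400):
    """Split a long document into ~chunk_chars pieces, classify each,
    and return the label with the highest summed confidence."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    votes = defaultdict(float)
    for chunk in chunks:
        pred = classify(chunk)[0]
        votes[pred["label"]] += pred["score"]
    return max(votes, key=votes.get)
```

Chunking by characters is a crude proxy for tokens; sentence-boundary splitting would avoid cutting words mid-chunk.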

## Contact

gabor.madarasz@gmail.com

## Citation

If you use this model, please cite the training corpora:

```bibtex
@inproceedings{koehn2005europarl,
  title     = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  author    = {Koehn, Philipp},
  booktitle = {Proceedings of the MT Summit},
  year      = {2005}
}

@inproceedings{lison2016opensubtitles,
  title     = {OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
  author    = {Lison, Pierre and Tiedemann, J{\"o}rg},
  booktitle = {LREC},
  year      = {2016}
}

@misc{macocu2022,
  title  = {MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data},
  author = {Ba{\~n}{\'o}n, Marta and others},
  year   = {2022}
}
```