| --- |
| library_name: transformers |
| license: mit |
| base_model: xlm-roberta-base |
| tags: |
| - language-detection |
| - language-identification |
| metrics: |
| - precision |
| - recall |
| - f1 |
| - accuracy |
| model-index: |
| - name: polyglot-tagger |
| results: [] |
| datasets: |
| - wikimedia/wikipedia |
| - HuggingFaceFW/finetranslations |
| - google/smol |
| - DerivedFunction/nlp-noise-snippets |
| - DerivedFunction/wikipedia-language-snippets-filtered |
| - DerivedFunction/finetranslations-filtered |
| - DerivedFunction/additional-language-snippets |
| pipeline_tag: token-classification |
| language: |
| - en |
| - es |
| - fr |
| - de |
| - it |
| - pt |
| - nl |
| - vi |
| - tr |
| - la |
| - id |
| - ms |
| - af |
| - sq |
| - is |
| - no |
| - sv |
| - da |
| - fi |
| - hu |
| - pl |
| - cs |
| - ro |
| - ru |
| - bg |
| - uk |
| - sr |
| - be |
| - kk |
| - mk |
| - mn |
| - zh |
| - ja |
| - ko |
| - hi |
| - ur |
| - bn |
| - ta |
| - te |
| - mr |
| - gu |
| - kn |
| - ml |
| - pa |
| - as |
| - or |
| - ar |
| - fa |
| - ps |
| - sd |
| - ug |
| - el |
| - he |
| - hy |
| - ka |
| - am |
| - km |
| - lo |
| - my |
| - th |
| - si |
| - bo |
| - dv |
| - ti |
| - sw |
| - eu |
| --- |
| |
|
|
| # Polyglot Tagger: 67L (Experimental) |
|
|
| This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base). |
| It achieves the following results on the evaluation set: |
| - Loss: 0.0404 |
| - Precision: 0.8848 |
| - Recall: 0.9012 |
| - F1: 0.8929 |
| - Accuracy: 0.9909 |
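
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. The helper name `f1_score` is introduced here for illustration only; how the evaluation aggregates tokens or spans is not stated in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported on the evaluation set above.
reported_precision = 0.8848
reported_recall = 0.9012

recomputed_f1 = round(f1_score(reported_precision, reported_recall), 4)
print(recomputed_f1)  # 0.8929, matching the reported F1
```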
|
|
| ## Model description |
|
|
| Polyglot Tagger 66L is a token-classification approach to identifying the languages in multilingual documents. Because it is trained to tag tokens within individual sentences, the model |
| generalizes well across a variety of languages, behaves like a multi-label classifier at the document level, and can extract sentences by language. |
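
A minimal sketch of how the tagger might be used to extract sentences by language. It assumes the standard `transformers` token-classification pipeline with `aggregation_strategy="simple"`, and it assumes the model's labels are language codes such as `en` or `fr`. The `model_id` argument is a placeholder you must supply; the function names are hypothetical.

```python
from collections import defaultdict


def group_language_spans(text, entities):
    """Group aggregated pipeline entities into {label: [text spans]}.

    `entities` is expected to look like the output of a token-classification
    pipeline with aggregation enabled: dicts with `entity_group`, `start`, `end`.
    """
    spans = defaultdict(list)
    for entity in sorted(entities, key=lambda e: e["start"]):
        snippet = text[entity["start"]:entity["end"]].strip()
        if snippet:
            spans[entity["entity_group"]].append(snippet)
    return dict(spans)


def tag_languages(text, model_id):
    """Run the tagger on `text`. `model_id` is the Hub id or local path of the checkpoint."""
    # Imported lazily so the grouping helper works without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge adjacent tokens that share a label
    )
    return group_language_spans(text, tagger(text))


# Offline illustration with hand-written entities (not real model output).
sample_text = "Hello world. Bonjour le monde."
sample_entities = [
    {"entity_group": "en", "start": 0, "end": 12},
    {"entity_group": "fr", "start": 13, "end": 30},
]
grouped = group_language_spans(sample_text, sample_entities)
print(grouped)
```

Because each span carries its own label, a document containing several languages yields several keys, which is what gives the multi-label behavior at the document level.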
|
|
| ## Intended uses & limitations |
| This model can be treated as a base model for further fine-tuning on specific language identification extraction tasks. |
| Note that, as a general language tagging model, it can confuse closely related languages, especially on short texts. Examples include English and German, Spanish and Portuguese, and Russian and Ukrainian. |
|
|
| The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short or ambiguous statements. Note that this model is experimental |
| and may produce unexpected results compared to generic text classifiers. It is trained on cleaned text, so "messy" text may produce unexpected results. |
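
Given the four-token minimum, callers may want to flag very short inputs before trusting the tags. The sketch below is one way to do that. The helper `is_too_short` is hypothetical, and whitespace splitting is only a stand-in: the card does not say which tokenization the four-token minimum refers to, so pass the model tokenizer's `tokenize` method if subword tokens are the relevant unit.

```python
def is_too_short(text, tokenize=str.split, min_tokens=4):
    """Return True when `text` has fewer than `min_tokens` tokens.

    `tokenize` is any callable mapping a string to a list of tokens.
    The default (whitespace split) is an assumption made for illustration.
    """
    return len(tokenize(text)) < min_tokens


for candidate in ["Hola", "This is a longer sentence"]:
    status = "below training minimum" if is_too_short(candidate) else "ok"
    print(f"{candidate!r}: {status}")
```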
|
|
| ### Training and Evaluation Data |
| The model was trained on a synthetic dataset of roughly **3 million samples**, covering 67 languages across diverse script families |
| (Latin, Cyrillic, Indic, Arabic, Han, etc.). Sentences were drawn from the following sources: |

| * `wikimedia/wikipedia`: up to 200,000 individual sentences (120,000 reserve) from up to 100,000 unique articles, taken from the first half of Wikipedia after filtering out stubs. |
| * `google/smol`: up to 1,000 individual sentences. |
| * `HuggingFaceFW/finetranslations`: up to 50,000 sentences (30,000 reserve) from up to 50,000 unique rows. |
| * Additional sentences from various sources for major languages (`en`, `es`, `pt`, `ru`, `hi`, `de`, `fr`, etc.): up to 50,000 sentences (30,000 reserve) from up to 100,000 unique rows. |

| The sentences are split into a reserve set, used for pure documents, and a main set, used for everything else. |
|
|
| A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. |
|
|
| The data composition follows a strategic curriculum: |
|
|
| * **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language. |
| * **30% Homogeneous Mixed:** Documents containing one main language, with clear transitions between two or more languages, to train boundary detection. |
| * **10% Mixed with Noise:** Integration of "neutral" spans including code snippets, mathematical notation, emojis, symbols, and `rot_13` text tagged as `O` or their respective source to reduce hallucination. |
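
The composition above can be sketched as a small sampler. This is an illustration under stated assumptions, not the actual data pipeline: the function names, the way secondary languages are chosen, the single noise snippet per row, and the space separator are all hypothetical. Only the 60/30/10 split, the 1-4 sentences per row, and the `O` label come from this card.

```python
import random

NOISE_LABEL = "O"
ROW_KINDS = ["pure", "mixed", "mixed_noise"]
ROW_WEIGHTS = [60, 30, 10]  # percentages from the curriculum above


def sample_row_kind(rng):
    """Pick a row type according to the 60/30/10 split."""
    return rng.choices(ROW_KINDS, weights=ROW_WEIGHTS, k=1)[0]


def build_row(rng, sentences_by_lang, noise_snippets):
    """Build one synthetic row with character-level (start, end, label) spans."""
    kind = sample_row_kind(rng)
    num_sentences = rng.randint(1, 4)
    languages = sorted(sentences_by_lang)
    main_language = rng.choice(languages)

    parts = []
    for index in range(num_sentences):
        # Pure rows stay in one language; mixed rows start in the main
        # language and may switch afterwards (assumed strategy).
        if kind == "pure" or index == 0:
            language = main_language
        else:
            language = rng.choice(languages)
        parts.append((rng.choice(sentences_by_lang[language]), language))

    if kind == "mixed_noise":
        # Insert one neutral snippet at a random position, tagged as "O".
        position = rng.randint(0, len(parts))
        parts.insert(position, (rng.choice(noise_snippets), NOISE_LABEL))

    pieces, spans, cursor = [], [], 0
    for snippet, label in parts:
        if pieces:
            cursor += 1  # account for the joining space
        spans.append((cursor, cursor + len(snippet), label))
        pieces.append(snippet)
        cursor += len(snippet)
    return {"kind": kind, "text": " ".join(pieces), "spans": spans}


toy_sentences = {
    "en": ["The weather is nice today.", "She reads a book every week."],
    "es": ["El tiempo es agradable hoy.", "Ella lee un libro cada semana."],
}
toy_noise = ["x = y + 1;", "2 + 2 = 4"]

rng = random.Random(0)
row = build_row(rng, toy_sentences, toy_noise)
print(row["kind"], row["spans"])
```

Character spans like these would still need to be aligned to tokenizer offsets before they can serve as token-classification labels.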
|
|
| ### Supported Languages and Limitations (66) |
| The model supports the following ISO-coded languages: |
| `af, am, ar, as, be, bg, bn, bo, cs, da, de, dv, el, en, es, eu, fa, fi, fr, gu, he, hi, |
| hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, ms, my, nl, no, |
| or, pa, pl, ps, pt, ro, ru, sd, si, sq, sr, sv, sw, ta, te, th, ti, tr, ug, uk, ur, vi, zh` |
|
|
| > Note that Romanized versions of languages, such as Romanized Russian and Hindi, are not included in the training set. |
|
|
|
|
| ### Training hyperparameters |
|
|
| The following hyperparameters were used during training: |
| - learning_rate: 5e-05 |
| - train_batch_size: 72 |
| - eval_batch_size: 36 |
| - seed: 42 |
| - gradient_accumulation_steps: 2 |
| - total_train_batch_size: 144 |
| - optimizer: ADAMW_TORCH_FUSED with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments |
| - lr_scheduler_type: linear |
| - num_epochs: 2 |
| - mixed_precision_training: Native AMP |
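
For readers who want to reproduce a similar run, the list above maps roughly onto `transformers.TrainingArguments` as sketched below. Several points are assumptions: `train_batch_size` is treated as the per-device value on a single device (consistent with 72 × 2 = 144), "Native AMP" is mapped to `fp16=True` although it could equally mean `bf16`, and `output_dir` is a hypothetical path.

```python
# Keyword arguments mirroring the hyperparameter list; key names follow
# transformers.TrainingArguments.
TRAINING_KWARGS = {
    "output_dir": "polyglot-tagger-run",  # hypothetical output path
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 72,
    "per_device_eval_batch_size": 36,
    "seed": 42,
    "gradient_accumulation_steps": 2,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 2,
    "fp16": True,  # assumption: "Native AMP" could also correspond to bf16
}


def effective_batch_size(kwargs, num_devices=1):
    """Samples consumed per optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_arguments():
    """Create the TrainingArguments object (needs transformers and a GPU for fp16)."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


print(effective_batch_size(TRAINING_KWARGS))  # 144, the total_train_batch_size above
```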
|
|
| ### Training results |
|
|
| | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | |
| |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:| |
| | 0.0404 | 0.1206 | 2500 | 0.0649 | 0.7944 | 0.8616 | 0.8266 | 0.9868 | |
| | 0.0394 | 0.2412 | 5000 | 0.0538 | 0.8181 | 0.8696 | 0.8430 | 0.9893 | |
| | 0.0345 | 0.3618 | 7500 | 0.0456 | 0.8355 | 0.8781 | 0.8563 | 0.9906 | |
| | 0.0280 | 0.4824 | 10000 | 0.0493 | 0.8404 | 0.8836 | 0.8614 | 0.9897 | |
| | 0.0286 | 0.6030 | 12500 | 0.0515 | 0.8425 | 0.8805 | 0.8611 | 0.9889 | |
| | 0.0275 | 0.7236 | 15000 | 0.0423 | 0.8371 | 0.8852 | 0.8605 | 0.9905 | |
| | 0.0209 | 0.8442 | 17500 | 0.0429 | 0.8671 | 0.8908 | 0.8788 | 0.9911 | |
| | 0.0265 | 0.9648 | 20000 | 0.0379 | 0.8550 | 0.8881 | 0.8712 | 0.9919 | |
| | 0.0223 | 1.0854 | 22500 | 0.0371 | 0.8665 | 0.8967 | 0.8814 | 0.9918 | |
| | 0.0220 | 1.2060 | 25000 | 0.0344 | 0.8687 | 0.8954 | 0.8818 | 0.9926 | |
| | 0.0225 | 1.3266 | 27500 | 0.0332 | 0.8776 | 0.9011 | 0.8892 | 0.9928 | |
| | 0.0186 | 1.4472 | 30000 | 0.0390 | 0.8711 | 0.9018 | 0.8862 | 0.9920 | |
| | 0.0200 | 1.5678 | 32500 | 0.0315 | 0.8840 | 0.9046 | 0.8942 | 0.9931 | |
| | 0.0170 | 1.6884 | 35000 | 0.0313 | 0.8867 | 0.9066 | 0.8965 | 0.9932 | |
| | 0.0170 | 1.8090 | 37500 | 0.0305 | 0.8804 | 0.9034 | 0.8918 | 0.9933 | |
| | 0.0176 | 1.9296 | 40000 | 0.0305 | 0.8866 | 0.9058 | 0.8961 | 0.9935 | |
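
The table can be cross-checked against the stated dataset size. Assuming the Step column counts optimizer steps and each step consumes the total train batch size of 144, the epoch fraction at a given step gives an estimate of the number of training samples per epoch. The epoch values are rounded to four decimals, so this is only an approximation.

```python
TOTAL_TRAIN_BATCH_SIZE = 144  # from the hyperparameter list


def estimate_samples_per_epoch(step, epoch_fraction, batch_size=TOTAL_TRAIN_BATCH_SIZE):
    """Estimate training-set size from one (step, epoch) row of the log."""
    steps_per_epoch = step / epoch_fraction
    return steps_per_epoch * batch_size


# First and last rows of the table above.
estimate_first = estimate_samples_per_epoch(2500, 0.1206)
estimate_last = estimate_samples_per_epoch(40000, 1.9296)

print(f"{estimate_first:,.0f} samples per epoch (from step 2500)")
print(f"{estimate_last:,.0f} samples per epoch (from step 40000)")
```

Both rows give just under 3 million samples per epoch, in line with the "roughly 3 million samples" stated for the training set.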
|
|
|
|
| ### Framework versions |
|
|
| - Transformers 5.0.0 |
| - Pytorch 2.10.0+cu128 |
| - Datasets 4.0.0 |
| - Tokenizers 0.22.2 |