---
library_name: transformers
license: mit
base_model: xlm-roberta-base
tags:
- language-detection
- language-identification
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: polyglot-tagger
  results: []
datasets:
- wikimedia/wikipedia
- HuggingFaceFW/finetranslations
- google/smol
- DerivedFunction/nlp-noise-snippets
- DerivedFunction/wikipedia-language-snippets-filtered
- DerivedFunction/finetranslations-filtered
pipeline_tag: token-classification
language:
- en
- es
- fr
- de
- it
- pt
- nl
- vi
- tr
- la
- id
- ms
- af
- sq
- is
- no
- sv
- da
- fi
- hu
- pl
- cs
- ro
- ru
- bg
- uk
- sr
- be
- kk
- mk
- mn
- zh
- ja
- ko
- hi
- ur
- bn
- ta
- te
- mr
- gu
- kn
- ml
- pa
- as
- or
- ar
- fa
- ps
- sd
- ug
- el
- he
- hy
- ka
- am
- km
- lo
- my
- th
---

# Polyglot Tagger: 60L (Experimental)

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
It achieves the following results on the evaluation set:
- Loss: 0.0404
- Precision: 0.8848
- Recall: 0.9012
- F1: 0.8929
- Accuracy: 0.9909

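As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the two values above. The helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the evaluation set above.
f1 = f1_score(0.8848, 0.9012)
print(f"{f1:.4f}")  # 0.8929
```
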
## Model description

Polyglot Tagger 60L takes a token-classification approach to language identification in multilingual documents. Because it is trained to tag the tokens of individual sentences,
it generalizes well across a variety of languages, doubles as a multi-label classifier, and can extract sentences by language.

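A minimal sketch of how token-level output can be turned into a multi-label result with extracted text. It assumes predictions shaped like the dictionaries returned by the `transformers` token-classification pipeline when an aggregation strategy is set (`entity_group`, `score`, `start`, `end`). The label names and the sample predictions are made up for illustration and were not produced by the model; `spans_by_language` is a hypothetical helper.

```python
from collections import defaultdict


def spans_by_language(text, predictions):
    """Group predicted spans by language label and return the matching text."""
    grouped = defaultdict(list)
    for pred in sorted(predictions, key=lambda p: p["start"]):
        grouped[pred["entity_group"]].append(text[pred["start"]:pred["end"]])
    return dict(grouped)


english = "The weather is nice today."
german = "Das Wetter ist heute schön."
text = f"{english} {german}"

# Hypothetical predictions for illustration only (not real model output).
predictions = [
    {"entity_group": "en", "score": 0.99, "start": 0, "end": len(english)},
    {"entity_group": "de", "score": 0.98, "start": len(english) + 1, "end": len(text)},
]

result = spans_by_language(text, predictions)
print(sorted(result))  # languages present: the multi-label view
print(result["de"])    # text extracted for German
```

The set of keys gives the multi-label classification, while the values give the extracted sentences per language.
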
## Intended uses & limitations
This model can serve as a base model for further fine-tuning on specific language identification and extraction tasks.
As a general language tagging model, it can be confused by closely related languages or by short texts. Examples of easily confused pairs are English and German, Spanish and Portuguese, and Russian and Ukrainian.

The model was trained on sentences with a minimum of four tokens, so it may misclassify very short or ambiguous statements. Note that this model is experimental
and may produce unexpected results compared to generic text classifiers. Because it was trained on cleaned text, "messy" text may produce different results.

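These limitations can be turned into simple review flags around the model's output. The sketch below is a hypothetical post-processing step, not part of the model: `review_flags` and its inputs are assumptions, and `num_tokens` is expected to come from the model's own tokenizer.

```python
# Pairs named above as easy to confuse.
CONFUSABLE_PAIRS = [{"en", "de"}, {"es", "pt"}, {"ru", "uk"}]
MIN_TOKENS = 4  # minimum sentence length used in training


def review_flags(num_tokens, predicted_labels):
    """Return human-readable reasons to double-check a prediction."""
    flags = []
    if num_tokens < MIN_TOKENS:
        flags.append("input shorter than the training minimum")
    labels = set(predicted_labels)
    for pair in CONFUSABLE_PAIRS:
        if pair <= labels:  # both members of the pair were predicted
            flags.append("both " + " and ".join(sorted(pair)) + " predicted")
    return flags


print(review_flags(3, ["en"]))
print(review_flags(12, ["es", "pt"]))
```
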
### Training and Evaluation Data
The model was trained on a synthetic dataset of roughly **2.5 million samples**, covering 60 languages across diverse script families
(Latin, Cyrillic, Indic, Arabic, Han, etc.). Sentences were drawn from three sources:

* `wikimedia/wikipedia`: up to 200,000 individual sentences, with 120,000 in reserve, from up to 100,000 unique articles, taken from the first half of Wikipedia after filtering out stubs.
* `google/smol`: up to 1,000 individual sentences.
* `HuggingFaceFW/finetranslations`: up to 50,000 sentences, with 30,000 in reserve, from up to 50,000 unique rows.

The sentences are split into a reserve set, used for pure documents, and a main set, used for everything else.

Each synthetic training row consists of 1-4 individual, mostly independent sentences extracted from these sources.

The data composition follows a strategic curriculum:

* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
* **30% Homogeneous Mixed:** Documents with one main language and clear transitions between two or more languages, to train boundary detection.
* **10% Mixed with Noise:** Mixed documents with added "neutral" spans (code snippets, mathematical notation, emojis, symbols, and `rot_13` text), tagged as `O` or as their respective source, to reduce hallucination.

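The composition above can be sketched as a small sampler. This is not the author's data pipeline: the function name, the sentence pools, and the way noise is inserted are assumptions for illustration. Only the 60/30/10 split, the 1-4 sentences per row, and the idea of tagging `rot_13` text as `O` come from the description.

```python
import codecs
import random

ROW_TYPES = ["pure", "mixed", "mixed_with_noise"]
ROW_WEIGHTS = [0.6, 0.3, 0.1]  # proportions stated above


def make_row(pools, rng):
    """Build one synthetic row as a list of (sentence, label) pairs.

    `pools` maps a language code to a list of sentences (hypothetical input).
    """
    row_type = rng.choices(ROW_TYPES, weights=ROW_WEIGHTS, k=1)[0]
    n_sentences = rng.randint(1, 4)  # 1-4 sentences per row
    main_lang = rng.choice(sorted(pools))

    row = []
    for i in range(n_sentences):
        # Pure rows stay in one language; mixed rows start with the main
        # language and may switch language for later sentences.
        if row_type == "pure" or i == 0:
            lang = main_lang
        else:
            lang = rng.choice(sorted(pools))
        row.append((rng.choice(pools[lang]), lang))

    if row_type == "mixed_with_noise":
        # rot_13 text is no longer natural language, so it is tagged "O" here.
        noise = codecs.encode(rng.choice(pools[main_lang]), "rot_13")
        row.insert(rng.randint(0, len(row)), (noise, "O"))
    return row_type, row


pools = {
    "en": ["This is a test sentence."],
    "fr": ["Ceci est une phrase de test."],
}
row_type, row = make_row(pools, random.Random(42))
print(row_type, row)
```
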
### Supported Languages and Limitations (60)
The model supports the following languages, listed by ISO 639-1 code:
`af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, mk, ml, mn, mr, ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, tr, ug, uk, ur, vi, zh`

> Note that Romanized versions of languages, such as Romanized Russian or Hindi, are not included in the training set.

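Because Romanized text is outside the training distribution, one possible guard is to check the script of the input whenever a language that is not normally written in Latin script is predicted. The sketch below uses only the standard `unicodedata` module. The helper names, the 0.5 threshold, and the set of Latin-script codes (derived here from the language list above) are assumptions.

```python
import unicodedata

# Supported languages that are normally written in Latin script
# (an assumption made for this sketch).
LATIN_SCRIPT_LANGS = {
    "af", "cs", "da", "de", "en", "es", "fi", "fr", "hu", "id", "is", "it",
    "la", "ms", "nl", "no", "pl", "pt", "ro", "sq", "sv", "tr", "vi",
}


def latin_ratio(text):
    """Fraction of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters)


def looks_romanized(text, predicted_lang, threshold=0.5):
    """Flag mostly-Latin text that was tagged with a non-Latin-script language."""
    return predicted_lang not in LATIN_SCRIPT_LANGS and latin_ratio(text) > threshold


print(looks_romanized("Privet, kak dela?", "ru"))  # True
print(looks_romanized("Привет, как дела?", "ru"))  # False
```
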
The per-group coverage, measured on a sample, is as follows:

| Group | Examples | Tokens |
| --- | --- | --- |
| English | 47 | 3947 |
| Russian | 47 | 3665 |
| German | 58 | 4625 |
| Japanese | 50 | 4188 |
| Chinese | 60 | 4131 |
| French | 40 | 3723 |
| Spanish | 44 | 4756 |
| Portuguese | 27 | 2130 |
| Italian | 57 | 5178 |
| Polish | 25 | 1753 |
| Dutch | 35 | 2315 |
| SoutheastAsianLatin | 114 | 8861 |
| CentralEuropeanLatin | 125 | 9761 |
| Korean | 38 | 3958 |
| EastSlavicCyrillic | 85 | 7471 |
| Arabic | 45 | 2508 |
| BalkanCyrillic | 71 | 6231 |
| Hindi | 33 | 3251 |
| IndicOther | 261 | 40630 |
| CentralAsianCyrillic | 57 | 3789 |
| AfricanLatin | 82 | 5910 |
| OtherScripts | 269 | 28603 |

Top token languages:

| Language | Tokens |
| --- | --- |
| ml | 8197 |
| it | 5178 |
| ta | 4903 |
| he | 4873 |
| es | 4756 |
| de | 4625 |
| kn | 4613 |
| pa | 4457 |
| ja | 4188 |
| zh | 4131 |
| uk | 4007 |
| ko | 3958 |

## Evaluation
> Please note that these results do not imply that token classification can substitute for sequence classification.

### Results on the `papluca/language-identification` test set
| Language | Correct | Total | Accuracy |
|----------|---------|-------|----------|
| ar | 114 | 114 | 100.0% |
| bg | 109 | 110 | 99.1% |
| de | 104 | 106 | 98.1% |
| el | 106 | 106 | 100.0% |
| **en*** | **73** | **95** | **76.8%** |
| es | 102 | 104 | 98.1% |
| fr | 102 | 102 | 100.0% |
| hi | 85 | 87 | 97.7% |
| it | 98 | 101 | 97.0% |
| ja | 94 | 94 | 100.0% |
| nl | 95 | 97 | 97.9% |
| pl | 100 | 104 | 96.2% |
| pt | 100 | 101 | 99.0% |
| ru | 116 | 117 | 99.1% |
| th | 108 | 108 | 100.0% |
| tr | 83 | 83 | 100.0% |
| ur | 92 | 94 | 97.9% |
| vi | 87 | 87 | 100.0% |
| zh | 100 | 100 | 100.0% |

> Because the training data is slightly biased toward English text, the model may produce English tokens instead of the target language for Latin-script languages.

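This card does not state how a single label per test sentence was derived from the token tags. One plausible reduction, shown here purely as an assumption, is a majority vote over the non-`O` token labels, followed by the usual correct/total accuracy. The function names and the toy label sequence are illustrative.

```python
from collections import Counter


def document_language(token_labels):
    """Most frequent non-"O" label, or None when every token is "O"."""
    counts = Counter(label for label in token_labels if label != "O")
    return counts.most_common(1)[0][0] if counts else None


def accuracy_percent(correct, total):
    """Accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


print(document_language(["en", "en", "O", "de", "en"]))  # en
print(accuracy_percent(73, 95))  # 76.8, matching the English row above
```
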
### Results on the `mikaberidze/lid200` test set, which is derived from `Davlan/sib200`

| Language | Correct | Total | Accuracy |
|----------|---------|-------|----------|
| af | 204 | 204 | 100.0% |
| am | 204 | 204 | 100.0% |
| as | 204 | 204 | 100.0% |
| be | 204 | 204 | 100.0% |
| bg | 204 | 204 | 100.0% |
| bn | 204 | 204 | 100.0% |
| cs | 204 | 204 | 100.0% |
| da | 203 | 204 | 99.5% |
| de | 204 | 204 | 100.0% |
| el | 204 | 204 | 100.0% |
| en | 204 | 204 | 100.0% |
| es | 204 | 204 | 100.0% |
| fi | 204 | 204 | 100.0% |
| fr | 204 | 204 | 100.0% |
| gu | 204 | 204 | 100.0% |
| he | 204 | 204 | 100.0% |
| hi | 204 | 204 | 100.0% |
| hu | 204 | 204 | 100.0% |
| hy | 204 | 204 | 100.0% |
| id | 198 | 204 | 97.1% |
| is | 204 | 204 | 100.0% |
| it | 204 | 204 | 100.0% |
| ja | 204 | 204 | 100.0% |
| ka | 204 | 204 | 100.0% |
| kk | 204 | 204 | 100.0% |
| km | 204 | 204 | 100.0% |
| kn | 204 | 204 | 100.0% |
| ko | 204 | 204 | 100.0% |
| lo | 204 | 204 | 100.0% |
| mk | 203 | 204 | 99.5% |
| ml | 204 | 204 | 100.0% |
| mr | 204 | 204 | 100.0% |
| my | 204 | 204 | 100.0% |
| nl | 203 | 204 | 99.5% |
| pa | 204 | 204 | 100.0% |
| pl | 204 | 204 | 100.0% |
| pt | 204 | 204 | 100.0% |
| ro | 204 | 204 | 100.0% |
| ru | 204 | 204 | 100.0% |
| sd | 204 | 204 | 100.0% |
| sr | 204 | 204 | 100.0% |
| sv | 204 | 204 | 100.0% |
| ta | 204 | 204 | 100.0% |
| te | 204 | 204 | 100.0% |
| th | 204 | 204 | 100.0% |
| tr | 204 | 204 | 100.0% |
| ug | 204 | 204 | 100.0% |
| uk | 204 | 204 | 100.0% |
| ur | 204 | 204 | 100.0% |
| vi | 204 | 204 | 100.0% |
| zh | 408 | 408 | 100.0% |

> Caution: the training data includes text from Wikipedia and `HuggingFaceFW/finetranslations`, which may skew these results.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 72
- eval_batch_size: 36
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 144
- optimizer: AdamW (`OptimizerNames.ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP

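The batch-size figures are related by simple arithmetic, and the step and epoch columns of the training log give a rough estimate of the training-set size. The sketch assumes a single device (the device count is not listed) and uses the first logged row of the training results, step 2500 at epoch 0.1447.

```python
train_batch_size = 72
gradient_accumulation_steps = 2
num_devices = 1  # assumption: not stated in the hyperparameter list

total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 144

# Step 2500 corresponds to epoch 0.1447 in the training log.
steps_per_epoch = 2500 / 0.1447
approx_train_samples = steps_per_epoch * total_train_batch_size
print(f"{approx_train_samples / 1e6:.2f}M")  # 2.49M
```

The estimate is in line with the "roughly 2.5 million samples" stated for the synthetic dataset.
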
### Training results

| Training Loss | Epoch  | Step  | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.0465        | 0.1447 | 2500  | 0.0819          | 0.7945    | 0.8602 | 0.8260 | 0.9828   |
| 0.0440        | 0.2894 | 5000  | 0.0703          | 0.8023    | 0.8662 | 0.8330 | 0.9843   |
| 0.0351        | 0.4342 | 7500  | 0.0611          | 0.8427    | 0.8800 | 0.8609 | 0.9860   |
| 0.0314        | 0.5789 | 10000 | 0.0593          | 0.8542    | 0.8851 | 0.8694 | 0.9872   |
| 0.0329        | 0.7236 | 12500 | 0.0563          | 0.8394    | 0.8781 | 0.8583 | 0.9868   |
| 0.0281        | 0.8683 | 15000 | 0.0488          | 0.8595    | 0.8853 | 0.8722 | 0.9886   |
| 0.0274        | 1.0130 | 17500 | 0.0477          | 0.8623    | 0.8904 | 0.8761 | 0.9894   |
| 0.0236        | 1.1577 | 20000 | 0.0483          | 0.8675    | 0.8933 | 0.8802 | 0.9894   |
| 0.0235        | 1.3025 | 22500 | 0.0461          | 0.8720    | 0.8933 | 0.8825 | 0.9901   |
| 0.0195        | 1.4472 | 25000 | 0.0439          | 0.8755    | 0.8954 | 0.8853 | 0.9903   |
| 0.0222        | 1.5919 | 27500 | 0.0442          | 0.8765    | 0.8964 | 0.8863 | 0.9901   |
| 0.0194        | 1.7366 | 30000 | 0.0438          | 0.8803    | 0.8993 | 0.8897 | 0.9902   |
| 0.0200        | 1.8814 | 32500 | 0.0404          | 0.8848    | 0.9012 | 0.8929 | 0.9909   |

### Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2