| --- |
| library_name: transformers |
| license: mit |
| base_model: xlm-roberta-base |
| tags: |
| - generated_from_trainer |
| metrics: |
| - precision |
| - recall |
| - f1 |
| - accuracy |
| language: |
| - multilingual |
| - af |
| - am |
| - ar |
| - as |
| - ba |
| - be |
| - bg |
| - bn |
| - bo |
| - br |
| - bs |
| - ca |
| - ce |
| - ckb |
| - cs |
| - cy |
| - da |
| - de |
| - dv |
| - el |
| - en |
| - eo |
| - es |
| - et |
| - eu |
| - fa |
| - fi |
| - fr |
| - ga |
| - gd |
| - gl |
| - gu |
| - he |
| - hi |
| - hr |
| - hu |
| - hy |
| - id |
| - is |
| - it |
| - ja |
| - jv |
| - ka |
| - kk |
| - km |
| - kn |
| - ko |
| - ku |
| - ky |
| - la |
| - lb |
| - lo |
| - lt |
| - lv |
| - mg |
| - mk |
| - ml |
| - mn |
| - mr |
| - ms |
| - mt |
| - my |
| - ne |
| - nl |
| - no |
| - ny |
| - oc |
| - om |
| - or |
| - pa |
| - pl |
| - ps |
| - pt |
| - rm |
| - ro |
| - ru |
| - sd |
| - si |
| - sk |
| - sl |
| - so |
| - sq |
| - sr |
| - su |
| - sv |
| - sw |
| - ta |
| - te |
| - tg |
| - th |
| - ti |
| - tl |
| - tr |
| - tt |
| - ug |
| - uk |
| - ur |
| - uz |
| - vi |
| - yo |
| - zh |
| - zu |
| model-index: |
| - name: polyglot-tagger |
| results: [] |
| datasets: |
| - wikimedia/wikipedia |
| - HuggingFaceFW/finetranslations |
| - google/smol |
| - DerivedFunction/nlp-noise-snippets |
| - DerivedFunction/wikipedia-language-snippets-filtered |
| - DerivedFunction/finetranslations-filtered |
| - DerivedFunction/language-ner |
| pipeline_tag: token-classification |
| --- |
| |
|  |
|
|
|
|
| This model is experimental; see `polyglot-tagger-v2` for the latest version. |
|
|
| Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 languages. |
| The model predicts BIO-style language tags over tokens, which makes it useful for |
| language identification, code-switch detection, and multilingual document analysis. |
|
|
|
|
|
|
| ## Model description (Experimental Version) |
|
|
| Introducing Polyglot Tagger, a new way to classify multilingual documents. Because it is trained specifically for token classification on individual sentences, the model |
| generalizes well across a variety of languages, behaves as a multi-label classifier, and can extract sentences by language. |
|
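To make the description above concrete, here is a minimal sketch of how token-level BIO language tags could be merged into per-language spans. The label format (`B-en`, `I-en`, `O`), the whitespace-split tokens, and the helper name `group_bio_spans` are assumptions for illustration; check the model's configuration for its actual label names.

```python
from typing import List, Optional, Tuple


def group_bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (language, text) spans.

    Assumes tags of the form "B-<lang>", "I-<lang>", or "O" (hypothetical format).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, List[str]]] = []
    current_lang: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # An untagged token closes whatever span was open.
            current_lang = None
            continue
        prefix, _, lang = tag.partition("-")
        if prefix == "B" or lang != current_lang:
            # "B-" always starts a span; a stray "I-" with a new language does too.
            spans.append((lang, [token]))
        else:
            spans[-1][1].append(token)
        current_lang = lang
    return [(lang, " ".join(words)) for lang, words in spans]


tokens = ["Good", "morning", "everyone", ".", "Guten", "Morgen", "zusammen", "."]
tags = ["B-en", "I-en", "I-en", "I-en", "B-de", "I-de", "I-de", "I-de"]
print(group_bio_spans(tokens, tags))
# [('en', 'Good morning everyone .'), ('de', 'Guten Morgen zusammen .')]
```

The set of languages across the returned spans gives the multi-label view of a document, and the span texts give the extracted sentences.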
|
| ## Intended uses & limitations |
| This model can be treated as a base model for further fine-tuning on specific language identification or extraction tasks. |
| As a general language tagging model, it can confuse languages from the same family and can struggle with short texts. Examples include English and German, Spanish and Portuguese, and Russian and Ukrainian. |
|
|
| The model is trained on sentences with a minimum of four tokens, so it may not accurately classify very short or ambiguous statements. Note that this model is experimental |
| and may produce unexpected results compared to generic text classifiers. It is trained on cleaned text, so "messy" text may produce unexpected results. |
|
|
| > Note that Romanized versions of languages, such as Romanized Russian and Hindi, are not included in the training set. |
|
|
| ### Training and Evaluation Data |
| A synthetic training row consists of 1-4 individual and mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage details, |
| are found in `DerivedFunction/language-ner`. |
|
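As a rough illustration of what such a row might look like, the sketch below samples 1-4 sentences from a small pool and labels each token with a BIO language tag. This is not the author's data pipeline: the sentence pool, the whitespace tokenization, the `B-<lang>`/`I-<lang>` label format, and the helper name `build_synthetic_row` are all hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical sentence pool of (language code, sentence) pairs.
POOL: List[Tuple[str, str]] = [
    ("en", "The weather is nice today ."),
    ("de", "Das Wetter ist heute schön ."),
    ("es", "Hoy hace muy buen tiempo ."),
    ("fr", "Il fait très beau aujourd'hui ."),
    ("it", "Oggi il tempo è bello ."),
]


def build_synthetic_row(
    pool: List[Tuple[str, str]], rng: random.Random, max_sentences: int = 4
) -> Tuple[List[str], List[str]]:
    """Concatenate 1..max_sentences sampled sentences and tag every token."""
    k = rng.randint(1, min(max_sentences, len(pool)))
    tokens: List[str] = []
    tags: List[str] = []
    for lang, sentence in rng.sample(pool, k):
        words = sentence.split()  # whitespace split stands in for real tokenization
        if not words:
            continue
        tokens.extend(words)
        # First token of each sentence opens a span; the rest continue it.
        tags.extend([f"B-{lang}"] + [f"I-{lang}"] * (len(words) - 1))
    return tokens, tags


row_tokens, row_tags = build_synthetic_row(POOL, random.Random(42))
for token, tag in zip(row_tokens, row_tags):
    print(f"{token}\t{tag}")
```

In a real setup the word-level tags would still need to be aligned to the subword tokens produced by the `xlm-roberta-base` tokenizer.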
|
|
|
| ## Training procedure |
|
|
| ### Training hyperparameters |
|
|
| The following hyperparameters were used during training: |
| - learning_rate: 5e-05 |
| - train_batch_size: 72 |
| - eval_batch_size: 36 |
| - seed: 42 |
| - gradient_accumulation_steps: 2 |
| - total_train_batch_size: 144 |
| - optimizer: AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments |
| - lr_scheduler_type: linear |
| - num_epochs: 2 |
| - mixed_precision_training: Native AMP |
|
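The relationship between these values can be sketched in a few lines. The snippet below reproduces the effective batch size and shows how a linear schedule decays the learning rate. It assumes a single device and zero warmup steps, neither of which is stated in this card, and it estimates the total step count from the last row of the training results table (step 55000 at epoch 1.9899).

```python
# Values listed in the hyperparameters above.
per_device_batch_size = 72
gradient_accumulation_steps = 2
base_learning_rate = 5e-5
num_epochs = 2

# Assumption: one device. The card does not state the device count.
num_devices = 1
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 144

# Estimate steps per epoch from the results table: step 55000 is epoch 1.9899.
steps_per_epoch = 55000 / 1.9899
total_steps = round(num_epochs * steps_per_epoch)


def linear_lr(step: int, total: int, base_lr: float = base_learning_rate,
              warmup_steps: int = 0) -> float:
    """Linear warmup (optional) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, total - step) / max(1, total - warmup_steps)


for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the learning rate starts at 5e-05, is roughly halved midway through training, and reaches zero at the final step.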
|
| ### Training results |
|
|
| The model achieves the following results on the evaluation set: |
| - Loss: 0.0452 |
| - Precision: 0.8626 |
| - Recall: 0.8916 |
| - F1: 0.8769 |
| - Accuracy: 0.9892 |
|
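As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The helper name `f1_from_precision_recall` below is illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Final evaluation figures reported above.
f1 = f1_from_precision_recall(0.8626, 0.8916)
print(round(f1, 4))  # 0.8769
```

The large gap between accuracy (0.9892) and F1 (0.8769) would be expected if precision and recall are scored per span while accuracy is scored per token, but the card does not state how the metrics were computed.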
|
| | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy | |
| |:-------------:|:------:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:| |
| | 0.0730 | 0.0905 | 2500 | 0.1081 | 0.7241 | 0.8260 | 0.7717 | 0.9760 | |
| | 0.0622 | 0.1809 | 5000 | 0.1276 | 0.6822 | 0.8122 | 0.7416 | 0.9724 | |
| | 0.0556 | 0.2714 | 7500 | 0.0826 | 0.7701 | 0.8463 | 0.8064 | 0.9813 | |
| | 0.0504 | 0.3618 | 10000 | 0.0763 | 0.7916 | 0.8562 | 0.8226 | 0.9822 | |
| | 0.0480 | 0.4523 | 12500 | 0.0703 | 0.8025 | 0.8602 | 0.8304 | 0.9839 | |
| | 0.0408 | 0.5427 | 15000 | 0.0750 | 0.8072 | 0.8637 | 0.8345 | 0.9837 | |
| | 0.0443 | 0.6332 | 17500 | 0.0652 | 0.8149 | 0.8657 | 0.8395 | 0.9849 | |
| | 0.0403 | 0.7236 | 20000 | 0.0647 | 0.8298 | 0.8728 | 0.8507 | 0.9859 | |
| | 0.0413 | 0.8141 | 22500 | 0.0590 | 0.8253 | 0.8686 | 0.8464 | 0.9865 | |
| | 0.0367 | 0.9045 | 25000 | 0.0582 | 0.8288 | 0.8743 | 0.8510 | 0.9867 | |
| | 0.0395 | 0.9950 | 27500 | 0.0583 | 0.8304 | 0.8768 | 0.8530 | 0.9862 | |
| | 0.0338 | 1.0854 | 30000 | 0.0567 | 0.8353 | 0.8783 | 0.8562 | 0.9869 | |
| | 0.0291 | 1.1759 | 32500 | 0.0537 | 0.8443 | 0.8786 | 0.8611 | 0.9878 | |
| | 0.0300 | 1.2663 | 35000 | 0.0521 | 0.8435 | 0.8805 | 0.8616 | 0.9878 | |
| | 0.0269 | 1.3568 | 37500 | 0.0531 | 0.8515 | 0.8859 | 0.8683 | 0.9879 | |
| | 0.0295 | 1.4472 | 40000 | 0.0517 | 0.8548 | 0.8882 | 0.8712 | 0.9882 | |
| | 0.0279 | 1.5377 | 42500 | 0.0489 | 0.8550 | 0.8884 | 0.8714 | 0.9884 | |
| | 0.0281 | 1.6281 | 45000 | 0.0480 | 0.8551 | 0.8875 | 0.8710 | 0.9887 | |
| | 0.0277 | 1.7186 | 47500 | 0.0467 | 0.8605 | 0.8904 | 0.8752 | 0.9888 | |
| | 0.0289 | 1.8090 | 50000 | 0.0458 | 0.8599 | 0.8919 | 0.8756 | 0.9892 | |
| | 0.0268 | 1.8995 | 52500 | 0.0457 | 0.8623 | 0.8906 | 0.8762 | 0.9891 | |
| | 0.0306 | 1.9899 | 55000 | 0.0452 | 0.8626 | 0.8916 | 0.8769 | 0.9892 | |
|
|
|
|
| ### Framework versions |
|
|
| - Transformers 5.0.0 |
| - PyTorch 2.10.0+cu128 |
| - Datasets 4.0.0 |
| - Tokenizers 0.22.2 |