library_name: transformers
license: mit
base_model: xlm-roberta-base
tags:
  - generated_from_trainer
  - language-identification
metrics:
  - precision
  - recall
  - f1
  - accuracy
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - ba
  - be
  - bg
  - bn
  - bo
  - br
  - bs
  - ca
  - ce
  - ckb
  - cs
  - cy
  - da
  - de
  - dv
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - ga
  - gd
  - gl
  - gu
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lb
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ny
  - oc
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - rm
  - ro
  - ru
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - ti
  - tl
  - tr
  - tt
  - ug
  - uk
  - ur
  - uz
  - vi
  - yo
  - yi
  - zh
  - zu
model-index:
  - name: polyglot-tagger
    results: []
datasets:
  - wikimedia/wikipedia
  - HuggingFaceFW/finetranslations
  - google/smol
  - polyglot-tagger/nlp-noise-snippets
  - polyglot-tagger/wikipedia-language-snippets-filtered
  - polyglot-tagger/finetranslations-filtered
  - polyglot-tagger/tatoeba-filtered
pipeline_tag: text-classification

Polyglot Tagger: Multi-label Language Identification

This model is trained on the same dataset as polyglot-tagger/language-identification, but as a text classifier rather than as a token classifier. Refer to that model card for details on the data and labels.
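To illustrate what multi-label text classification means here, the sketch below turns one logit per language into a set of detected languages. It assumes a sigmoid over each logit and a 0.5 threshold, which is the usual multi-label setup; this card does not state the actual activation or threshold. The labels and logits are made up for illustration.

```python
import math

def decode_multilabel(logits, labels, threshold=0.5):
    """Return (label, probability) for every label at or above the threshold.

    Assumption: each language is scored independently with a sigmoid,
    so a mixed-language input can yield more than one label.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [(lab, round(p, 3)) for lab, p in zip(labels, probs) if p >= threshold]

# Hypothetical scores for a text that mixes English and French.
labels = ["de", "en", "fr"]
logits = [-3.0, 2.0, 1.0]

detected = decode_multilabel(logits, labels)
print(detected)
```

Unlike a softmax classifier, which picks exactly one language, this decoding can return zero, one, or several languages for the same input.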

This model is a fine-tuned version of xlm-roberta-base. It achieves the following results on the evaluation set:

Loss: 0.0123
Precision: 0.9859
Recall: 0.9831
F1: 0.9845
Accuracy: 0.9412
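As a quick sanity check, the reported F1 is consistent with the harmonic mean of the reported precision and recall. The snippet below only redoes that arithmetic; it makes no claim about how the metrics were averaged across languages.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the evaluation results above.
f1 = f1_score(0.9859, 0.9831)
print(round(f1, 4))
```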

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
seed: 42
gradient_accumulation_steps: 18
total_train_batch_size: 576
optimizer: fused AdamW (OptimizerNames.ADAMW_TORCH_FUSED) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: linear
num_epochs: 2
mixed_precision_training: Native AMP
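The hyperparameters above fit together as follows. This is a minimal sketch, assuming a single device and no warmup (neither a device count nor warmup steps are listed here). The total of 17094 optimizer steps is taken from the last row of the training results, and `linear_lr` is an illustrative helper, not a library function.

```python
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 18
NUM_DEVICES = 1        # assumption: not stated in this card
BASE_LR = 5e-05
TOTAL_STEPS = 17094    # final step in the training results

# Examples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

def linear_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(effective_batch)
for step in (0, 8547, 17094):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate starts at 5e-05, is halved at the end of the first epoch, and reaches zero at the final step.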

Training results

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.2186	0.2925	2500	0.0395	0.9778	0.9528	0.9651	0.8560
0.1331	0.5851	5000	0.0232	0.9803	0.9717	0.9760	0.9070
0.1044	0.8776	7500	0.0172	0.9828	0.9774	0.9801	0.9218
0.0851	1.1700	10000	0.0150	0.9844	0.9801	0.9822	0.9311
0.0783	1.4626	12500	0.0136	0.9859	0.9809	0.9834	0.9354
0.0705	1.7551	15000	0.0126	0.9861	0.9826	0.9843	0.9399
0.0692	2.0	17094	0.0123	0.9859	0.9831	0.9845	0.9412
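The epoch and step columns also give a rough idea of the training set size. The sketch below derives steps per epoch from the final row and multiplies by the total train batch size of 576. The result is only an estimate: it assumes every optimizer step consumed a full batch, and the true number of examples is not stated in this card.

```python
# (epoch, step) pairs copied from the training results table.
log = [
    (0.2925, 2500), (0.5851, 5000), (0.8776, 7500), (1.1700, 10000),
    (1.4626, 12500), (1.7551, 15000), (2.0, 17094),
]
TOTAL_TRAIN_BATCH = 576  # from the hyperparameters

final_epoch, final_step = log[-1]
steps_per_epoch = final_step / final_epoch

# Largest gap between logged epochs and the epochs implied by steps_per_epoch.
max_gap = max(abs(step / steps_per_epoch - epoch) for epoch, step in log)

# Estimate, assuming full batches at every step.
approx_examples_per_epoch = int(steps_per_epoch * TOTAL_TRAIN_BATCH)

print(steps_per_epoch, approx_examples_per_epoch, round(max_gap, 4))
```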

Framework versions

Transformers 5.5.4
Pytorch 2.11.0+cu128
Datasets 4.8.4
Tokenizers 0.22.2