DerivedFunction committed c9a12a8 (verified) · 1 parent: a48d33f

Update README.md

Files changed (1): README.md (+43 −11)
@@ -11,14 +11,14 @@ metrics:
 - f1
 - accuracy
 model-index:
-- name: language-identification
   results: []
 ---
 
 
-# Language Identification
 
-This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0404
 - Precision: 0.8848
@@ -31,14 +31,45 @@ It achieves the following results on the evaluation set:
 More information needed
 
 ## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
 
 ### Training hyperparameters
 
@@ -79,3 +110,4 @@ The following hyperparameters were used during training:
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
 - Tokenizers 0.22.2
 
 
 - f1
 - accuracy
 model-index:
+- name: polyglot-tagger
   results: []
 ---
 
 
+# Polyglot Tagger: 60L
 
+This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
 It achieves the following results on the evaluation set:
 - Loss: 0.0404
 - Precision: 0.8848
 
 More information needed
 
 ## Intended uses & limitations
+This model can serve as a base model for further fine-tuning on specific language-extraction tasks. Note that, as a general language-tagging model, it may be confused by closely related languages from the same family or by very short texts.
+
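As a sketch of how such tagged output might be consumed downstream, consecutive tokens that share a language label can be merged into spans. The label scheme here is an assumption (per-token ISO codes, with `O` for neutral spans), and `merge_language_spans` is a hypothetical helper, not part of the released model:

```python
# Hypothetical post-processing sketch: assumes per-token ISO-639-1 labels,
# with "O" marking neutral spans (code, symbols, emojis, etc.).
def merge_language_spans(tokens, labels):
    """Group consecutive tokens that share a language label into spans."""
    spans = []
    for token, label in zip(tokens, labels):
        if spans and spans[-1][0] == label:
            spans[-1][1].append(token)      # extend the current span
        else:
            spans.append((label, [token]))  # start a new span
    return [(label, " ".join(parts)) for label, parts in spans]

# e.g. mixed German/English input:
# merge_language_spans(["Guten", "Tag", "hello", "there"],
#                      ["de", "de", "en", "en"])
# -> [("de", "Guten Tag"), ("en", "hello there")]
```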
+### Training and Evaluation Data
+The model was trained on a synthetic dataset of roughly **2.5 million samples** covering 60 languages across diverse script families (Latin, Cyrillic, Indic, Arabic, Han, etc.). The data is drawn from three sources:
+
+* **Wikipedia:** up to 200,000 individual sentences (120,000 of them reserved) from up to 100,000 unique articles, taken from the first half of Wikipedia after filtering out stubs.
+* **google/smol:** up to 1,000 individual sentences.
+* **finetranslations:** up to 50,000 sentences (30,000 of them reserved) from up to 50,000 unique rows.
+
+Each source is split into a reserve set, used for pure documents, and a main set, used for everything else.
+
+The data composition follows a strategic curriculum:
+
+* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
+* **30% Homogeneous Mixed:** Documents containing one main language, with clear transitions between two or more languages, to train boundary detection.
+* **10% Mixed with Noise:** Integration of "neutral" spans, including code snippets, mathematical notation, emojis, symbols, and `rot_13` text, tagged as `O` or as their respective source, to reduce hallucination.
+
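The 60/30/10 curriculum can be illustrated with a small sampling sketch. The actual data pipeline is not published, so the category names and sampling mechanism below are assumptions for illustration only:

```python
import random

# Illustrative sketch of the 60/30/10 curriculum split described above.
# Category names are assumptions; the real pipeline is not published.
CURRICULUM = [
    ("pure", 0.60),              # single-language documents
    ("homogeneous_mixed", 0.30), # one main language with clear transitions
    ("mixed_with_noise", 0.10),  # neutral spans (code, math, emoji, ...)
]

def sample_categories(n, seed=0):
    """Draw n document categories according to the curriculum weights."""
    rng = random.Random(seed)
    kinds, weights = zip(*CURRICULUM)
    return rng.choices(kinds, weights=weights, k=n)

counts = {kind: 0 for kind, _ in CURRICULUM}
for kind in sample_categories(10_000):
    counts[kind] += 1
# With 10,000 draws, each category's share lands close to its target weight.
```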
+### Supported Languages and Limitations (60)
+The model supports the following ISO-coded languages. Note that Romanized versions of these languages, such as Romanized Russian or Hindi, are not included in the training set:
+`af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, tr, ug, uk, ur, vi, zh`
+
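A quick sanity check that the list above does contain 60 unique codes, as the heading claims:

```python
# Sanity check: the supported-language list should hold 60 unique ISO codes.
LANG_CODES = (
    "af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, "
    "hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, "
    "ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, "
    "tr, ug, uk, ur, vi, zh"
).split(", ")

assert len(LANG_CODES) == 60
assert len(set(LANG_CODES)) == 60  # no duplicates
```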
+### Results on the `papulca/language-identification` test set
+
+| Language | Correct | Total | Accuracy |
+|----------|--------:|------:|---------:|
+| ar       |     114 |   114 |   100.0% |
+| bg       |     109 |   110 |    99.1% |
+| de       |     104 |   106 |    98.1% |
+| el       |     106 |   106 |   100.0% |
+| en       |      73 |    95 |    76.8% |
+| es       |     102 |   104 |    98.1% |
+| fr       |     102 |   102 |   100.0% |
+| hi       |      85 |    87 |    97.7% |
+| it       |      98 |   101 |    97.0% |
+| ja       |      94 |    94 |   100.0% |
+| nl       |      95 |    97 |    97.9% |
+| pl       |     100 |   104 |    96.2% |
+| pt       |     100 |   101 |    99.0% |
+| ru       |     116 |   117 |    99.1% |
+| th       |     108 |   108 |   100.0% |
+| tr       |      83 |    83 |   100.0% |
+| ur       |      92 |    94 |    97.9% |
+| vi       |      87 |    87 |   100.0% |
+| zh       |     100 |   100 |   100.0% |
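The table reports per-language accuracy only; the micro-averaged overall accuracy can be derived from its rows (this aggregate is computed here, not reported by the model card itself):

```python
# Micro-averaged accuracy over the per-language results in the table above.
RESULTS = {  # language: (correct, total)
    "ar": (114, 114), "bg": (109, 110), "de": (104, 106), "el": (106, 106),
    "en": (73, 95),   "es": (102, 104), "fr": (102, 102), "hi": (85, 87),
    "it": (98, 101),  "ja": (94, 94),   "nl": (95, 97),   "pl": (100, 104),
    "pt": (100, 101), "ru": (116, 117), "th": (108, 108), "tr": (83, 83),
    "ur": (92, 94),   "vi": (87, 87),   "zh": (100, 100),
}

correct = sum(c for c, _ in RESULTS.values())
total = sum(t for _, t in RESULTS.values())
overall = 100 * correct / total  # -> 97.8% over 1,910 test samples
```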
 
 ### Training hyperparameters
 
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
 - Tokenizers 0.22.2
+