DerivedFunction committed c9a12a8 (verified) · 1 parent: a48d33f

Update README.md

Files changed (1): README.md (+43 −11)
@@ -11,14 +11,14 @@ metrics:
 - f1
 - accuracy
 model-index:
-- name: language-identification
   results: []
 ---
 
 
-# Language Identification
 
-This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
 It achieves the following results on the evaluation set:
 - Loss: 0.0404
 - Precision: 0.8848
@@ -31,14 +31,45 @@ It achieves the following results on the evaluation set:
 More information needed
 
 ## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
 
 ### Training hyperparameters
 
@@ -79,3 +110,4 @@ The following hyperparameters were used during training:
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
 - Tokenizers 0.22.2
 
 
 - f1
 - accuracy
 model-index:
+- name: polyglot-tagger
   results: []
 ---
 
 
+# Polyglot Tagger: 60L
 
+This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
 It achieves the following results on the evaluation set:
 - Loss: 0.0404
 - Precision: 0.8848
 
 More information needed
 
 ## Intended uses & limitations
+This model can serve as a base model for further fine-tuning on specific language-extraction tasks. Note that, as a general language-tagging model, it may be confused by closely related languages from the same family or by very short texts.
+
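As a sketch of how such tagged output might be consumed downstream, consecutive tokens that share a language label can be merged into spans. The label scheme here is an assumption (per-token ISO codes, with `O` for neutral spans), and `merge_language_spans` is a hypothetical helper, not part of the released model:

```python
# Hypothetical post-processing sketch: assumes per-token ISO-639-1 labels,
# with "O" marking neutral spans (code, symbols, emojis, etc.).
def merge_language_spans(tokens, labels):
    """Group consecutive tokens that share a language label into spans."""
    spans = []
    for token, label in zip(tokens, labels):
        if spans and spans[-1][0] == label:
            spans[-1][1].append(token)      # extend the current span
        else:
            spans.append((label, [token]))  # start a new span
    return [(label, " ".join(parts)) for label, parts in spans]

# e.g. mixed German/English input:
# merge_language_spans(["Guten", "Tag", "hello", "there"],
#                      ["de", "de", "en", "en"])
# -> [("de", "Guten Tag"), ("en", "hello there")]
```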
+### Training and Evaluation Data
+The model was trained on a synthetic dataset of roughly **2.5 million samples** covering 60 languages across diverse script families (Latin, Cyrillic, Indic, Arabic, Han, etc.). The data is drawn from three sources:
+
+* **Wikipedia:** up to 200,000 individual sentences (120,000 of them reserved) from up to 100,000 unique articles, taken from the first half of Wikipedia after filtering out stubs.
+* **google/smol:** up to 1,000 individual sentences.
+* **finetranslations:** up to 50,000 sentences (30,000 of them reserved) from up to 50,000 unique rows.
+
+Each source is split into a reserve set, used for pure documents, and a main set, used for everything else.
+
+The data composition follows a strategic curriculum:
+
+* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
+* **30% Homogeneous Mixed:** Documents containing one main language, with clear transitions between two or more languages, to train boundary detection.
+* **10% Mixed with Noise:** Integration of "neutral" spans, including code snippets, mathematical notation, emojis, symbols, and `rot_13` text, tagged as `O` or as their respective source, to reduce hallucination.
+
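The 60/30/10 curriculum can be illustrated with a small sampling sketch. The actual data pipeline is not published, so the category names and sampling mechanism below are assumptions for illustration only:

```python
import random

# Illustrative sketch of the 60/30/10 curriculum split described above.
# Category names are assumptions; the real pipeline is not published.
CURRICULUM = [
    ("pure", 0.60),              # single-language documents
    ("homogeneous_mixed", 0.30), # one main language with clear transitions
    ("mixed_with_noise", 0.10),  # neutral spans (code, math, emoji, ...)
]

def sample_categories(n, seed=0):
    """Draw n document categories according to the curriculum weights."""
    rng = random.Random(seed)
    kinds, weights = zip(*CURRICULUM)
    return rng.choices(kinds, weights=weights, k=n)

counts = {kind: 0 for kind, _ in CURRICULUM}
for kind in sample_categories(10_000):
    counts[kind] += 1
# With 10,000 draws, each category's share lands close to its target weight.
```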
+### Supported Languages and Limitations (60)
+The model supports the following ISO-coded languages. Note that Romanized versions of these languages, such as Romanized Russian or Hindi, are not included in the training set:
+`af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, tr, ug, uk, ur, vi, zh`
+
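A quick sanity check that the list above does contain 60 unique codes, as the heading claims:

```python
# Sanity check: the supported-language list should hold 60 unique ISO codes.
LANG_CODES = (
    "af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, "
    "hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, "
    "ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, "
    "tr, ug, uk, ur, vi, zh"
).split(", ")

assert len(LANG_CODES) == 60
assert len(set(LANG_CODES)) == 60  # no duplicates
```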
+### Results on the `papulca/language-identification` test set
+
+| Language | Correct | Total | Accuracy |
+|----------|--------:|------:|---------:|
+| ar       |     114 |   114 |   100.0% |
+| bg       |     109 |   110 |    99.1% |
+| de       |     104 |   106 |    98.1% |
+| el       |     106 |   106 |   100.0% |
+| en       |      73 |    95 |    76.8% |
+| es       |     102 |   104 |    98.1% |
+| fr       |     102 |   102 |   100.0% |
+| hi       |      85 |    87 |    97.7% |
+| it       |      98 |   101 |    97.0% |
+| ja       |      94 |    94 |   100.0% |
+| nl       |      95 |    97 |    97.9% |
+| pl       |     100 |   104 |    96.2% |
+| pt       |     100 |   101 |    99.0% |
+| ru       |     116 |   117 |    99.1% |
+| th       |     108 |   108 |   100.0% |
+| tr       |      83 |    83 |   100.0% |
+| ur       |      92 |    94 |    97.9% |
+| vi       |      87 |    87 |   100.0% |
+| zh       |     100 |   100 |   100.0% |
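The table reports per-language accuracy only; the micro-averaged overall accuracy can be derived from its rows (this aggregate is computed here, not reported by the model card itself):

```python
# Micro-averaged accuracy over the per-language results in the table above.
RESULTS = {  # language: (correct, total)
    "ar": (114, 114), "bg": (109, 110), "de": (104, 106), "el": (106, 106),
    "en": (73, 95),   "es": (102, 104), "fr": (102, 102), "hi": (85, 87),
    "it": (98, 101),  "ja": (94, 94),   "nl": (95, 97),   "pl": (100, 104),
    "pt": (100, 101), "ru": (116, 117), "th": (108, 108), "tr": (83, 83),
    "ur": (92, 94),   "vi": (87, 87),   "zh": (100, 100),
}

correct = sum(c for c, _ in RESULTS.values())
total = sum(t for _, t in RESULTS.values())
overall = 100 * correct / total  # -> 97.8% over 1,910 test samples
```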
 
 ### Training hyperparameters
 
 - Pytorch 2.10.0+cu128
 - Datasets 4.0.0
 - Tokenizers 0.22.2
+