Update README.md
README.md (changed)
metrics:
- f1
- accuracy
model-index:
- name: polyglot-tagger
  results: []
---

# Polyglot Tagger: 60L

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
It achieves the following results on the evaluation set:
- Loss: 0.0404
- Precision: 0.8848

More information needed

## Intended uses & limitations

This model can be treated as a base model for further fine-tuning on specific language-extraction tasks. Note that, as a general language-tagging model, it can be confused by closely related languages within the same family or by short texts.

### Training and Evaluation Data

The model was trained on a synthetic dataset of roughly **2.5 million samples** covering 60 languages across diverse script families (Latin, Cyrillic, Indic, Arabic, Han, etc.). The data is drawn from Wikipedia (up to 200,000 individual sentences, 120,000 of them reserved, from up to 100,000 unique articles, taking the first half of Wikipedia after filtering for stubs), google/smol (up to 1,000 individual sentences), and finetranslations (up to 50,000 sentences, 30,000 of them reserved, from up to 50,000 unique rows). Each source is split into a reserve set used for pure documents and a main set used for everything else.
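The per-source cap and reserve split described above can be sketched as a small helper. This is an illustrative assumption, not the actual training code: the function name is hypothetical, and a real pipeline would likely sample randomly rather than slice sequentially.

```python
def split_source(sentences: list[str], cap: int, reserve: int) -> tuple[list[str], list[str]]:
    """Cap a source at `cap` sentences, then hold out `reserve` of them
    for pure documents; the remainder forms the main (mixing) set."""
    capped = sentences[:cap]
    reserve_set = capped[:reserve]
    main_set = capped[reserve:]
    return reserve_set, main_set

# A Wikipedia-like source under the stated caps: 200,000 sentences, 120,000 reserved.
sents = [f"sentence {i}" for i in range(250_000)]
reserve_set, main_set = split_source(sents, cap=200_000, reserve=120_000)
```

With these caps, the main set keeps the 80,000 capped sentences not held in reserve.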

The data composition follows a strategic curriculum:

* **60% Pure Documents:** Single-language sequences to establish strong baseline profiles for each language.
* **30% Homogeneous Mixed:** Documents containing one main language, with clear transitions between two or more languages, to train boundary detection.
* **10% Mixed with Noise:** Integration of "neutral" spans, including code snippets, mathematical notation, emojis, symbols, and `rot_13` text, tagged as `O` or as their respective source, to reduce hallucination.
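The 60/30/10 composition can be expressed as a weighted sampler. The sketch below is illustrative only; the names are hypothetical and just the weights come from the list above.

```python
import random

# Curriculum weights from the data-composition list (assumed exact).
CURRICULUM = {
    "pure": 0.60,         # single-language documents
    "mixed": 0.30,        # one main language with clear transitions
    "mixed_noise": 0.10,  # mixed documents with "neutral" noise spans
}

def sample_document_type(rng: random.Random) -> str:
    """Pick a document type according to the 60/30/10 curriculum."""
    kinds = list(CURRICULUM)
    weights = [CURRICULUM[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]

# Draw 10,000 samples; the empirical mix should track the target weights.
rng = random.Random(0)
counts = {k: 0 for k in CURRICULUM}
for _ in range(10_000):
    counts[sample_document_type(rng)] += 1
```

Any weighted-choice mechanism works here; `random.choices` keeps the sketch dependency-free.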

### Supported Languages and Limitations (60)

The model supports the following ISO-coded languages. Note that Romanized versions of these languages, such as Romanized Russian or Romanized Hindi, are not included in the training set:

`af, am, ar, as, be, bg, bn, cs, da, de, el, en, es, fa, fi, fr, gu, he, hi, hu, hy, id, is, it, ja, ka, kk, km, kn, ko, la, lo, ml, mk, mn, mr, ms, my, nl, no, or, pa, pl, ps, pt, ro, ru, sd, sq, sr, sv, ta, te, th, tr, ug, uk, ur, vi, zh`
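For downstream filtering, the supported set can be captured as a constant; the name `SUPPORTED_LANGS` and the helper are illustrative, not part of the model's API.

```python
# The 60 supported ISO language codes from the list above.
SUPPORTED_LANGS = frozenset(
    "af am ar as be bg bn cs da de el en es fa fi fr gu he hi hu "
    "hy id is it ja ka kk km kn ko la lo ml mk mn mr ms my nl no "
    "or pa pl ps pt ro ru sd sq sr sv ta te th tr ug uk ur vi zh".split()
)

def is_supported(lang_code: str) -> bool:
    """Return True if the model was trained on this language code."""
    return lang_code.lower() in SUPPORTED_LANGS
```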

### Results on the `papluca/language-identification` test set

| Language | Correct | Total | Accuracy |
|----------|---------|-------|----------|
| ar | 114 | 114 | 100.0% |
| bg | 109 | 110 | 99.1% |
| de | 104 | 106 | 98.1% |
| el | 106 | 106 | 100.0% |
| en | 73 | 95 | 76.8% |
| es | 102 | 104 | 98.1% |
| fr | 102 | 102 | 100.0% |
| hi | 85 | 87 | 97.7% |
| it | 98 | 101 | 97.0% |
| ja | 94 | 94 | 100.0% |
| nl | 95 | 97 | 97.9% |
| pl | 100 | 104 | 96.2% |
| pt | 100 | 101 | 99.0% |
| ru | 116 | 117 | 99.1% |
| th | 108 | 108 | 100.0% |
| tr | 83 | 83 | 100.0% |
| ur | 92 | 94 | 97.9% |
| vi | 87 | 87 | 100.0% |
| zh | 100 | 100 | 100.0% |
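As a quick sanity check, the per-language counts above can be aggregated into a micro-averaged accuracy:

```python
# (language, correct, total) rows copied from the results above.
RESULTS = [
    ("ar", 114, 114), ("bg", 109, 110), ("de", 104, 106), ("el", 106, 106),
    ("en", 73, 95), ("es", 102, 104), ("fr", 102, 102), ("hi", 85, 87),
    ("it", 98, 101), ("ja", 94, 94), ("nl", 95, 97), ("pl", 100, 104),
    ("pt", 100, 101), ("ru", 116, 117), ("th", 108, 108), ("tr", 83, 83),
    ("ur", 92, 94), ("vi", 87, 87), ("zh", 100, 100),
]

correct = sum(c for _, c, _ in RESULTS)
total = sum(t for _, _, t in RESULTS)
micro_accuracy = correct / total  # fraction of all test items tagged correctly
```

With the numbers above this works out to 1868/1910, roughly 97.8% overall, with English (76.8%) the clear outlier.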

### Training hyperparameters

The following hyperparameters were used during training:

[...]

- Pytorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2