Token Classification
Transformers
TensorBoard
Safetensors
xlm-roberta
Generated from Trainer
language-identification
codeswitching
Instructions to use DerivedFunction/polyglot-tagger-v2.2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use DerivedFunction/polyglot-tagger-v2.2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="DerivedFunction/polyglot-tagger-v2.2")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
model = AutoModelForTokenClassification.from_pretrained("DerivedFunction/polyglot-tagger-v2.2")
```
- Notebooks
- Google Colab
- Kaggle
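
As a quick smoke test for the pipeline snippet above, the sketch below tags a code-switched sentence and prints one grouped span per predicted label. This is a minimal sketch rather than an official example: the sentence, the `aggregation_strategy` choice, and the printed fields are assumptions based on the standard Transformers token-classification pipeline; the label set is whatever this model defines.

```python
# Minimal sketch (not from the model card): tag a code-switched sentence and
# print one grouped span per predicted language label.
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="DerivedFunction/polyglot-tagger-v2.2",
    aggregation_strategy="simple",  # merge adjacent tokens that share a label
)

text = "I took the métro to la Défense and then walked home."
for span in pipe(text):
    # Each entry carries the predicted label, a confidence score, and character offsets.
    print(span["entity_group"], round(span["score"], 3), repr(text[span["start"]:span["end"]]))
```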

Factors were used to simulate messy text and to reduce single-character bias:

- Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
- Low chance of simulating OCR and messy text with character mutation (see the sketch below).
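
To make the two factors above concrete, here is a hypothetical sketch of such an augmentation pass. The probabilities (`case_p`, `mutate_p`) and the `OCR_CONFUSIONS` table are illustrative placeholders, not the values used to build the training data.

```python
import random

# Hypothetical augmentation sketch: random case flips plus rare OCR-style
# character mutations. Probabilities and the confusion table are made up.
OCR_CONFUSIONS = {"o": "0", "l": "1", "e": "c", "m": "rn"}

def augment(text: str, case_p: float = 0.1, mutate_p: float = 0.02) -> str:
    out = []
    for ch in text:
        # Random chance to flip the casing of cased scripts (Latin, Cyrillic, ...).
        if ch.isalpha() and random.random() < case_p:
            ch = ch.lower() if ch.isupper() else ch.upper()
        # Low chance of an OCR-like character mutation.
        if random.random() < mutate_p:
            ch = OCR_CONFUSIONS.get(ch.lower(), ch)
        out.append(ch)
    return "".join(out)

print(augment("Пример text with Latin and Cyrillic."))
```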

To generalize well on both the target language and code switching, a curriculum is provided:

- Pure documents (55%): a single language, to learn its vocabulary.
- Homogeneous (25%): a single language plus one foreign sentence, to learn simple code switching.
- Spliced (10%): a foreign sentence is centered between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase (a sketch of this construction follows the list).
- Mixed (10%): a generic mix of any languages.
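
Of these document types, "Spliced" has the most specific construction, so the sketch below spells it out. The `splice` helper and the example sentences are hypothetical; the actual dataset-building code is not part of this card.

```python
import string

# Hypothetical sketch of the "Spliced" document type: a foreign sentence sits
# between two same-language sentences; the first sentence has its punctuation
# stripped and the second sentence is lowercased.
def splice(first: str, foreign: str, second: str) -> str:
    first = first.translate(str.maketrans("", "", string.punctuation)).strip()
    return f"{first} {foreign} {second.lower()}"

print(splice("The weather is nice today.", "Il pleut beaucoup à Paris.", "So we stayed inside."))
# -> The weather is nice today Il pleut beaucoup à Paris. so we stayed inside.
```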

| lang | train | train % | eval | eval % | all_splits | all_splits % |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: |
| en | 158088 | 2.45% | 2785 | 3.77% | 160873 | 2.46% |
| ru | 125692 | 1.94% | 1915 | 2.59% | 127607 | 1.95% |
| es | 125219 | 1.94% | 1809 | 2.45% | 127028 | 1.94% |
| ja | 125212 | 1.94% | 1786 | 2.42% | 126998 | 1.94% |
| fr | 123594 | 1.91% | 1803 | 2.44% | 125397 | 1.92% |
| de | 121413 | 1.88% | 1714 | 2.32% | 123127 | 1.88% |
| zh | 120667 | 1.87% | 1745 | 2.36% | 122412 | 1.87% |
| pt | 119430 | 1.85% | 1749 | 2.37% | 121179 | 1.85% |
| it | 117478 | 1.82% | 1596 | 2.16% | 119074 | 1.82% |
| ar | 101539 | 1.57% | 1208 | 1.63% | 102747 | 1.57% |
| fi | 100922 | 1.56% | 1490 | 2.02% | 102412 | 1.57% |
| uk | 98233 | 1.52% | 1160 | 1.57% | 99393 | 1.52% |
| pl | 96600 | 1.49% | 1144 | 1.55% | 97744 | 1.50% |
| no | 93851 | 1.45% | 1471 | 1.99% | 95322 | 1.46% |
| hu | 92941 | 1.44% | 1077 | 1.46% | 94018 | 1.44% |
| tr | 92566 | 1.43% | 1053 | 1.42% | 93619 | 1.43% |
| nl | 91431 | 1.41% | 1067 | 1.44% | 92498 | 1.41% |
| he | 91496 | 1.42% | 832 | 1.13% | 92328 | 1.41% |
| cs | 91275 | 1.41% | 989 | 1.34% | 92264 | 1.41% |
| da | 87309 | 1.35% | 1223 | 1.65% | 88532 | 1.35% |
| lt | 86897 | 1.34% | 935 | 1.26% | 87832 | 1.34% |
| mk | 85658 | 1.33% | 824 | 1.11% | 86482 | 1.32% |
| eo | 83778 | 1.30% | 794 | 1.07% | 84572 | 1.29% |
| mr | 83264 | 1.29% | 744 | 1.01% | 84008 | 1.28% |
| ko | 81829 | 1.27% | 1005 | 1.36% | 82834 | 1.27% |
| hi | 81659 | 1.26% | 979 | 1.32% | 82638 | 1.26% |
| tl | 79985 | 1.24% | 966 | 1.31% | 80951 | 1.24% |
| hy | 76909 | 1.19% | 735 | 0.99% | 77644 | 1.19% |
| el | 75369 | 1.17% | 728 | 0.98% | 76097 | 1.16% |
| ro | 73559 | 1.14% | 763 | 1.03% | 74322 | 1.14% |
| is | 72356 | 1.12% | 987 | 1.34% | 73343 | 1.12% |
| sk | 71990 | 1.11% | 858 | 1.16% | 72848 | 1.11% |
| la | 70651 | 1.09% | 745 | 1.01% | 71396 | 1.09% |
| be | 70521 | 1.09% | 867 | 1.17% | 71388 | 1.09% |
| fa | 70584 | 1.09% | 717 | 0.97% | 71301 | 1.09% |
| bg | 69684 | 1.08% | 673 | 0.91% | 70357 | 1.08% |
| lv | 67627 | 1.05% | 691 | 0.93% | 68318 | 1.04% |
| ms | 66271 | 1.03% | 770 | 1.04% | 67041 | 1.03% |
| af | 64699 | 1.00% | 982 | 1.33% | 65681 | 1.00% |
| ckb | 64368 | 1.00% | 587 | 0.79% | 64955 | 0.99% |
| kk | 63640 | 0.98% | 621 | 0.84% | 64261 | 0.98% |
| eu | 63398 | 0.98% | 673 | 0.91% | 64071 | 0.98% |
| ka | 63201 | 0.98% | 523 | 0.71% | 63724 | 0.97% |
| mn | 62551 | 0.97% | 641 | 0.87% | 63192 | 0.97% |
| hr | 62427 | 0.97% | 711 | 0.96% | 63138 | 0.97% |
| oc | 62292 | 0.96% | 661 | 0.89% | 62953 | 0.96% |
| id | 62134 | 0.96% | 732 | 0.99% | 62866 | 0.96% |
| ky | 61634 | 0.95% | 637 | 0.86% | 62271 | 0.95% |
| ba | 61637 | 0.95% | 584 | 0.79% | 62221 | 0.95% |
| ur | 61550 | 0.95% | 578 | 0.78% | 62128 | 0.95% |
| th | 60731 | 0.94% | 576 | 0.78% | 61307 | 0.94% |
| bn | 60588 | 0.94% | 415 | 0.56% | 61003 | 0.93% |
| ps | 60342 | 0.93% | 533 | 0.72% | 60875 | 0.93% |
| sv | 59918 | 0.93% | 937 | 1.27% | 60855 | 0.93% |
| tt | 60177 | 0.93% | 634 | 0.86% | 60811 | 0.93% |
| pa | 60137 | 0.93% | 599 | 0.81% | 60736 | 0.93% |
| sw | 60148 | 0.93% | 558 | 0.75% | 60706 | 0.93% |
| kn | 60037 | 0.93% | 631 | 0.85% | 60668 | 0.93% |
| as | 59839 | 0.93% | 374 | 0.51% | 60213 | 0.92% |
| cy | 58188 | 0.90% | 655 | 0.89% | 58843 | 0.90% |
| jv | 57805 | 0.89% | 508 | 0.69% | 58313 | 0.89% |
| bs | 57399 | 0.89% | 655 | 0.89% | 58054 | 0.89% |
| ga | 57233 | 0.89% | 672 | 0.91% | 57905 | 0.89% |
| ca | 56547 | 0.87% | 606 | 0.82% | 57153 | 0.87% |
| gl | 55312 | 0.86% | 577 | 0.78% | 55889 | 0.85% |
| sl | 55017 | 0.85% | 598 | 0.81% | 55615 | 0.85% |
| ku | 54674 | 0.85% | 537 | 0.73% | 55211 | 0.84% |
| ne | 54102 | 0.84% | 440 | 0.60% | 54542 | 0.83% |
| uz | 53777 | 0.83% | 507 | 0.69% | 54284 | 0.83% |
| tg | 50762 | 0.79% | 502 | 0.68% | 51264 | 0.78% |
| br | 49263 | 0.76% | 554 | 0.75% | 49817 | 0.76% |
| et | 49249 | 0.76% | 511 | 0.69% | 49760 | 0.76% |
| lb | 48192 | 0.75% | 492 | 0.67% | 48684 | 0.74% |
| su | 48185 | 0.75% | 480 | 0.65% | 48665 | 0.74% |
| mt | 47694 | 0.74% | 446 | 0.60% | 48140 | 0.74% |
| sr | 47385 | 0.73% | 458 | 0.62% | 47843 | 0.73% |
| sq | 45528 | 0.70% | 514 | 0.70% | 46042 | 0.70% |
| ml | 43461 | 0.67% | 429 | 0.58% | 43890 | 0.67% |
| or | 41301 | 0.64% | 413 | 0.56% | 41714 | 0.64% |
| te | 40065 | 0.62% | 381 | 0.52% | 40446 | 0.62% |
| yi | 38484 | 0.60% | 353 | 0.48% | 38837 | 0.59% |
| ta | 35897 | 0.56% | 378 | 0.51% | 36275 | 0.55% |
| mg | 35133 | 0.54% | 342 | 0.46% | 35475 | 0.54% |
| si | 34611 | 0.54% | 343 | 0.46% | 34954 | 0.53% |
| gu | 29347 | 0.45% | 298 | 0.40% | 29645 | 0.45% |
| vi | 28448 | 0.44% | 329 | 0.45% | 28777 | 0.44% |
| rm | 27668 | 0.43% | 252 | 0.34% | 27920 | 0.43% |
| bo | 25636 | 0.40% | 217 | 0.29% | 25853 | 0.40% |
| ug | 23932 | 0.37% | 213 | 0.29% | 24145 | 0.37% |
| dv | 22580 | 0.35% | 204 | 0.28% | 22784 | 0.35% |
| am | 22498 | 0.35% | 227 | 0.31% | 22725 | 0.35% |
| yo | 22441 | 0.35% | 229 | 0.31% | 22670 | 0.35% |
| my | 21832 | 0.34% | 210 | 0.28% | 22042 | 0.34% |
| so | 21058 | 0.33% | 201 | 0.27% | 21259 | 0.33% |
| km | 21064 | 0.33% | 187 | 0.25% | 21251 | 0.33% |
| sd | 20471 | 0.32% | 199 | 0.27% | 20670 | 0.32% |
| zu | 19688 | 0.30% | 186 | 0.25% | 19874 | 0.30% |
| lo | 18555 | 0.29% | 188 | 0.25% | 18743 | 0.29% |
| ti | 18116 | 0.28% | 193 | 0.26% | 18309 | 0.28% |
| ce | 16789 | 0.26% | 181 | 0.24% | 16970 | 0.26% |
| ny | 16544 | 0.26% | 159 | 0.22% | 16703 | 0.26% |
| gd | 14012 | 0.22% | 142 | 0.19% | 14154 | 0.22% |
| xh | 9373 | 0.15% | 96 | 0.13% | 9469 | 0.14% |
| om | 6113 | 0.09% | 55 | 0.07% | 6168 | 0.09% |
| sco | 3362 | 0.05% | 30 | 0.04% | 3392 | 0.05% |
| **total** | 6463786 | 100.00% | 73931 | 100.00% | 6537717 | 100.00% |

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
It achieves the following results on the evaluation set:
- Loss: 0.0345