DerivedFunction committed · Commit e5cf9c3 · verified · 1 Parent(s): 862babc

Update README.md

Files changed (1): README.md +113 -0

README.md CHANGED
@@ -162,12 +162,125 @@ factors were used to simulate messy text, and to reduce single character bias on
  - Random chance to change the casing of compatible language scripts, such as Latin and Cyrillic.
  - Low chance of simulating OCR and messy text with character mutation.
 
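The two noise factors above can be sketched as follows. This is a minimal illustration only; the probabilities and the character-confusion table are assumptions, not the values actually used in training:

```python
import random

# Hypothetical OCR-style confusion pairs; the real augmentation's
# mutation table is not documented here.
OCR_CONFUSIONS = {"l": "1", "O": "0", "e": "c"}

def add_noise(text, case_p=0.1, ocr_p=0.02, seed=None):
    """Randomly flip casing and mutate characters to simulate messy/OCR text."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Random chance to change casing (only scripts with case,
        # e.g. Latin and Cyrillic, respond to upper()/lower()).
        if ch.isalpha() and rng.random() < case_p:
            ch = ch.lower() if ch.isupper() else ch.upper()
        # Low chance of an OCR-style character mutation.
        if rng.random() < ocr_p and ch in OCR_CONFUSIONS:
            ch = OCR_CONFUSIONS[ch]
        out.append(ch)
    return "".join(out)

noisy = add_noise("Hello, World! Привет, мир!", seed=0)
```

Seeding the generator makes the augmentation reproducible per document while still varying across the corpus.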
+
  To generalize well on both the target language and code switching, a curriculum is provided:
  - Pure documents (55%): a single language, to learn its vocabulary.
  - Homogeneous (25%): a single language plus one foreign sentence, to learn simple code switching.
  - Spliced (10%): a foreign sentence centered between two same-language sentences, with the first sentence's punctuation stripped and the second sentence forced to lowercase.
  - Mixed (10%): a generic mix of any languages.
 
+
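The spliced strategy and the curriculum mixture above can be sketched like this. The helper names are assumptions, and "punctuation stripped" is read here as trailing punctuation only:

```python
import random

def splice(prev_sent, foreign_sent, next_sent):
    """Center a foreign sentence between two same-language sentences:
    strip the first sentence's trailing punctuation and force the
    second same-language sentence to lowercase."""
    first = prev_sent.rstrip(".!?…")
    second = next_sent.lower()
    return " ".join([first, foreign_sent, second])

def sample_strategy(rng):
    """Pick a curriculum strategy with the documented mixture weights."""
    return rng.choices(
        ["pure", "homogeneous", "spliced", "mixed"],
        weights=[55, 25, 10, 10],
    )[0]

example = splice("It was raining.", "Il pleuvait.", "We stayed inside.")
# example == "It was raining Il pleuvait. we stayed inside."
```

Removing the boundary cues (punctuation, capitalization) forces the model to detect the language switch from the tokens themselves rather than from sentence delimiters.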
+ | lang | train | train % | eval | eval % | all_splits | all_splits % |
+ | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
+ | en | 158088 | 2.45% | 2785 | 3.77% | 160873 | 2.46% |
+ | ru | 125692 | 1.94% | 1915 | 2.59% | 127607 | 1.95% |
+ | es | 125219 | 1.94% | 1809 | 2.45% | 127028 | 1.94% |
+ | ja | 125212 | 1.94% | 1786 | 2.42% | 126998 | 1.94% |
+ | fr | 123594 | 1.91% | 1803 | 2.44% | 125397 | 1.92% |
+ | de | 121413 | 1.88% | 1714 | 2.32% | 123127 | 1.88% |
+ | zh | 120667 | 1.87% | 1745 | 2.36% | 122412 | 1.87% |
+ | pt | 119430 | 1.85% | 1749 | 2.37% | 121179 | 1.85% |
+ | it | 117478 | 1.82% | 1596 | 2.16% | 119074 | 1.82% |
+ | ar | 101539 | 1.57% | 1208 | 1.63% | 102747 | 1.57% |
+ | fi | 100922 | 1.56% | 1490 | 2.02% | 102412 | 1.57% |
+ | uk | 98233 | 1.52% | 1160 | 1.57% | 99393 | 1.52% |
+ | pl | 96600 | 1.49% | 1144 | 1.55% | 97744 | 1.50% |
+ | no | 93851 | 1.45% | 1471 | 1.99% | 95322 | 1.46% |
+ | hu | 92941 | 1.44% | 1077 | 1.46% | 94018 | 1.44% |
+ | tr | 92566 | 1.43% | 1053 | 1.42% | 93619 | 1.43% |
+ | nl | 91431 | 1.41% | 1067 | 1.44% | 92498 | 1.41% |
+ | he | 91496 | 1.42% | 832 | 1.13% | 92328 | 1.41% |
+ | cs | 91275 | 1.41% | 989 | 1.34% | 92264 | 1.41% |
+ | da | 87309 | 1.35% | 1223 | 1.65% | 88532 | 1.35% |
+ | lt | 86897 | 1.34% | 935 | 1.26% | 87832 | 1.34% |
+ | mk | 85658 | 1.33% | 824 | 1.11% | 86482 | 1.32% |
+ | eo | 83778 | 1.30% | 794 | 1.07% | 84572 | 1.29% |
+ | mr | 83264 | 1.29% | 744 | 1.01% | 84008 | 1.28% |
+ | ko | 81829 | 1.27% | 1005 | 1.36% | 82834 | 1.27% |
+ | hi | 81659 | 1.26% | 979 | 1.32% | 82638 | 1.26% |
+ | tl | 79985 | 1.24% | 966 | 1.31% | 80951 | 1.24% |
+ | hy | 76909 | 1.19% | 735 | 0.99% | 77644 | 1.19% |
+ | el | 75369 | 1.17% | 728 | 0.98% | 76097 | 1.16% |
+ | ro | 73559 | 1.14% | 763 | 1.03% | 74322 | 1.14% |
+ | is | 72356 | 1.12% | 987 | 1.34% | 73343 | 1.12% |
+ | sk | 71990 | 1.11% | 858 | 1.16% | 72848 | 1.11% |
+ | la | 70651 | 1.09% | 745 | 1.01% | 71396 | 1.09% |
+ | be | 70521 | 1.09% | 867 | 1.17% | 71388 | 1.09% |
+ | fa | 70584 | 1.09% | 717 | 0.97% | 71301 | 1.09% |
+ | bg | 69684 | 1.08% | 673 | 0.91% | 70357 | 1.08% |
+ | lv | 67627 | 1.05% | 691 | 0.93% | 68318 | 1.04% |
+ | ms | 66271 | 1.03% | 770 | 1.04% | 67041 | 1.03% |
+ | af | 64699 | 1.00% | 982 | 1.33% | 65681 | 1.00% |
+ | ckb | 64368 | 1.00% | 587 | 0.79% | 64955 | 0.99% |
+ | kk | 63640 | 0.98% | 621 | 0.84% | 64261 | 0.98% |
+ | eu | 63398 | 0.98% | 673 | 0.91% | 64071 | 0.98% |
+ | ka | 63201 | 0.98% | 523 | 0.71% | 63724 | 0.97% |
+ | mn | 62551 | 0.97% | 641 | 0.87% | 63192 | 0.97% |
+ | hr | 62427 | 0.97% | 711 | 0.96% | 63138 | 0.97% |
+ | oc | 62292 | 0.96% | 661 | 0.89% | 62953 | 0.96% |
+ | id | 62134 | 0.96% | 732 | 0.99% | 62866 | 0.96% |
+ | ky | 61634 | 0.95% | 637 | 0.86% | 62271 | 0.95% |
+ | ba | 61637 | 0.95% | 584 | 0.79% | 62221 | 0.95% |
+ | ur | 61550 | 0.95% | 578 | 0.78% | 62128 | 0.95% |
+ | th | 60731 | 0.94% | 576 | 0.78% | 61307 | 0.94% |
+ | bn | 60588 | 0.94% | 415 | 0.56% | 61003 | 0.93% |
+ | ps | 60342 | 0.93% | 533 | 0.72% | 60875 | 0.93% |
+ | sv | 59918 | 0.93% | 937 | 1.27% | 60855 | 0.93% |
+ | tt | 60177 | 0.93% | 634 | 0.86% | 60811 | 0.93% |
+ | pa | 60137 | 0.93% | 599 | 0.81% | 60736 | 0.93% |
+ | sw | 60148 | 0.93% | 558 | 0.75% | 60706 | 0.93% |
+ | kn | 60037 | 0.93% | 631 | 0.85% | 60668 | 0.93% |
+ | as | 59839 | 0.93% | 374 | 0.51% | 60213 | 0.92% |
+ | cy | 58188 | 0.90% | 655 | 0.89% | 58843 | 0.90% |
+ | jv | 57805 | 0.89% | 508 | 0.69% | 58313 | 0.89% |
+ | bs | 57399 | 0.89% | 655 | 0.89% | 58054 | 0.89% |
+ | ga | 57233 | 0.89% | 672 | 0.91% | 57905 | 0.89% |
+ | ca | 56547 | 0.87% | 606 | 0.82% | 57153 | 0.87% |
+ | gl | 55312 | 0.86% | 577 | 0.78% | 55889 | 0.85% |
+ | sl | 55017 | 0.85% | 598 | 0.81% | 55615 | 0.85% |
+ | ku | 54674 | 0.85% | 537 | 0.73% | 55211 | 0.84% |
+ | ne | 54102 | 0.84% | 440 | 0.60% | 54542 | 0.83% |
+ | uz | 53777 | 0.83% | 507 | 0.69% | 54284 | 0.83% |
+ | tg | 50762 | 0.79% | 502 | 0.68% | 51264 | 0.78% |
+ | br | 49263 | 0.76% | 554 | 0.75% | 49817 | 0.76% |
+ | et | 49249 | 0.76% | 511 | 0.69% | 49760 | 0.76% |
+ | lb | 48192 | 0.75% | 492 | 0.67% | 48684 | 0.74% |
+ | su | 48185 | 0.75% | 480 | 0.65% | 48665 | 0.74% |
+ | mt | 47694 | 0.74% | 446 | 0.60% | 48140 | 0.74% |
+ | sr | 47385 | 0.73% | 458 | 0.62% | 47843 | 0.73% |
+ | sq | 45528 | 0.70% | 514 | 0.70% | 46042 | 0.70% |
+ | ml | 43461 | 0.67% | 429 | 0.58% | 43890 | 0.67% |
+ | or | 41301 | 0.64% | 413 | 0.56% | 41714 | 0.64% |
+ | te | 40065 | 0.62% | 381 | 0.52% | 40446 | 0.62% |
+ | yi | 38484 | 0.60% | 353 | 0.48% | 38837 | 0.59% |
+ | ta | 35897 | 0.56% | 378 | 0.51% | 36275 | 0.55% |
+ | mg | 35133 | 0.54% | 342 | 0.46% | 35475 | 0.54% |
+ | si | 34611 | 0.54% | 343 | 0.46% | 34954 | 0.53% |
+ | gu | 29347 | 0.45% | 298 | 0.40% | 29645 | 0.45% |
+ | vi | 28448 | 0.44% | 329 | 0.45% | 28777 | 0.44% |
+ | rm | 27668 | 0.43% | 252 | 0.34% | 27920 | 0.43% |
+ | bo | 25636 | 0.40% | 217 | 0.29% | 25853 | 0.40% |
+ | ug | 23932 | 0.37% | 213 | 0.29% | 24145 | 0.37% |
+ | dv | 22580 | 0.35% | 204 | 0.28% | 22784 | 0.35% |
+ | am | 22498 | 0.35% | 227 | 0.31% | 22725 | 0.35% |
+ | yo | 22441 | 0.35% | 229 | 0.31% | 22670 | 0.35% |
+ | my | 21832 | 0.34% | 210 | 0.28% | 22042 | 0.34% |
+ | so | 21058 | 0.33% | 201 | 0.27% | 21259 | 0.33% |
+ | km | 21064 | 0.33% | 187 | 0.25% | 21251 | 0.33% |
+ | sd | 20471 | 0.32% | 199 | 0.27% | 20670 | 0.32% |
+ | zu | 19688 | 0.30% | 186 | 0.25% | 19874 | 0.30% |
+ | lo | 18555 | 0.29% | 188 | 0.25% | 18743 | 0.29% |
+ | ti | 18116 | 0.28% | 193 | 0.26% | 18309 | 0.28% |
+ | ce | 16789 | 0.26% | 181 | 0.24% | 16970 | 0.26% |
+ | ny | 16544 | 0.26% | 159 | 0.22% | 16703 | 0.26% |
+ | gd | 14012 | 0.22% | 142 | 0.19% | 14154 | 0.22% |
+ | xh | 9373 | 0.15% | 96 | 0.13% | 9469 | 0.14% |
+ | om | 6113 | 0.09% | 55 | 0.07% | 6168 | 0.09% |
+ | sco | 3362 | 0.05% | 30 | 0.04% | 3392 | 0.05% |
+ | **total** | 6463786 | 100.00% | 73931 | 100.00% | 6537717 | 100.00% |
+
+
+
  This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
  It achieves the following results on the evaluation set:
  - Loss: 0.0345