Update README.md
README.md
CHANGED
@@ -127,6 +127,49 @@ The model supports the following ISO-coded languages:

 > Note that Romanized versions of languages, such as Romanized Russian and Hindi, are not included in the training set.

+Coverage, measured on a sample, is as follows:
+
+Per-group coverage (examples / tokens):
+English 47 examples | 3947 tokens
+Russian 47 examples | 3665 tokens
+German 58 examples | 4625 tokens
+Japanese 50 examples | 4188 tokens
+Chinese 60 examples | 4131 tokens
+French 40 examples | 3723 tokens
+Spanish 44 examples | 4756 tokens
+Portuguese 27 examples | 2130 tokens
+Italian 57 examples | 5178 tokens
+Polish 25 examples | 1753 tokens
+Dutch 44 examples | 3082 tokens
+Turkish 35 examples | 2315 tokens
+SoutheastAsianLatin 114 examples | 8861 tokens
+CentralEuropeanLatin 125 examples | 9761 tokens
+Korean 38 examples | 3958 tokens
+EastSlavicCyrillic 85 examples | 7471 tokens
+Arabic 45 examples | 2508 tokens
+NordicCore 194 examples | 14094 tokens
+BalkanCyrillic 71 examples | 6231 tokens
+ArabicOther 92 examples | 8010 tokens
+Hindi 33 examples | 3251 tokens
+IndicOther 261 examples | 40630 tokens
+CentralAsianCyrillic 57 examples | 3789 tokens
+AfricanLatin 82 examples | 5910 tokens
+OtherScripts 269 examples | 28603 tokens
+
+Top token languages:
+ml 8197
+it 5178
+ta 4903
+he 4873
+es 4756
+de 4625
+kn 4613
+pa 4457
+ja 4188
+zh 4131
+uk 4007
+ko 3958
+
 ## Evaluation
 ### The model scored the following on `papulca/language-identification`'s test set
 |Language | Correct | Total | Accuracy |
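
The per-group coverage lines added in this hunk follow a regular `<group> <n> examples | <m> tokens` pattern, so the figures can be totalled or sanity-checked with a short script. The following is a minimal sketch, assuming the lines are copied verbatim from the README; `parse_coverage` and `COVERAGE_RE` are hypothetical names used for illustration and are not part of this repository.

```python
import re

# Hypothetical pattern for lines such as "English 47 examples | 3947 tokens".
COVERAGE_RE = re.compile(r"^(\S+)\s+(\d+)\s+examples\s*\|\s*(\d+)\s+tokens$")


def parse_coverage(text):
    """Return {group: (examples, tokens)} for every line matching the pattern."""
    rows = {}
    for line in text.splitlines():
        match = COVERAGE_RE.match(line.strip())
        if match:
            rows[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return rows


# Three rows copied from the coverage list above; paste the full list to total everything.
sample = """\
English 47 examples | 3947 tokens
Russian 47 examples | 3665 tokens
IndicOther 261 examples | 40630 tokens
"""

rows = parse_coverage(sample)
total_examples = sum(examples for examples, _ in rows.values())
total_tokens = sum(tokens for _, tokens in rows.values())
print(total_examples, total_tokens)  # 355 48242
```

Lines that do not match the pattern (headings, blank lines, the "Top token languages" entries) are skipped silently, which keeps the sketch tolerant of surrounding README text.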