Update README.md
README.md CHANGED
@@ -8,6 +8,8 @@ The tokenizer is trained with only Khmer/English. The corpus trained with approx
 
 The model card has a vocab size of 7,152 and its type is Byte Pair Encoding.
 
+Judging from the well-known tokenizers, non-English words are clearly under-represented in pretrained vocabularies. Long-text translation from one language to another is therefore close to impossible.
+
 text_example = "αααααααΆααααα»ααΆααΆαααααααααααααΆααα·α αα ααΆαααααααα·α αα αα ααααα·ααααα·ααΆα"
 
 [970, 273, 298, 420, 1583, 397, 284, 343, 259, 453, 397, 418, 1904, 259, 317]
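The README states the tokenizer's type is Byte Pair Encoding, and the bracketed list above is the token IDs it produces for the example text. As a rough illustration of how BPE learns the merges behind such a vocabulary (a toy sketch of the classic algorithm, not this model's actual training code — the corpus and merge count here are made up), the core loop looks like:

```python
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge(pair, vocab):
    # Rewrite every word so the chosen pair becomes a single symbol.
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):  # a real tokenizer runs this until the vocab size target
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge(best, vocab)

print(merges)
```

Encoding then replays these learned merges on new text and maps each resulting symbol to an integer ID, which is how the string example above becomes the ID list.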
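On the added point about non-English coverage: one concrete reason is that every Khmer code point costs three bytes in UTF-8, so a byte-level tokenizer with few Khmer merges spends several tokens per character, while ASCII English costs one byte each. A small sketch of that byte cost (the sample word ខ្ញុំ, "I" in Khmer, is my own illustration, not taken from the README):

```python
english = "hello"
khmer = "\u1781\u17d2\u1789\u17bb\u17c6"  # ខ្ញុំ ("I" in Khmer), 5 code points

# UTF-8 width: ASCII characters are 1 byte each, Khmer characters are 3.
print(len(english), "chars ->", len(english.encode("utf-8")), "bytes")
print(len(khmer), "chars ->", len(khmer.encode("utf-8")), "bytes")
```

Without dedicated Khmer entries in the vocabulary, those extra bytes surface as extra tokens, which is why a Khmer-specific vocabulary like this one helps.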